WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 11–17 November

OpenAI faces AI advancement slowdown, Near plans world’s largest open-source AI model, Microsoft adds AI to Notepad and Paint, AlphaFold3 goes open-source, Google accidentally previews Jarvis AI, and much more

Salvatore Raieli
20 min read
Photo by Fujiphilm on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field


Research

  • Project Sid: Many-agent simulations toward AI civilization. This work illustrates the behavior and evolution of societies composed of 10–1000+ AI agents. It introduces PIANO, an architecture that allows agents to interact with both humans and other agents in real time. The study reveals that agents can autonomously adopt specialized roles, follow and modify collective rules, and participate in cultural and religious transmissions.
  • Mixtures of In-Context Learners. Partitions demonstrations into subsets that serve as experts via in-context learning; a trainable weighting function then merges the experts’ next-token predictions, with the weights learned on the training set (a minimal sketch follows this list). This method is compatible with black-box LLMs, as it does not require access to their internal parameters. Key advantages include: 1) being competitive with standard ICL while offering much greater efficiency in terms of data, memory, and computation, and 2) demonstrating robustness to noisy demonstrations and label imbalance.
  • Attacking Vision-Language Computer Agents via Pop-ups. Demonstrates that incorporating adversarial pop-ups into current agent testing environments results in an attack success rate of 86%, reducing the agents’ task success rate by 47%. It also notes that simple defense methods, like instructing the agent to ignore pop-ups, prove ineffective.
  • Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models. Enhances LLM responses by simulating multiple experts and combining their outputs; it directs an LLM to complete input instructions by simulating several experts and then choosing the best response from both the individual and aggregated answers (see the orchestration sketch after this list). This approach sets a new state-of-the-art on TruthfulQA-Generation with ChatGPT, surpassing the previous record of 87.97%. Additionally, it improves factuality and usefulness while reducing toxicity and hurtfulness.
  • Number Cookbook: Number Understanding of Language Models and How to Improve It. Offers a thorough analysis of the numerical understanding and processing ability (NUPA) of LLMs; reveals that while naive finetuning significantly boosts NUPA on many tasks, it doesn’t work for all of them. It also finds that methods specifically developed to improve NUPA are ineffective when finetuning pre-trained models. The study examines the application of chain-of-thought techniques to NUPA and notes that these methods encounter scalability issues, limiting their practical use.
  • WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning. Introduces a self-evolving online curriculum RL framework aimed at closing the performance gap between open and proprietary LLM-based web agents. It boosts the success rate of Llama-3.1–8B from 4.8% to 42.4% and GLM4–9B from 6.1% to 43%, with the open models significantly outperforming GPT-4-Turbo (17.6%) and GPT-4o (13.9%). The framework addresses the limited availability of web agent training tasks using a robust outcome-supervised reward model for task success evaluation, while an adaptive RL strategy manages distribution drift in online learning, ensuring steady performance improvements.
  • Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation. Introduces a two-stage fine-tuning method in which LLMs first learn from tool-generated solutions and are then trained to decide when to solve problems independently versus using tools. Evaluations on benchmarks in math, climate science, and epidemiology demonstrate significant gains, with a 28% increase in accuracy and a 14% improvement in tool usage precision over top models like GPT-4 and Claude-3.5. This approach enables the LLM to flexibly handle scientific problems of varying complexity.
  • Google’s Flood Forecasting AI to Reach 700 Million People. Google is expanding riverine flood forecasting coverage to over 100 countries and 700 million people, and is enabling partners and researchers to better understand flood forecasting through more data and the development of a new API.
  • Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models. The Mixture-of-Transformers (MoT) architecture features a sparse multi-modal transformer that separates parameters based on modality (text, images, and speech), allowing for efficient processing while preserving performance. In various evaluations, such as Chameleon 7B and Transfusion settings, MoT matches or outperforms dense baselines, utilizing significantly fewer resources — only 37.2% of the FLOPs for speech processing and 47.2% of the wall-clock time for image generation.
  • Exploring the Alignment Landscape: LLMs and Geometric Deep Models in Protein Representation. This study investigates methods to enhance alignment between LLMs and protein-focused geometric deep models, aiming to improve cross-modal understanding.
  • Can LLMs Follow Threads Through Near-Million-Scale Haystacks? Large Language Models (LLMs) with extended context windows support a wider range of applications. Recent research on 17 top LLMs shows that although many can manage multiple information threads simultaneously, their practical context limits are often shorter than the stated maximum. While several models demonstrate “thread safety” by handling concurrent threads without a drop in performance, accuracy typically decreases as the context window approaches its upper limit.
  • Compressing Mesh Data for 3D Generation. Blocked and Patchified Tokenization (BPT), a mesh compression method, reduces mesh sequence length by about 75% and enables the generation of meshes with more than 8k faces.
  • Successor Feature Matching. Successor Feature Matching is a new non-adversarial method for inverse reinforcement learning that avoids learning an explicit reward function.
  • Oasis: A Universe in a Transformer. Oasis is a fully AI-generated, real-time open-world video game model powered by a 500M-parameter foundation model with no underlying game engine. It uses fast transformer inference to generate gameplay and is tailored for Etched’s Sohu ASIC to achieve high frame-rate efficiency. Despite showing great promise, issues such as long-context consistency and domain generalization remain.
  • OpenAI to present plans for U.S. AI strategy and an alliance to compete with China. OpenAI’s AI infrastructure blueprint suggests establishing AI economic zones and collaborating with the U.S. Navy on nuclear energy to promote AI-driven economic growth and innovation. The proposal features a North American AI alliance and initiatives modeled after the National Interstate and Defense Highways Act to address infrastructure demands. It stresses the importance of investing in U.S. data centers and energy projects to stay competitive with China.
  • Introducing Athene-V2: Advancing Beyond the Limits of Scaling with Targeted Post-training. Athene V2 consists of models built upon Qwen 2.5 72B, optimized for agentic and chat-based workflows, which outperform GPT-4o on several key benchmarks.
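As a companion to the Mixtures of In-Context Learners entry above, here is a minimal sketch of the idea: each expert is a subset of demonstrations placed in context, and a small set of trainable weights mixes the experts’ next-token distributions. The `next_token_logprobs` helper is a hypothetical stand-in for a black-box LLM call, not part of the paper’s code.

```python
# Minimal sketch, assuming a hypothetical black-box helper `next_token_logprobs`
# that returns a [vocab_size] log-probability vector for the next token.
# Only the mixture weights are trainable; the LLM itself is never updated.
import torch

def next_token_logprobs(prompt: str) -> torch.Tensor:
    """Hypothetical black-box LLM call (stand-in, not a real API)."""
    raise NotImplementedError

class MixtureOfInContextLearners(torch.nn.Module):
    def __init__(self, demo_subsets: list[list[str]]):
        super().__init__()
        self.demo_subsets = demo_subsets  # one subset of demonstrations per expert
        self.weight_logits = torch.nn.Parameter(torch.zeros(len(demo_subsets)))

    def forward(self, query: str) -> torch.Tensor:
        # Each expert sees only its own demonstrations in context.
        expert_logprobs = torch.stack([
            next_token_logprobs("\n".join(subset) + "\n" + query)
            for subset in self.demo_subsets
        ])  # [n_experts, vocab_size]
        weights = torch.softmax(self.weight_logits, dim=0)  # learned mixture weights
        # Mixture of the experts' next-token distributions, computed in log space.
        return torch.logsumexp(expert_logprobs + weights.log().unsqueeze(-1), dim=0)
```

The weights would be trained with a standard cross-entropy loss on held-out training examples, which is why the approach stays data- and compute-efficient compared with placing all demonstrations in a single prompt.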

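Similarly, the Multi-expert Prompting item can be pictured as a three-step orchestration: propose expert roles, answer once per role, then aggregate and select. The sketch below assumes a generic `llm(prompt) -> str` callable and only illustrates the flow, not the paper’s exact aggregation procedure.

```python
# Hedged sketch of a multi-expert prompting flow; `llm` is a hypothetical
# black-box chat-completion call, and the aggregation prompt is illustrative.
def llm(prompt: str) -> str:
    """Hypothetical black-box LLM call (stand-in, not a real API)."""
    raise NotImplementedError

def multi_expert_answer(question: str, n_experts: int = 3) -> str:
    # 1) Ask the model to propose relevant expert roles.
    roles_text = llm(f"List {n_experts} distinct expert roles best suited to answer: {question}")
    roles = [r.strip() for r in roles_text.splitlines() if r.strip()][:n_experts]
    # 2) Generate one answer per simulated expert.
    answers = [llm(f"You are {role}. Answer carefully:\n{question}") for role in roles]
    # 3) Aggregate the expert answers, then select the best of the aggregated and individual ones.
    listing = "\n\n".join(f"[{role}]\n{answer}" for role, answer in zip(roles, answers))
    return llm(
        f"Question: {question}\n\nExpert answers:\n{listing}\n\n"
        "Combine these answers, then return whichever of the combined or "
        "individual answers is most truthful and useful."
    )
```
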
News

Resources

  • FrontierMath. Epoch AI has introduced FrontierMath, a benchmark comprising expert-level mathematics problems to assess AI’s mathematical reasoning capabilities. Notably, leading AI models have solved less than 2% of these problems, highlighting the benchmark’s difficulty and the current limitations of AI in advanced mathematical reasoning.
  • BitNet a4.8: 4-bit Activations for 1-bit LLMs. A major challenge with 1.58-bit LLMs has been the absence of hardware acceleration support. This research introduces 4-bit activations to leverage the INT4/FP4 kernels available on new hardware, with no added runtime cost.
  • LLM2CLIP. LLM2CLIP combines CLIP’s visual and textual alignment with the advanced language understanding of LLMs.
  • Torch Compatible Muon Optimizer. Muon is the optimizer that set the GPT-2 training speed record. It is a momentum-based method similar to SGD. This repository provides an implementation that can easily be used as a drop-in replacement for AdamW (a hedged sketch of the core idea follows the Resources list).
  • Mochi video model with optimized inference. Mochi 1, an open-source text-to-video model, initially required eight H100 GPUs for operation. Thanks to community efforts, it can now run on a single 48GB L40 GPU without compromising quality.
  • A trainable PyTorch reproduction of AlphaFold 3. Protenix is a functional and trainable reproduction of AlphaFold 3, DeepMind’s protein folding project, developed by ByteDance’s ‘AI for Science’ team. This open-source initiative aims to advance protein structure prediction by providing a customizable platform for researchers.
  • LlamaPReview. LlamaPReview is an AI assistant for GitHub that provides easy one-click installation and automatically reviews pull requests with context-aware analysis. It supports various programming languages and integrates seamlessly with GitHub Actions, delivering insightful feedback directly on PRs. Offered for free, it improves code quality by detecting issues and recommending optimizations.
  • SmolLM2. Hugging Face’s SmolLM2 is a compact family of language models, ranging from 135M to 1.7B parameters, trained on 11 trillion tokens. These models are designed to run efficiently on-device and support various tasks. The weights are released under the Apache 2.0 license, and quantized versions, such as the 1.7GB and 138MB models, offer flexibility to meet different computational requirements (a loading sketch follows this list).
  • AI for Real-time Fusion Plasma Behavior Prediction and Manipulation. A novel multimodal machine learning approach improves super-resolution data, enabling better analysis of complex fusion plasma phenomena like Edge Localized Modes (ELM), and supports the stabilization of future fusion reactors.
  • A Comprehensive Survey of Small Language Models in the Era of Large Language Models. A review of small language models (SLMs), covering definitions, applications, improvements, reliability, and related concerns.
  • Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks. A new generalist multi-agent system capable of managing complex web and file-based tasks, featuring an Orchestrator agent that coordinates four specialized agents: WebSurfer for browser tasks, FileSurfer for file management, Coder for programming, and ComputerTerminal for console operations. Magentic-One performs competitively on various benchmarks, such as GAIA, AssistantBench, and WebArena, without needing any changes to its core architecture.
  • Personalization of Large Language Models: A Survey. Offers a comprehensive framework for understanding personalized LLMs, introducing taxonomies for various personalization aspects and consolidating existing research in personalized text generation and downstream applications.
  • StdGEN: Semantic-Decomposed 3D Character Generation from Single Images. StdGen is a novel approach for generating 3D characters from a single image. It breaks down the process into distinct components, such as hair and jackets, enhancing the overall quality of the output.
  • alphafold3. DeepMind has open-sourced the code and weights of AlphaFold 3 for academic research, marking a significant advancement in protein structure prediction. This release is expected to accelerate AI applications in scientific research, particularly in molecular biology and drug discovery.
  • Online-LoRA. Online-LoRA is a framework developed to mitigate catastrophic forgetting in online continual learning (OCL) by enabling real-time fine-tuning of pre-trained Vision Transformers (ViTs) without the use of rehearsal buffers.
  • DeepArUco++: Improved detection of square fiducial markers in challenging lighting conditions. DeepArUco++ presents a deep learning-based method for enhancing fiducial marker detection, especially in difficult lighting conditions where traditional techniques typically struggle.
  • Hermes 3. Hermes 3, fine-tuned from Llama 3.1, excels in both reasoning and creativity, showcasing outstanding performance across models with 8B, 70B, and 405B parameters. It introduces new possibilities in AI alignment and artificial consciousness.
  • ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis. EfficientNAT is an improved non-autoregressive Transformer model that boosts the speed and quality of token-based image synthesis.
  • UniGAD: Unifying Multi-level Graph Anomaly Detection. A novel framework for graph anomaly detection (GAD), UniGAD simultaneously detects anomalies in nodes, edges, and complete graphs.
  • Object and Attribute Matching in Images with Token Merging. Token Merging tackles a prevalent problem in text-to-image models: semantic binding, i.e., the failure to associate objects with their specific attributes.
  • DataChain. DataChain is a Pythonic data-frame toolkit for AI that enables effective processing and structuring of unstructured data without abstracting away the AI models. It integrates with tools such as PyTorch, TensorFlow, and LLM APIs to facilitate metadata creation, filtering, and vector search. The library also provides vectorized operations on Python object fields, out-of-memory computation, and parallelization.
  • browser-use. This open-source web automation library lets LLMs interact with websites through a streamlined interface. It is compatible with models such as Claude 3.5 Sonnet and GPT-4o. Key features include XPath extraction, customizable actions, and multi-tab management, enabling data extraction and smooth web navigation. One current limitation is message length, which affects task repetition and LLM speed; further development will focus on robustness and cost reduction.
  • CUDA Programming Course — High-Performance Computing with GPUs. A great course from freeCodeCamp on CUDA programming from start to finish.
  • Masked Token Modeling for Zero-Shot Anything-to-Drums Conversion. Zero-shot drum style transfer for any input rhythm presents an exciting music application for artists. This is achieved using a masked token modeling objective, which is particularly effective for audio.
  • HiCoM: Hierarchical Coherent Motion for Streamable Dynamic Scene with 3D Gaussian Splatting. HiCoM is a cutting-edge framework designed to enhance real-time 3D reconstruction from multi-view streaming videos. It effectively addresses key challenges in storage, training speed, and rendering quality, making it a significant advancement in the field.
  • Janus. Janus, DeepSeek’s multimodal model, has a new version incorporating rectified flows, similar to Meta Movie Gen, for image generation and understanding. The results are highly impressive.
  • Link Conversation with Reference Materials. Problem-Oriented Segmentation and Retrieval (POSR) is a method that breaks conversations into meaningful segments and connects each segment to relevant reference materials, such as worksheets or meeting notes.
  • MureObjectStitch: Multi-reference Image Composition. Researchers have presented an improved fine-tuning method for generative image composition, which seamlessly merges a specified foreground object with a new background to generate realistic images.
  • StoryTeller. StoryTeller is a system created to generate coherent descriptions for long videos, tackling issues like plot consistency and character tracking throughout different scenes.
  • SAMPart3D: Segment Any Part in 3D Objects. SAMPart3D, developed by the University of Hong Kong, is a robust method for segmenting 3D objects into semantically meaningful components.
  • Convolutional Differentiable Logic Gate Networks. Researchers have developed a method to train image recognition networks that are 29 times smaller and more efficient than traditional convolutional neural networks (CNNs) by making logic gates differentiable. They also provide efficient CUDA kernels with their paper release.
  • Physics Informed Distillation for Diffusion Models. Physics Informed Distillation (PID) is a method that employs a student model to simplify and accelerate diffusion models by framing them as solutions to differential equations.
  • MinerU: high-quality data extraction tool. MinerU is a robust tool built on StructTable-InternVL2–1B, enabling the extraction of information from PDFs into various machine-readable formats.
  • Isotonic regression. A powerful technique for fitting a monotonic function to data. It can also be differentiated, which makes it useful in a number of applications beyond curve fitting (a short usage example follows this list).
  • Text-to-SQL Query. XiYan-SQL is an innovative framework aimed at enhancing both the accuracy and diversity of SQL queries produced from natural language input.
  • X-Portrait 2: Highly Expressive Portrait Animation. ByteDance’s AI group has unveiled X-Portrait 2, an advanced portrait animation technology that transforms static images into highly expressive, realistic videos. Building upon its predecessor, X-Portrait, this new model excels in capturing subtle facial expressions and complex movements, such as pouting, tongue-out gestures, cheek-puffing, and frowning. It achieves high fidelity in emotion preservation, ensuring the generated videos maintain the subject’s identity and emotional nuances.
  • MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views. The MVSplat360 model offers a new way to create realistic 360° views of real-world scenes, even from just a few sparse images.
  • Improved Multi-Task Brain Tumour Segmentation with Synthetic Data Augmentation. This paper presents the leading approach for brain tumor segmentation in the BraTS challenge, demonstrating how synthetic data can improve AI models for medical imaging applications.
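To make the Torch Compatible Muon Optimizer entry above more concrete, here is a hedged sketch of the core idea as commonly described: a momentum buffer per 2-D weight matrix that is approximately orthogonalized with a few Newton–Schulz iterations before being applied. The coefficients and function names are illustrative assumptions, not the repository’s API; for real use, install the linked implementation and swap it in where you would use AdamW.

```python
# Hedged sketch of a Muon-style update, assuming the commonly described recipe:
# momentum SGD whose per-matrix update is approximately orthogonalized.
# Coefficients and names are illustrative, not the repository's actual API.
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2-D matrix to the nearest (semi-)orthogonal matrix."""
    x = g / (g.norm() + 1e-7)                # scale so singular values are <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x    # classic cubic Newton-Schulz iteration
    return x

@torch.no_grad()
def muon_like_step(param: torch.Tensor, grad: torch.Tensor,
                   momentum_buf: torch.Tensor,
                   lr: float = 0.02, momentum: float = 0.95) -> None:
    momentum_buf.mul_(momentum).add_(grad)              # standard momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalize the update direction
    param.add_(update, alpha=-lr)
```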
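For the SmolLM2 entry, a minimal loading sketch with the Hugging Face transformers library is shown below. The checkpoint identifier is my assumption of the hub id for the 1.7B instruct variant; check the SmolLM2 collection on the Hub for the exact names and the smaller variants.

```python
# Minimal sketch with Hugging Face transformers; the model id below is an
# assumption about the hub checkpoint name, so verify it before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Explain gradient descent in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```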
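Finally, the Isotonic regression entry is easy to try directly with scikit-learn: fit a non-decreasing step function to noisy data, a trick that also shows up in probability calibration.

```python
# Short example using scikit-learn's IsotonicRegression on noisy monotone data.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = np.log1p(x) + rng.normal(scale=0.3, size=x.shape)  # monotone trend plus noise

iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
y_fit = iso.fit_transform(x, y)  # non-decreasing piecewise-constant fit
print(y_fit[:5])
```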

Perspectives

Meme of the week

What do you think? Did any of this week’s news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository, which contains weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

Or you may be interested in one of my recent articles:

Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence