WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 10-16 March

Microsoft announces Dragon Copilot, an AI assistant for healthcare, CoreWeave signs $11.9 billion contract with OpenAI ahead of IPO, Musk may still have a chance to thwart OpenAI’s for-profit conversion, Spotify’s report reveals that most artists are not thriving financially, Google DeepMind unveils new AI models for controlling robots, and much more

21 min read · Mar 17, 2025

Photo by Nijwam Swargiary on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field

Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.

Research

  • The First Few Tokens Are All You Need. Researchers from Tencent AI Lab and The Chinese University of Hong Kong, Shenzhen propose a method to enhance reasoning in large language models (LLMs) by fine-tuning only the first few tokens of generated solutions. This approach focuses on Prefix Self-Consistency, where the initial tokens often share core reasoning steps, making fine-tuning on these prefixes effective. It uses Minimal Token Training, which reduces computational cost by up to 16 times compared to full-chain fine-tuning while maintaining reasoning structure. Despite being unsupervised, this method performs as well as or better than more computationally intensive methods. It works across various LLM architectures and can scale from small to large datasets, with the option to incorporate ground-truth checks to improve accuracy.
  • Cognitive Behaviors that Enable Self-Improving Reasoners. Researchers from Stanford University and collaborators examine why some language models excel in reinforcement learning (RL)-based self-improvement, while others plateau. They identify four key cognitive behaviors (verification, backtracking, subgoal setting, and backward chaining) that are crucial for problem-solving in both humans and models. The study finds that models exhibiting these behaviors, like Qwen-2.5-3B, perform better in RL tasks than those that don’t, like Llama-3.2-3B. Introducing cognitive behaviors through priming also boosts performance, with reasoning patterns playing a significant role. Curating pretraining data to emphasize these behaviors can enhance model performance, even for those initially underperforming. These cognitive behaviors also generalize to other reasoning tasks, suggesting that targeted priming and pretraining modifications can greatly improve a model’s ability for self-improvement.
  • Forecasting Rare Language Model Behaviors. A team from Anthropic and collaborators developed a method to predict rare failures that may only emerge at deployment scale, allowing developers to address issues early. They estimate the risk of undesirable behavior by sampling multiple outputs and measuring the likelihood of harmful responses, even from seemingly safe prompts. The study reveals that the probability of worst-case failures increases with the number of queries sampled, enabling prediction of extreme risks from smaller-scale tests. They introduce metrics like worst-query risk, behavior frequency, and aggregate risk, which can be extrapolated to larger-scale deployments. The approach also improves red-teaming by identifying which models or sampling strategies are most effective at uncovering potential failures, optimizing resources before models face billions of queries.
  • Differentiable Logic Cellular Automata. A team from Google’s Paradigms of Intelligence introduces a discrete version of Neural Cellular Automata (NCA) by replacing floating-point layers with Differentiable Logic Gate Networks. This approach uses binary vectors for each cell’s state, updated by learned logic circuits, enabling interpretable local rules and end-to-end differentiable training. Unlike traditional NCAs that rely on continuous neurons, this model uses learnable AND/OR/XOR gates, converted to binary gates for inference. The system successfully replicates Conway’s Game of Life and can generate complex patterns like checkerboards and images. It also demonstrates fault tolerance and supports asynchronous updates. This discrete, interpretable framework shows promise for robust, flexible computing in areas like programmable matter.
  • How Well do LLMs Compress Their Own Chain-of-Thought? This paper explores how large language models (LLMs) balance the length of chain-of-thought (CoT) reasoning with accuracy. It introduces the concept of token complexity, which represents the minimum number of tokens required to solve a problem correctly. The study shows that various CoT “compression prompts,” like “use bullet points” or “remove grammar,” follow the same universal accuracy-length trade-off curve, indicating that reasoning length, not formatting, primarily influences accuracy. The authors also highlight that if the CoT falls below the token complexity threshold, the model fails. They propose that CoT compression can be seen as a “lossy coding” problem, with current prompting methods far from theoretical limits, leaving room for improvement. The optimal approach would involve adapting CoT length based on problem difficulty, using fewer tokens for easier tasks and more detailed reasoning for complex ones.
  • LADDER: Self-Improving LLMs Through Recursive Problem Decomposition. LADDER is a framework that enables LLMs to recursively generate and solve simpler versions of complex problems, improving math integration accuracy. It allows models to autonomously create easier problem variants and use reinforcement learning with a verifier, establishing a self-guided learning process without needing human feedback or curated datasets. The framework introduces Test-Time Reinforcement Learning (TTRL), where problem variants are generated during inference, refining solutions on simpler tasks to increase final accuracy (e.g., improving from 73% to 90% on the MIT Integration Bee). LADDER uses generalizable numeric verification, allowing its application in fields like code testing or theorem proving, where straightforward checks are available.
  • Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems. This paper introduces Agentic Reward Modeling, a new reward framework that combines human preference models with “verifiable correctness” signals to provide more reliable rewards for training and evaluating LLMs. The framework uses a modular system, REWARDAGENT, which includes a router to determine necessary checks (e.g., factual accuracy, adherence to instructions), specialized verification agents for factual checks, and a judger that merges these signals with human preference scores. The system improves factual accuracy by comparing candidate responses and verifying differences through evidence, reducing costs. It also ensures instructions are followed by using auto-generated Python scripts for constraint checking, penalizing violations. REWARDAGENT outperforms existing models on challenging benchmarks and real-world tasks, offering significant improvements in accuracy and reliability when used for best-of-n search or DPO training.
  • Fractal Generative Models. Researchers from MIT CSAIL and Google DeepMind introduce a fractal-based framework for generative modeling, where atomic generative modules are used recursively. This approach, which abstracts autoregressive models into modular units, efficiently handles high-dimensional data like raw pixels. The fractal method achieves state-of-the-art performance on ImageNet 64×64, outperforming previous methods, and can generate high-quality 256×256 images. It also allows for tasks like inpainting and semantic replacement. The design reduces computational costs, making pixel-by-pixel generation feasible at larger resolutions, and is open-sourced for wider use.
  • Visual RFT. One of the trends is the use of simple verifiable rewards and scaled reinforcement learning. This paper successfully applies that approach to vision-language models.
  • Implicit Reasoning in Transformers is Reasoning through Shortcuts. This project examines how language models perform step-by-step implicit reasoning but face challenges with generalization, particularly when handling expressions that include variables as subtrahends.
  • Generalized discrete diffusion. This work expands diffusion on discrete data, like text, by introducing a generalized denoising process and a slightly enhanced masking scheme. This combination improves training efficiency and enables the model to correct its own output.
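
To make the intuition behind the Forecasting Rare Language Model Behaviors item concrete, here is a back-of-the-envelope sketch of why a failure that looks negligible in a small test becomes almost certain at deployment scale. It assumes every query fails independently with the same probability, which is a simplification chosen for illustration and not the estimator used in the paper; the function name and the example numbers are hypothetical.

```python
import math


def prob_at_least_one_failure(p_per_query: float, n_queries: int) -> float:
    """Probability that at least one of n independent queries fails.

    Assumption (illustrative only): every query fails independently
    with the same probability p_per_query.
    """
    if p_per_query <= 0.0 or n_queries <= 0:
        return 0.0
    if p_per_query >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, written with log1p/expm1 so that very small p
    # does not get lost to floating-point rounding.
    return -math.expm1(n_queries * math.log1p(-p_per_query))


# A hypothetical one-in-ten-million failure at three deployment scales.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} queries -> {prob_at_least_one_failure(1e-7, n):.6f}")
```

Under this toy model the risk grows monotonically with the number of queries, which is the qualitative effect the summary describes.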
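
The LADDER item mentions numeric verification of integration answers. One plausible way such a check could work is to differentiate the candidate antiderivative numerically and compare it with the integrand at random points. The sketch below follows that idea; the function name, sampling interval, step size, and tolerance are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def numerically_verifies(candidate, integrand, lo=-1.0, hi=1.0,
                         n_points=20, h=1e-5, tol=1e-4, seed=0):
    """Check that candidate'(x) matches integrand(x) at random sample points.

    Hypothetical verifier: the derivative is estimated with a central
    difference, so the check is approximate by construction.
    """
    rng = random.Random(seed)
    for _ in range(n_points):
        x = rng.uniform(lo, hi)
        derivative = (candidate(x + h) - candidate(x - h)) / (2 * h)
        target = integrand(x)
        if abs(derivative - target) > tol * (1 + abs(target)):
            return False
    return True


def integrand(x):
    return x * math.cos(x)


def good_answer(x):
    # Correct antiderivative of x*cos(x), up to a constant.
    return x * math.sin(x) + math.cos(x)


def bad_answer(x):
    # Missing the cos(x) term, so its derivative is off by sin(x).
    return x * math.sin(x)


print(numerically_verifies(good_answer, integrand))
print(numerically_verifies(bad_answer, integrand))
```

A check of this kind needs no human labels, which is what makes it usable as a reward signal in a self-guided loop.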

News

Resources

  • Smalldiffusion. A minimal, readable, and performant toolkit for training and sampling from diffusion models.
  • LLM Post-Training: A Deep Dive into Reasoning Large Language Models. This survey examines methods for improving LLMs post-pretraining, including fine-tuning, reinforcement learning, and optimizing inference techniques. It also addresses challenges such as catastrophic forgetting, reward manipulation, and ethical concerns, providing a guide for developing more reliable and advanced AI systems.
  • Crossing the uncanny valley of conversational voice. Researchers from Sesame introduce a multimodal TTS approach designed for natural, context-aware speech in real-time conversational AI. Unlike traditional TTS, which lacks contextual awareness, this method addresses the “one-to-many” problem by conditioning on conversation history, speaker identity, and prosodic cues. The end-to-end model uses Residual Vector Quantization (RVQ) tokens and two autoregressive transformers for efficiency and expressivity, with a lightweight decoder to reduce computational load. Despite training on only a fraction of frames, the model maintains high fidelity. Evaluations show near-human accuracy in word error rates and speaker similarity, with scaling improving speech realism. However, challenges remain in capturing nuanced human prosody in conversational contexts. The team plans to release their models open-source and expand to more languages while refining conversational dynamics.
  • Applications of Large Models in Medicine. Medical AI is progressing beyond basic diagnostics, with large models reshaping healthcare. A recent paper categorizes Medical Large Models (MedLMs) into clinical text analysis, medical imaging, anatomical representation, and multimodal systems. It also explores Large Graph Models, which can interpret complex biomedical relationships, offering significant potential. These models are improving diagnostic accuracy and transforming treatment planning and drug discovery. While the medical field has been cautious about AI, these advancements suggest we may be nearing a tipping point where their clinical value becomes undeniable.
  • Deriving Muon. Adam has been the leading optimizer in deep learning for years. However, recently, the community has discovered that Muon could be a promising alternative. It achieves many of the same results as muP without needing any changes to the model. This post outlines some of the theoretical reasons behind the optimizer.
  • Optimal Hyperparameter Scaling Law in Large Language Model Pretraining. Step Law is a comprehensive optimal hyperparameter scaling law that applies to various model structures, architectures, and data distributions. This allows predictions about how models are likely to perform before the training process even begins.
  • Time-Series Forecasting. SeqFusion is a framework that sequentially chooses and combines pre-trained models for zero-shot forecasting. In contrast to traditional methods, it reduces data usage to improve privacy while still delivering strong accuracy across various temporal patterns.
  • Distractor Aware SAM. Segment Anything (SAM) is a top-tier model for visual analysis and segmentation. However, it can struggle when two similar-looking objects appear in a video. This new approach addresses these “distractors” by incorporating extra memory augmentation and training.
  • Autoregressive Streaming Text-to-Speech Model for Any LLM. A compact 30 million parameter model designed to enhance any language model, enabling it to comprehend and generate speech in response to general queries. Importantly, it doesn’t require adjustments to the base model, making it easily transferable across different models.
  • Federated Learning for Neural Feedforward Control. This project presents a federated learning-based method for neural feedforward control, enabling multi-agent systems to enhance tracking performance while maintaining data privacy.
  • Gemini Embedding Model. The Gemini team has developed and released an outstanding embedding model for text. It leads in benchmark performance, is cost-effective, and offers excellent speed.
  • Token-Efficient Long Video Understanding for Multimodal LLMs. Most video understanding models process individual frames, which makes addressing temporal questions challenging. STORM, which leverages Mamba adapters, introduces temporal attention operations. This post compares it with Qwen models.
  • Video Painter. A new video inpainting model, VideoPainter, effectively integrates background information, supports videos of any length, and utilizes a dedicated dataset and benchmark for training and evaluation. Its design goes beyond basic inpainting, offering potential for advanced video manipulation and the generation of related training data.
  • Detecting misbehavior in frontier reasoning models. This report from OpenAI discusses monitoring the chain of thought in advanced reasoning models. Frontier reasoning models take advantage of loopholes when possible, and the report demonstrates that an LLM can be used to detect these exploits in their chains of thought. However, penalizing their “bad thoughts” doesn’t eliminate most misbehavior; it simply causes the models to conceal their intentions.
  • Flying Safer: Obstacle Avoidance for Fast Drones. This repository includes the implementation of a lightweight deep reinforcement learning-based collision avoidance system for fixed-wing unmanned aerial vehicles (UAVs), using AirSim and JSBSim.
  • Teaching Language Models to Solve Sudoku Through Reinforcement Learning. This research investigates training AI language models to solve Sudoku puzzles using reinforcement learning, specifically applying Group Relative Policy Optimization (GRPO) to models like Qwen 2.5, without the need for external data or larger model distillation. A multi-faceted reward system was developed, focusing on correct answer formatting, proper grid structure, and accurate solutions, to help the models learn the logical rules and spatial reasoning required for Sudoku, transforming them from text predictors to structured problem-solvers.
  • Hugging Face Expanding LeRobot Platform. Hugging Face and Yaak have released L2D, the largest open-source multimodal dataset for automotive AI. It features driving policies from both experts and students collected from driving schools, along with natural language instructions to improve spatial intelligence models.
  • MovieAgent: Automated Movie Generation via Multi-Agent CoT Planning. This system combines multiple generative modalities and employs persona-based prompting to promote consistency and accuracy. It then utilizes the Stable Diffusion video model to generate and assemble frames. This process could likely be enhanced with the use of Wan.
  • Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning. Building embedding models for vision and language tasks using a contrastive loss often causes these models to struggle with hard negative pairs. This work introduces a regularization strategy and reports significant improvement in challenging retrieval tasks. The method also scales effectively for zero-shot video retrieval.
  • YoloE: real-time open vocabulary detection. Small vision models can be prompted in various ways for open vocabulary detection. This allows the use of classes, images, and text to guide the model on what to detect. Notably, it operates at 300 fps, making it suitable for real-time applications.
  • Perception efficient reconstruction. Another approach combines textual query capabilities with 3D reconstruction from images. This specific system utilizes a feed-forward model for fast reconstruction.
  • DeepMind’s Image-Text Model. DeepMind has unveiled TIPS, an innovative image-text model designed for dense and global vision tasks. By combining contrastive learning with masked image modeling and leveraging synthetic captions, it demonstrates strong spatial awareness and surpasses existing methods in several benchmarks.
  • The TechCrunch AI glossary. This article defines key AI terminology, such as “AI agents,” chain-of-thought reasoning, deep learning, and large language models (LLMs). Deep learning is explained as a subset of machine learning inspired by neural pathways in the human brain, while LLMs are described as neural networks that power AI assistants like ChatGPT. The article also covers fine-tuning and the role of weights in optimizing AI models.
  • Gemma 3: Google’s new open weights model. Google has released the weights and a technical report for the Gemma 3 model. Available in four sizes, this model performs similarly to Gemini 1.5 Pro and has a robust understanding of over 140 languages. It appears to be close to the state-of-the-art in dense models.
  • Gemma 3 technical report. The technical report for the Gemma 3 model.
  • Predicting Future Features in Diffusion Models. TaylorSeer introduces a technique to predict future timestep features in diffusion models using Taylor series expansion, greatly minimizing errors in feature caching.
  • Super-Resolution with Token Aggregation. CATANet enhances image super-resolution by aggregating long-range content-similar tokens.
  • Multimodal Motion Generation. Motion Anything improves conditional motion generation by introducing Attention-based Mask Modeling for better spatial and temporal control.
  • Dual-Stream Video Inpainting. VideoPainter introduces a dual-stream architecture for video inpainting, greatly simplifying learning complexity while enhancing background preservation and object generation. It also offers VPData and VPBench, the largest video inpainting dataset and benchmark to date.
  • PromptPex. PromptPex is a developer tool that treats prompts as functions and automatically generates test inputs, allowing for systematic unit testing of AI model prompts.
  • Inductive Moment Matching. A well-executed unification and simplification of diffusion models for continuous data. They employ a new moment matching framework to develop a few-step generative model.
  • OpenR1 Update 3. The Hugging Face team has announced the next update for its open replication of the DeepSeek reasoning model. The team found that smaller models can outperform larger ones when fine-tuned specifically for competitive coding.
  • Rotate LoRA for Wan. A LoRA for Wan, a generative video model that has recently been open sourced.
  • Flat Color LoRA. Another LoRA for the Wan video model.
  • Small Reasoning Models with Two-Stage Rule-Based RL. By dividing the reasoning training process into two steps, this work demonstrates that small models can still generalize across tasks and even different modalities.
  • Cohere Command A Model. Cohere has trained and released an open weights model with 111B parameters. It performs exceptionally well in agentic, multilingual, and coding tasks. Additionally, it has been specifically optimized for enterprise applications like retrieval.
  • Generate Motion for Arbitrary Characters. AnyMoLe generates motion between frames for arbitrary characters using video diffusion models, removing the need for character-specific datasets.
  • Multimodal Representation Learning. MMRL improves few-shot adaptation of vision-language models by introducing a shared representation space, enhancing multi-modal interactions while preserving generalization.
  • Audio Flamingo 2. A new state-of-the-art audio understanding model built on Qwen with almost entirely synthetic data.
  • Agent S2: An Open, Modular, and Scalable Framework for Computer Use Agents. Agent S2 is a robust and open computer use system. It has achieved state-of-the-art performance in browsers, system operations, and even mobile tasks.
  • Unified Visual Decoding. REF-VLM unifies visual decoding tasks in multimodal LLMs using a structured triplet-based representation.
  • Open-Sora: Democratizing Efficient Video Production for All. The Open Sora initiative, which has been ongoing since the original launch of the model, has developed a competitive model for under $200k. It has released all the code and weights to allow others to reproduce the results. The motions are impressive, though not entirely state-of-the-art.
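
The Sudoku item describes a multi-faceted reward covering answer formatting, grid structure, and solution accuracy. A minimal sketch of how such a reward could be composed is shown below. The `<answer>` tag convention, the 0.2 / 0.3 / 0.5 weights, and the function name are assumptions made for illustration; the original write-up may use different components and weights.

```python
import re

# Hypothetical output convention: the grid is wrapped in <answer> tags,
# one row per line, cells separated by spaces.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
DIGITS = set("123456789")


def sudoku_reward(completion: str, solution: list[list[int]]) -> float:
    """Score a completion on format, grid structure, and cell accuracy."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # no parseable answer at all
    reward = 0.2  # format: the answer tags are present

    rows = [line.split() for line in match.group(1).strip().splitlines()]
    well_formed = len(rows) == 9 and all(
        len(row) == 9 and all(cell in DIGITS for cell in row) for row in rows
    )
    if not well_formed:
        return reward
    reward += 0.3  # structure: nine rows of nine digits

    correct = sum(
        int(rows[r][c]) == solution[r][c] for r in range(9) for c in range(9)
    )
    reward += 0.5 * correct / 81  # accuracy: fraction of matching cells
    return reward


# Usage with a valid grid built from a standard shifting pattern.
solution = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
grid_text = "\n".join(" ".join(str(v) for v in row) for row in solution)
print(sudoku_reward(f"<answer>\n{grid_text}\n</answer>", solution))
```

Giving partial credit for format and structure lets a model earn some signal before it can solve any puzzle, which is one common motivation for splitting a reward this way.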
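
The TaylorSeer item describes predicting future timestep features with a Taylor series expansion instead of simply reusing a cached feature. The sketch below illustrates the general idea on scalar values, approximating derivatives with backward finite differences over the cached history. Real diffusion features are tensors, and the function name, the unit step size, and the choice of backward differences are assumptions made here for illustration.

```python
def taylor_forecast(history, steps_ahead, order=2):
    """Extrapolate a feature value `steps_ahead` timesteps past the last cached one.

    history: cached values at consecutive, evenly spaced timesteps, most
    recent last. Derivatives are approximated by backward finite differences,
    so the result is an approximation except for low-degree trends.
    """
    if len(history) < order + 1:
        raise ValueError("need at least order + 1 cached values")

    # diffs[k] holds the k-th order backward differences of the history.
    diffs = [list(history)]
    for _ in range(order):
        prev = diffs[-1]
        diffs.append([b - a for a, b in zip(prev, prev[1:])])

    prediction = 0.0
    factorial = 1
    for k in range(order + 1):
        if k > 0:
            factorial *= k
        # k-th Taylor term: derivative estimate * step^k / k!
        prediction += diffs[k][-1] * steps_ahead ** k / factorial
    return prediction


cached = [1.0, 3.0, 5.0]  # a feature that grows by 2 per timestep
print(taylor_forecast(cached, steps_ahead=2, order=0))  # plain cache reuse
print(taylor_forecast(cached, steps_ahead=2, order=2))  # Taylor-style forecast
```

With `order=0` the function degenerates to reusing the last cached value, which makes the contrast with a higher-order forecast easy to see.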

Perspectives

Meme of the week

What do you think about it? Did any news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with me or reach me on LinkedIn, where I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

or you may be interested in one of my recent articles:

Salvatore Raieli

Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence