WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 7–13 April

21 min read · Apr 15, 2025
Photo by Markus Winkler on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first in GitHub. All the Weekly News stories are also collected here:


Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.

Research

  • Rope to Nope and Back Again: A New Hybrid Attention Strategy. Llama 4’s breakthrough in handling over 10 million tokens in context comes from alternating between layers with no positional embeddings (NoPE) and layers with rotary positional embeddings (RoPE). Although current benchmarks are limited to Needle in a Haystack, they strongly suggest the effectiveness of this alternating-layer approach.
  • Inference-Time Scaling for Generalist Reward Modeling. This DeepSeek paper explores using inference-time scaling to improve reward modeling as a way to develop stronger reasoners. It suggests a larger plan by the Chinese start-up to leverage its current reasoning models as a foundation for building the next wave of reward models to train future reasoners.
  • CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation. Researchers at AI2 introduce CodeScientist, a system that autonomously generates and tests scientific hypotheses through code-based experimentation, making validated discoveries with minimal human input. CodeScientist reviews research papers and designs experiments using Python code blocks, following a five-step pipeline: Ideation, Planning, Code Execution, Reporting, and Meta-Analysis. From 50 AI research papers, it proposed 19 findings, with 6 deemed scientifically sound, including insights like the mismatch between LLM confidence and accuracy, the benefit of simpler states for better predictions, and the advantage of graph memory in simulations. While full automation is possible, human feedback enhances the quality of results. Despite successes, over half of experiments fail due to code errors, highlighting the need for peer review and more rigorous methodologies.
  • One-Minute Video Generation with Test-Time Training. This study presents Test-Time Training (TTT) layers with rich hidden states to address the shortcomings of traditional Transformers and models like Mamba in producing long, coherent videos. By adding TTT layers to a pre-trained model, it achieves one-minute video generation from text storyboards that significantly surpass baseline methods in conveying complex narratives, based on human evaluations. Tom and Jerry cartoons serve as the test environment.
  • Scaling Analysis of Interleaved Speech-Text Language Models. This study shows that speech-language models initialized from text models using interleaved training scale more efficiently than models trained solely on speech.
  • Retrieval-Augmented Reasoning Model. RARE introduces a new approach for training domain-specific LLMs focused on reasoning rather than memorization. Inspired by Bloom’s Taxonomy, it emphasizes applying and evaluating knowledge rather than merely recalling facts. RARE separates domain knowledge, retrieved externally, from domain thinking, learned during training, enabling better performance within limited parameter budgets. By using an open-book approach, it injects retrieved knowledge into training prompts, fostering reasoning patterns. This method outperforms standard SFT and RAG, especially in medicine, with small models like Llama-3.1-8B and Qwen-2.5-7B achieving up to 20% higher accuracy on medical QA benchmarks. RARE also uses distillation and adaptive retries to refine outputs, and it integrates retrieval during training to shape reasoning, replacing memorization with application.
  • A New Batch Normalization. This paper proposes a new batch normalization method for SPD manifolds that uses a learnable Generalized Bures-Wasserstein metric.
  • How Students Use Claude in Education. Anthropic studied one million student conversations to explore AI use in education, finding that STEM students are the primary users, mainly using Claude for content creation, solving technical problems, and tackling advanced learning tasks.
  • Why do LLMs Attend to First Token? This paper explains why LLMs tend to focus attention on the first token, a phenomenon called an attention sink. The theory suggests it prevents representational collapse in deep Transformers. Long contexts and deep layers can lead to over-mixing, causing similar embeddings for all tokens, but attention sinks act as no-ops to preserve representation diversity. Experiments on Gemma 7B and LLaMa 3.1 models show that attention heads fixate on the first token, with larger models requiring stronger sinks. Sinks form naturally due to the token’s position, not its content, and removing the ⟨bos⟩ token after training leads to performance collapse. The paper connects this behavior to Jacobian norm bounds, demonstrating that sinks reduce sensitivity to token changes, and reveals that some attention heads use ⟨bos⟩ as a default unless triggered by specific patterns.
  • MedAgentSim: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions. MedAgentSim is an open-source, fully automated hospital simulation where LLM-powered agents simulate doctor-patient interactions in dynamic diagnostic settings. Unlike static QA benchmarks, it involves multi-turn consultations, lab and imaging requests, and iterative diagnosis refinement. The system improves through memory and reflection, using past cases and chain-of-thought reasoning to enhance performance over time. Users can choose to control the doctor or patient agents, and the simulation, built with a 2D game engine, allows for interaction with virtual medical tools. MedAgentSim outperforms baseline setups by 6–37% across several benchmarks, particularly in vision-language tasks, and its bias analysis highlights the importance of cognitive and implicit bias-aware evaluation.
  • Z1: Efficient Test-time Scaling with Code. Z1 is a new method designed to make LLMs more compute-efficient during reasoning at test time. It involves training LLMs on both short and long code-based reasoning trajectories, then adjusting reasoning depth dynamically during inference. The Z1-Code-Reasoning-107K dataset pairs simple and complex coding problems to teach the model when to stop reasoning. A novel test-time strategy, the Shifted Thinking Window, adapts the reasoning token budget based on problem complexity, enabling shallow reasoning for simple tasks and deeper reasoning for complex ones. Z1-7B achieves efficiency gains, matching the performance of R1-Distill-Qwen-7B while using only 30% of the reasoning tokens. Despite being trained on code-based reasoning, Z1 generalizes well to other domains, outperforming other 7B models across multiple benchmarks. Ablation studies show that longer reasoning paths and larger training sample sizes improve inference quality and accuracy.
  • Inside-Out: Hidden Factual Knowledge in LLMs. This study presents a framework to measure hidden knowledge in LLMs, revealing that models store significantly more factual information internally than they express in outputs, with a difference of up to 40%. It also finds that some answers, while internally known, are never generated, highlighting limitations in test-time sampling for QA tasks.
  • Photonic chips provide a processing boost for AI. Computer processors that exploit both electricity and light could improve the performance of artificial-intelligence systems while consuming less energy.
  • AI Scientist v2. Sakana AI had a research paper accepted to an ICLR workshop that was entirely generated, executed, and written by a language model system. They enhanced the system with vision-language models, broader search capabilities, and other improvements.
  • Dynamic Knowledge Circuits. This research investigates how LLMs internalize new knowledge by examining computational subgraphs, uncovering patterns in knowledge acquisition, training optimization phases, and offering insights for enhancing continual pre-training strategies.
  • Concept Attention. A novel method to view concepts in the attention map of neural networks.
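
To make the alternating-layer idea from the "Rope to Nope and Back Again" item above more concrete, here is a minimal pure-Python sketch. It is not Llama 4's implementation: the function names, the `nope_every=4` schedule, and the toy vectors are all assumptions chosen for illustration. It only shows the property that matters: in a RoPE layer the query/key score depends on the relative offset between positions, while in a NoPE layer it ignores position entirely.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    if len(vec) % 2 != 0:
        raise ValueError("RoPE needs an even-sized vector")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i//2 uses frequency base^(-i/dim), the usual RoPE schedule.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def layer_uses_rope(layer_idx, nope_every=4):
    """Hypothetical schedule: every `nope_every`-th layer skips positional encoding."""
    return (layer_idx + 1) % nope_every != 0


def attention_score(q, k, q_pos, k_pos, layer_idx):
    """Unnormalised dot-product score for one query/key pair in a given layer."""
    if layer_uses_rope(layer_idx):
        q = apply_rope(q, q_pos)
        k = apply_rope(k, k_pos)
    return sum(a * b for a, b in zip(q, k))


# Toy query and key vectors (illustrative values only).
q = [0.5, -1.0, 0.25, 2.0]
k = [1.0, 0.5, -0.75, 0.1]

# RoPE layer (index 0): same relative offset of 2, different absolute positions.
rope_near = attention_score(q, k, 5, 3, layer_idx=0)
rope_far = attention_score(q, k, 105, 103, layer_idx=0)

# NoPE layer (index 3): positions are never used.
nope_a = attention_score(q, k, 5, 3, layer_idx=3)
nope_b = attention_score(q, k, 900, 1, layer_idx=3)

print(abs(rope_near - rope_far) < 1e-9, nope_a == nope_b)
```

In this sketch the RoPE layers carry the relative-position signal, and the NoPE layers attend purely on content. The summary above credits this alternation for the long-context behaviour, but the actual layer ratio is not stated there.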

News

Resources

  • Unsupervised Panoptic Segmentation. CUPS is a novel approach to panoptic segmentation that requires no labeled data, using depth and motion cues to learn directly from scene-centric images.
  • Generative Modeling for Crystals. CrystalFormer is a transformer-based model that creates crystal structures by leveraging space group symmetry, enhancing efficiency and reducing data requirements in crystal generation.
  • Nano Aha Moment. A single-file, single-GPU, from-scratch full-parameter tuning library that replicates DeepSeek R1-Zero-style training.
  • Object Counting. A fully automated zero-shot object counting approach that uses feature maps and self-attention mechanisms, achieving state-of-the-art results on the FSC147 dataset.
  • DeepSeek 1.58bit GGUF. The Unsloth team identified which parts of the new R1 model can be effectively quantized, noting some tokenizer quirks that complicate the process. In short, only the MoE layers are quantized to 1.58 bits, while the rest stay at 4 or 6 bits using their dynamic quantization approach.
  • Granite Speech 8B. IBM silently launched a state-of-the-art speech recognition and understanding model based on its Granite series.
  • Start building with Gemini 2.5 Pro. Google’s Gemini 2.5 Pro is now in public preview via the Gemini API on Google AI Studio, with Vertex AI availability coming soon.
  • Benchmarking Web Agent Capabilities. Online-Mind2Web is a practical evaluation benchmark for autonomous web agents, revealing that current models underperform compared to prior assumptions due to issues with earlier benchmarks.
  • VarGPT. A unified autoregressive model that handles both understanding and synthesis tasks, enabling it to generate images as well as produce captions.
  • FlexTok: Resampling Images into 1D Token Sequences of Flexible Length. Apple’s open-source release builds on its recent paper, introducing a method to tokenize images using a variable number of tokens, allowing more complex images to be represented with more tokens.
  • ZClip: Adaptive Spike Mitigation for LLM Pre-Training. ZClip employs EMA-based gradient norm statistics to dynamically suppress outlier gradients, avoiding loss spikes and enhancing training stability without relying on fixed thresholds.
  • Goku Video Model. Goku from ByteDance is a family of flow-based video generation models with 2B and 8B parameters, trained on 160M image-text pairs and 36M video-text pairs.
  • AI Index 2025: State of AI in 10 Charts. A clear, high-level, and thorough overview in 10 charts capturing the current landscape of AI, covering models, funding, and associated costs.
  • Benchmarking Open Source models for OCR. OCR involves recognizing text within images — a task that’s difficult in rare cases but highly valuable when accurate. While closed models like the Gemini series excel at it, the latest Llama 4 models significantly advance the performance of open-source alternatives.
  • DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level. Together AI has developed a coding model that rivals closed-source reasoning models. They’ve released the data, code, and training recipes, highlighting the model’s impressive long-context capabilities.
  • Hi Dream Image Generation Model. A powerful 17B parameter image generation model that leverages four distinct text encoders for generation, delivering strong overall results and released under a permissive license.
  • A Framework for Dynamic Multi-Product Pricing. This paper presents a new dynamic multi-product pricing framework based on a censored multinomial logit model, where buyers only evaluate products priced below their personal valuations.
  • MotifBench for Protein Design. MotifBench is a benchmark for computational protein design, centered on the motif-scaffolding challenge by finding protein structures that support and stabilize specific target motifs.
  • Arabic AI Benchmarks. Inception and MBZUAI have introduced a unified Arabic AI evaluation platform, featuring refreshed AraGen benchmarks and a new instruction-following leaderboard based on the Arabic IFEval benchmark.
  • 17K reasoning traces from R1. A great set of reasoning traces from R1 that can be used as training data to distill a smaller reasoner or kick start the RL process.
  • How Google Built the Pixel’s Add Me Feature. The “Add Me” feature on Pixel devices leverages advanced image segmentation and AI for personalized video experiences.
  • PaperBench: Evaluating AI’s Ability to Replicate AI Research. OpenAI introduces PaperBench, a benchmark to evaluate whether AI agents can replicate cutting-edge machine learning research papers from scratch. The challenge requires agents to understand papers, build codebases, and run experiments to match results, with each paper accompanied by a detailed rubric. Evaluation is done using an LLM-based judge that scores with high agreement to human experts. The highest score was 21.0% by Claude 3.5 Sonnet, with no model surpassing 26.0%. ML PhDs scored 41.4% on a subset in 48 hours, showing humans still outperform in long-term tasks. A simplified Code-Dev version showed better results for o1 (43.4%). Models often struggled with early failure, lack of planning, and iteration, highlighting the importance of proper prompting and scaffolding.
  • Command A: An Enterprise-Ready Large Language Model. Cohere introduces Command A, a 111B parameter open-weights LLM designed for enterprise tasks like RAG, agents, code, and multilingual applications. Command A uses a decentralized training pipeline where expert models are fine-tuned for specific domains and then merged, maintaining most expert performance with a minimal drop. Its hybrid architecture improves long-context efficiency, supporting 256k contexts with lower memory usage, and it outperforms peers in long-context benchmarks. Command A excels in agentic capabilities, surpassing GPT-4o and Claude 3.5 in multiple tests. It leads in real-world generative tasks and RAG use cases, with top scores in multilingual tasks, including dialect alignment and language consistency. The model also undergoes alignment with SRPO and RLHF, showing significant improvements in human alignment. Despite its size, Command A is efficient, running on just 2×A100s or H100s and generating 156 tokens/sec. Model weights are openly available on Hugging Face.
  • Open Deep Search: Democratizing Search with Open-source Reasoning Agents. Researchers from Sentient, UW, Princeton, and UC Berkeley introduce Open Deep Search (ODS), an open-source AI framework that competes with proprietary systems like GPT-4o Search Preview and Perplexity Sonar. ODS consists of two components: the Open Search Tool, which refines web results through query rephrasing and reranking, and the Open Reasoning Agent, which orchestrates tool usage to answer queries. ODS-v2, built on DeepSeek-R1, outperforms GPT-4o Search Preview by 9.7% on FRAMES and offers better cost-efficiency. It also surpasses Perplexity Sonar on complex reasoning tasks. The addition of CodeAct in ODS-v2 allows the system to run Python code for improved reasoning and precision, offering more flexibility than the CoT-based ReAct in ODS-v1.
  • Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models. This survey examines reasoning economy in LLMs, exploring how to balance deep reasoning performance with computational cost. It reviews inefficiencies, behavioral patterns, and potential solutions during both post-training and inference stages.
  • Omni SVG. By interpreting SVGs as a foreign language, a pretrained Qwen model can generate original SVGs from text and images, setting a new state-of-the-art. An open release is expected soon.
  • OLMoTrace. A debate in language modeling centers on how much models truly learn versus what they merely memorize. A new feature in the AI2 Playground addresses this by searching billions of input documents in real time to determine whether a model’s output is original or regurgitated, providing source references from multiple documents.
  • Efficient MoE Inference. HybriMoE is a new framework for hybrid CPU-GPU inference on Mixture of Experts models, addressing instability and overhead through improved scheduling and caching techniques.
  • Protein Backbone Generation. ReQFlow establishes a new standard in protein backbone generation, achieving state-of-the-art performance while being much faster than current models — 37 times faster than RFDiffusion and 62 times faster than Genie2 for 300-length sequences.
  • Cogito v1 Preview: Introducing IDA as a path to general superintelligence. The Cogito team has open-sourced LLMs from 3B to 70B parameters, all outperforming leading open models in their size class, with the 70B model exceeding Llama 4 109B MoE. Trained using Iterated Distillation and Amplification (IDA), these models handle both direct and reflective answering, and are available on Hugging Face, Ollama, Fireworks AI, and Together AI.
  • OmniCaptioner. OmniCaptioner is an all-in-one visual captioning framework that produces rich textual descriptions across various visual domains, such as natural images, textual visuals, and structured graphics. It boosts visual reasoning in LLMs, supports better image generation, and enables efficient supervised fine-tuning with reduced data requirements.
  • BrowseComp agent benchmark. OpenAI has introduced a new agent-based benchmark that evaluates an agent’s skill in finding difficult-to-locate information through browser interactions. Its Deep Research model scores 51%, compared to roughly 80% for humans.
  • Neural Motion Simulator for Embodied AI. MoSim introduces a world model for motion dynamics that improves skill acquisition and enables zero-shot learning, effectively turning model-free RL into model-based.
  • VideoChat R1. An inference-time compute captioning method for video question answering. It uses RL to improve the model’s overall reasoning, which leads to a 30% boost in object tracking tasks.
  • Mammal Pose Estimation. KITPose is a novel keypoints-interactive model designed for general mammal pose estimation.
  • Our vision for accelerating creativity and productivity with agentic AI. Adobe is embedding agentic AI throughout its product lineup — Acrobat, Photoshop, and Premiere Pro — to boost creativity and productivity by automating routine tasks and offering intelligent suggestions. This AI-driven approach empowers users by reducing technical overhead, streamlining workflows, and enabling greater focus on creative work.
  • US engineers’ AI converts simple text into walking robots in a day. Duke University’s Text2Robot allows non-experts to design functional 3D robots using simple text prompts through a generative AI framework, making robotic design more accessible by removing the need for advanced technical expertise.
  • Sculptor: Catch and fix issues as you code. Sculptor is an early access coding agent environment that incorporates software engineering best practices by executing code in secure, sandboxed environments.
  • WordPress AI Website Builder. A new AI-powered website builder generates full WordPress sites based on user input, making it a great tool for entrepreneurs, freelancers, and bloggers.

Perspectives

Meme of the week

What do you think about it? Did any news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

or you may be interested in one of my recent articles:

Salvatore Raieli

Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence
