WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 14 — 20 April

15 min read · May 11, 2025
Photo by Markus Winkler on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field

Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.

Research

  • Document Reranking. LLM4Ranking is a recently introduced modular framework that works with both open and closed LLMs for document reranking, offering evaluation tools and reproducible benchmarks on well-known datasets.
  • d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning. The d1 framework enhances masked diffusion language models through a two-stage process: supervised fine-tuning on a small dataset followed by task-specific reinforcement learning using the novel diffu-GRPO method. This approach enables efficient gradient updates via random prompt masking, achieving strong performance gains on reasoning tasks like GSM8K and MATH500, outperforming similarly sized models while benefiting from longer outputs and faster convergence.
  • Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability. Researchers show that smaller models can gain strong reasoning abilities by being fine-tuned on final answers (and optionally summarized reasoning) from large LLMs. Using a curated 1.3M-instance dataset, they test different distillation strategies, finding that training on final answers alone boosts math/coding accuracy, while combining with summarized thoughts aids alignment tasks. Results highlight trade-offs in including reasoning traces and suggest future blending techniques for improved performance.
  • Visual Reasoning with Less Data. Using MCTS to quantify sample difficulty, ThinkLite-VL improves reasoning in VLMs with just 11k training samples and no distillation.
  • Reasoning Models Can Be Effective Without Thinking. The paper introduces NoThinking, a prompting method that skips explicit reasoning steps yet matches or outperforms traditional chain-of-thought approaches in tasks like math, coding, and theorem proving. By jumping directly to answers with a dummy “Thinking” block, it achieves better accuracy–latency tradeoffs, excels in low-token settings, and benefits from parallel decoding, making it both faster and more efficient across multiple benchmarks.
  • SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users. SocioVerse, developed by Fudan University and collaborators, is a large-scale social simulation platform using LLM agents aligned with real-world data across environment, user demographics, interaction scenarios, and behavior. It achieves high accuracy in modeling elections, sentiment, and economic patterns, demonstrating the value of realistic user modeling. SocioVerse offers a scalable, flexible framework for testing sociopolitical hypotheses and bridging AI with social science.
  • M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models. M1 is a Mamba-based reasoning model trained with extended test-time computation, delivering solid performance — particularly on long-context tasks and throughput — though it doesn’t quite reach state-of-the-art levels.
  • Large Reasoning Models as a Judge. JudgeLRM is a family of LLMs trained with reinforcement learning for judgment tasks. Unlike SFT-trained judges, it excels in reasoning-heavy evaluations, outperforming models like GPT-4 and DeepSeek-R1.
  • Conversational AI for Cells. C2S-Scale is a new family of LLMs that interprets single-cell data and translates biological signals into natural language for applications in personalized medicine and drug discovery.
  • DocAgent: A Multi-Agent System for Automated Code Documentation Generation. Meta AI’s DocAgent is a tool-integrated framework that generates high-quality docstrings for complex codebases using a team of specialized agents and a topological traversal strategy. By parsing code dependencies and incrementally building context, it avoids token overflow and improves documentation quality. Evaluated on Python projects, DocAgent significantly outperforms baselines in completeness, helpfulness, and truthfulness, with its dependency-aware Navigator proving essential to its success.
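The dependency-first ordering described in the DocAgent item can be illustrated with a small sketch. This is not DocAgent's implementation: the dependency map `deps` and the `write_docstring` placeholder (standing in for the team of agents) are hypothetical, and the ordering relies on Python's standard `graphlib` module. The point is only that each component is documented after everything it depends on, so the context passed along stays small.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists the components it calls.
deps = {
    "parse_config": [],
    "load_data": ["parse_config"],
    "train": ["load_data", "parse_config"],
    "main": ["train"],
}

def write_docstring(name, context):
    """Placeholder for the LLM agents: describe a component using its dependencies' docs."""
    used = ", ".join(sorted(context)) or "nothing"
    return f"{name}: documented using context from {used}"

def document_in_dependency_order(deps):
    """Visit components dependencies-first, passing only direct-dependency docs as context."""
    docs = {}
    # static_order() yields every node after all of its predecessors (its dependencies).
    for name in TopologicalSorter(deps).static_order():
        context = {d: docs[d] for d in deps[name]}  # already documented by construction
        docs[name] = write_docstring(name, context)
    return docs

docs = document_in_dependency_order(deps)
for text in docs.values():
    print(text)
```

Real codebases can contain circular dependencies, for which `graphlib` raises `CycleError`; how DocAgent handles that case is not covered by the summary above.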

News

Resources

  • Anthropic Education Report: How University Students Use Claude. Anthropic has released a great educational report on how different groups of university students are using AI. Most groups in STEM use it for homework help while groups in humanities use it less and generally more for ideation and brainstorming.
  • 3D Object Part Segmentation. HoloPart is a semantic 3D object segmentation model that can split a single 3D object into meaningful parts.
  • Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models. C-Prune is a two-stage pruning method that compresses Mixture-of-Experts models by clustering similar experts and pruning redundant clusters.
  • Jax Recommendation Engine. A great recommendation engine with metrics, implementations of embedding models, and training infrastructure.
  • Reasoning VLM from Kimi. An early open model for visual question answering, this compact model excels at grounded image-based questions, image captioning, and even some image-related math.
  • Fully open fast inference models. Apriel models from ServiceNow Research are designed for fast inference and showcase good performance.
  • GUI-R1. GUI-R1, developed by researchers in Singapore and China, is a reinforcement learning framework that enhances GUI agents by using a unified action space and reinforcement fine-tuning, needing only 3,000 curated examples. It achieves superior performance and generalization across platforms like Windows, Mac, Android, and Web, outperforming models trained on millions of examples while remaining efficient and adaptable with minimal data.
  • AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents. AgentA/B is an automated A/B testing system that uses LLM-based agents to simulate realistic user behavior on live websites, enabling fast, low-risk UX evaluation. With modular components and DOM parsing for structured interactions, it replicates human-like shopping patterns and supports inclusive prototyping. Tests on Amazon showed agents responded meaningfully to interface changes, suggesting strong alignment with real user behavior and value as a pre-deployment testing layer.
  • Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model. ByteDance has published a paper demonstrating how to train a competitive 7B-parameter video generation model using a relatively modest compute budget of 665,000 H100 GPU hours, achieving strong results on several challenging temporal tasks.
  • PixelFlow: Pixel-Space Generative Models with Flow. Due to computational limits, most generative models for continuous signals work in latent space. This study presents a cascade approach that enables direct generation in pixel space, removing the requirement for a pretrained VAE.
  • InteractVLM: 3D Interaction Reasoning from 2D Foundational Models. A new VLM that can reason about 3D contacts between humans and objects. It does so by leveraging a strong base model and lifting its reasoning into 3D with clever multi-view rendering.
  • 3B parameter tokenizer. Scaling image tokenizers is difficult due to their tendency to collapse. This study presents GigaTok, a large-scale tokenizer that achieves excellent reconstruction quality, with stability and performance improved through decoder scaling and regularization.
  • Improved MoE with C3PO. C3PO proposes a test-time optimization method that boosts accuracy in Mixture-of-Experts LLMs by adjusting expert weights using similar reference examples.
  • BrowseComp Benchmark for Hard-to-Find Knowledge. OpenAI’s BrowseComp is a benchmark consisting of 1,266 tasks aimed at testing AI agents’ ability to browse the web and retrieve complex, hard-to-find information.
  • DeepSeek to Open Source its Inference Engine. DeepSeek’s inference engine is built on vLLM, although it is now heavily modified.
  • MoonDream 2.0 Release. MoonDream is a small, 2B-parameter VLM that outperforms many open and closed models. It has recently received a strong upgrade on ChartQA and a number of other useful benchmarks.
  • Data Decide. AllenAI has released a tool that can be used to help decide which data to include in pre-training. This framework is quite useful for understanding what goes into a filtering run for pre-training.
  • Conversion Rate Prediction in Ad Systems. Pinterest researchers have proposed a multitask framework that uses Deep Hierarchical Ensemble Networks to improve CVR predictions in ad systems. It shows state-of-the-art results through feature combination and auxiliary learning.
  • Open Source OpenAI Production Kernels. OpenAI has open-sourced some of its fp4 and MoE kernels in the Triton language GitHub repository.
  • Nemotron H Models. Nvidia’s ADLR team has released the weights for its Nemotron hybrid Mamba models, which offer strong long-context handling and solid performance on general benchmarks, making them well-suited for tasks requiring extended reasoning or memory.
  • Auto Deploy. A new way to transform PyTorch and Hugging Face models into a deployable format for fast inference.
  • Latents for Generative Modeling. A top contender for blog post of the year for those into generative modeling, offering a clear breakdown of the history, core ideas, and major advancements in learned latents.
  • NVIDIA’s Temporally Consistent Video Diffusion. NVIDIA’s EquivDM framework improves video diffusion by applying consistent noise, leading to better motion tracking and more 3D-consistent results with fewer sampling steps.
  • Intellect 2 Distributed Training. Prime Intellect has developed a 32B model trained for reasoning with fully distributed reinforcement learning, and has open-sourced much of its code and valuable libraries.
  • DeepMath dataset. 103K examples of highly filtered and decontaminated math problems for reasoning model training.
  • Prima CPP. Prima CPP is an extension of llama.cpp that tries to enable memory mapping (mmap) for large models so that they can run in low-RAM environments.
  • Tile Language. Tile Language is a compact domain-specific language aimed at simplifying the creation of high-performance GPU/CPU kernels like GEMM, Dequant GEMM, FlashAttention, and LinearAttention. It uses a Python-like syntax built on TVM’s compiler stack, enabling developer productivity while preserving low-level optimizations for top-tier performance.
  • Hugging Face Updated HELMET Benchmark. Hugging Face has expanded its HELMET benchmark to include more models and insights, helping researchers evaluate long-context LLMs like Phi-4 and Jamba 1.6.
  • Junfeng5/Liquid_V1_7B. Liquid is a multimodal LLM that integrates visual comprehension and generation by tokenizing images into discrete codes.
  • Efficient Line Art Colorization with Broader References. A new efficient long-context, fine-grained ID preservation framework for line art colorization delivers high accuracy, speed, and flexibility for comic coloring, converting black-and-white sketches into vivid illustrations by leveraging rich contextual references.
  • Scene Captioning. 3D CoCa is a unified framework that combines vision-language contrastive learning and captioning for 3D scenes.
  • DeepSpeed’s DeepCompile. The DeepSpeed team has integrated compilation into their distributed training workflow, significantly reducing several performance bottlenecks using a modified version of torch.compile.
  • Speech Instruction Fine-Tuning Dataset. SIFT-50M (Speech Instruction Fine-Tuning) is a dataset of 50 million examples created for instruction fine-tuning and pre-training speech-text LLMs. Sourced from 14,000 hours of public speech data, it uses LLMs and expert models, spans five languages, and supports both speech understanding and controllable speech generation. It enriches existing datasets with instruction-based QA pairs and includes around 5 million examples for generation tasks.
  • End-to-End Latent Diffusion Training with REPA-E. REPA-E enables stable, joint training of VAEs and latent diffusion models using a representation-alignment loss, achieving state-of-the-art results on ImageNet.
  • Meta Releases Many New Artifacts. Meta has released an image encoder, a VLM, a 3D object localization model based on JEPA, and weights for a BLT model that operates directly on bytes without tokenization.
  • Create AI-generated soundtrack in Shorts with Dream Track. YouTube’s Dream Track is now accessible in the U.S. through YouTube Shorts and the YouTube Create app, offering AI-generated instrumental soundtracks for creators. These tracks can be globally remixed to produce unique Shorts, promoting collaboration, and are fully integrated with YouTube’s creation tools while following community guidelines.
  • SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents. SWE-PolyBench is a new benchmark for evaluating coding agents on real-world tasks in Java, JavaScript, TypeScript, and Python. It uses execution-based and syntax tree metrics, revealing that current agents struggle with complex problems and perform inconsistently across languages.
  • A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems. This survey organizes LLM reasoning methods by timing (inference-time vs. training) and architecture (standalone vs. agentic/multi-agent), spotlighting trends like learning-to-reason and agentic workflows. It reviews techniques including prompt design, output refinement, and training approaches like PPO and verifier-based learning.
  • A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science. This paper surveys spatial intelligence across fields, linking human cognition to how LLMs manage spatial memory, reasoning, and representations. It proposes a unified framework bridging AI, robotics, urban planning, and earth science, emphasizing LLMs’ growing spatial abilities and interdisciplinary relevance.
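The memory-mapping idea mentioned in the Prima CPP item above can be shown in a few lines. This sketch is not taken from Prima CPP or llama.cpp: the file name `weights.bin`, the flat float32 layout, and the helper functions are all hypothetical. It only illustrates why mmap helps on low-RAM machines: a slice of a large file can be read through the map without loading the whole file into memory first.

```python
import mmap
import os
import struct
import tempfile

FLOAT = struct.Struct("<f")  # little-endian float32

def write_weights(path, values):
    """Write a flat float32 array to disk (a stand-in for a model weight file)."""
    with open(path, "wb") as f:
        for v in values:
            f.write(FLOAT.pack(v))

def read_slice(path, start, count):
    """Read `count` floats starting at index `start` through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = start * FLOAT.size
            # Only the pages backing this byte range need to be brought into RAM.
            raw = mm[offset : offset + count * FLOAT.size]
    return [x[0] for x in FLOAT.iter_unpack(raw)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")  # hypothetical file name
    write_weights(path, [float(i) for i in range(1000)])
    layer = read_slice(path, 500, 4)
    print(layer)
```

The operating system decides which pages stay resident, so actual memory use depends on the access pattern and on available RAM.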

Perspectives

Meme of the week

What do you think about it? Did any news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn, where I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence
