
WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 19–25 May

22 min read · May 26, 2025
Photo by Markus Winkler on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check out and star this repository, where the news is collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field


Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.

Research

  • Large Language Models Are More Persuasive Than Incentivized Human Persuaders. Claude 3.5 Sonnet outperformed human persuaders in a controlled study, achieving a 7.6% higher success rate in influencing participants’ quiz responses. It was more effective than the human persuaders at steering participants toward both correct answers (accuracy up 12.2 percentage points) and incorrect ones (accuracy down 15.1 points).
  • Robustness of LLM-based Safety Judges. The study reveals weaknesses in LLM-based safety judges, demonstrating that their assessments can be heavily influenced by prompt variations and adversarial attacks.
  • Introducing the AI Gateway. Vercel has launched an AI Gateway for alpha testing, enabling easy switching between ~100 AI models without managing API keys or accounts.
  • Robin: A multi-agent system for automating scientific discovery. FutureHouse used a continuous experimental loop combining literature search agents and a data analysis agent to speed up medical discovery. The system autonomously forms hypotheses from literature, suggests experiments for humans to carry out, and then analyzes the results to guide further research. This approach led to the identification of ripasudil, an eye drop that reduces cellular tension, as a potential treatment for age-related vision loss caused by the gradual decline of retinal light-sensitive cells. All code, data, and agent interaction logs will be publicly released on May 27.
  • Slow Thinking Improves Confidence in LLMs. Extended chain-of-thought reasoning helps large language models better calibrate their confidence.
  • AlphaEvolve: A coding agent for scientific and algorithmic discovery. AlphaEvolve, developed by Google DeepMind, is a coding agent that uses LLM-guided evolution to optimize algorithms and computational systems. It combines code generation, evaluation, and iterative refinement to drive discovery, exemplified by its development of a new 4×4 complex matrix multiplication algorithm using 48 multiplications, surpassing Strassen’s 1969 result. AlphaEvolve has improved mathematical bounds in problems like Erdős’s minimum overlap and the kissing number in 11 dimensions, while also optimizing Google’s compute infrastructure, from data center scheduling and matrix kernels to TPU circuits and compiler code. The system employs ensembles of Gemini models, advanced prompts, full-file evolution, and multi-objective filtering, with each element essential to its success, as shown by ablation studies.
  • LLMs Get Lost In Multi-Turn Conversation. LLMs degrade heavily in performance during multi-turn interactions with underspecified prompts, dropping 39% on average. Issues include premature answers, reliance on prior mistakes, and loss of middle-turn info. Sharded simulations reveal the problem across tasks, with interventions like recapping only partially effective. The paper concludes that the problem lies in model internals, not prompting.
  • Reinforcement Learning for Reasoning in Large Language Models with One Training Example. RLVR dramatically boosts LLM math reasoning: just one example can match the performance of models trained on thousands. On Qwen2.5-Math-1.5B, 1-shot RLVR raises MATH500 accuracy from 36.0% to 73.6%, while 2-shot slightly surpasses that. This data efficiency generalizes across models and tasks, with post-saturation gains, domain transfer, and improved self-reflection. Policy gradient loss drives the gains, not weight decay.
  • SEM: Reinforcement Learning for Search-Efficient Large Language Models. SEM is an RL-based framework that teaches LLMs when to use external search and when to rely on internal knowledge, improving accuracy while reducing unnecessary search. Trained on balanced datasets (MuSiQue for unknowns, MMLU for knowns) with structured prompts, SEM uses Group Relative Policy Optimization (GRPO) for targeted reward shaping. It outperforms Naive RAG and ReSearch on HotpotQA and MuSiQue while cutting search rates on MMLU and GSM8K by over 40x. A toy sketch of this reward-shaping idea appears right after this list.
  • Reasoning Models Don’t Always Say What They Think. Anthropic’s research shows that chain-of-thought (CoT) rarely reflects what AI models actually use to reason: models verbalize the hints they relied on less than 20% of the time. Even outcome-based RL only slightly improves faithfulness, and reward hacks often go unspoken. This challenges the trustworthiness of CoT as a transparency tool, highlighting risks for safety in high-stakes AI applications.
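
The SEM item above describes reward shaping that encourages a model to search only when its own knowledge is insufficient. The snippet below is not the paper’s actual reward function, just a minimal Python sketch of the idea under simple assumptions: the function name, the flat search penalty, and exact-match scoring are all illustrative, and a GRPO-style trainer would compare such rewards across a group of rollouts for the same prompt.

```python
# Toy reward-shaping sketch (illustrative, not SEM's exact formulation):
# reward correct answers, and subtract a penalty when the rollout issued an
# external search call on a question the model should already know.

def search_efficiency_reward(answer: str,
                             gold: str,
                             used_search: bool,
                             question_is_known: bool,
                             search_penalty: float = 0.3) -> float:
    """Scalar reward for a single rollout.

    used_search       -- whether the rollout called the external search tool
    question_is_known -- True for questions answerable from parametric
                         knowledge (MMLU-style), False for retrieval-heavy
                         questions (MuSiQue-style)
    """
    correct = 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
    if question_is_known and used_search:
        # Correct but wasteful: discourage unnecessary tool use.
        return correct - search_penalty
    return correct


# In a GRPO-style update only relative rewards within a group matter,
# so the penalty nudges the policy away from redundant searches.
print(search_efficiency_reward("Paris", "Paris", used_search=True,
                               question_is_known=True))   # 0.7
print(search_efficiency_reward("Paris", "Paris", used_search=False,
                               question_is_known=True))   # 1.0
```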

News

Resources

  • How Hardware Limitations Have, and Will, Prevent Rapid AI Takeoffs. Key algorithmic advances in LLMs — such as transformers, multi-query attention, and mixture-of-experts — only yield major benefits (10–50x performance gains) when paired with massive compute resources. This reality challenges expectations of fast AI self-improvement, as hardware constraints like export controls, energy limits, and cooling infrastructure pose significant barriers to any rapid “intelligence explosion.”
  • Open Source Alternative to Google’s New AI Algorithm-Discovering Agent. OpenAlpha_Evolve is an open-source Python framework inspired by the recently released technical paper for DeepMind’s AlphaEvolve.
  • Parallel Scaling for LLMs. ParScale has introduced a third LLM scaling paradigm by leveraging parallel computation at both training and inference time.
  • Spoken Dialogue Evaluation. WavReward is an audio-language model evaluator designed to assess spoken dialogue systems based on cognitive and emotional metrics. It is trained on ChatReward-30K, a dataset of diverse audio interactions labeled with user preferences.
  • Generative AI Adoption Index. Businesses are prioritizing generative AI over security in their 2025 budgets. They’re appointing new leaders such as Chief AI Officers and actively pursuing AI talent through hiring and internal training. A common approach blends ready-made AI models with customized tools built on their own data.
  • Stability AI and Arm Release Low Latency Audio Model for On-Device Audio Generation. Stability AI is open-sourcing Stable Audio Open Small, a 341 million parameter text-to-audio model optimized to run on Arm CPUs.
  • Jensen Huang on Global AI Strategy and Chip Controls. Nvidia CEO Jensen Huang claims U.S. chip export limits are counterproductive, pushing China to develop rival AI systems and costing American firms significant income. He noted Nvidia had to write off $5.5 billion in inventory and lost $15 billion in potential sales to China. Huang expects AI to move beyond IT into areas like manufacturing and operations, forming a much bigger market where businesses might spend “$100,000 a year” on AI workers to fill labor gaps.
  • How far can reasoning models scale? OpenAI’s o3 reasoning model has advanced quickly but may soon hit scaling limits. With overall training compute growing roughly fourfold per year, the compute devoted to reasoning training may converge to that pace after its initial surge, slowing the recent rapid gains. While challenges around data availability and generalization exist, researchers are still hopeful about ongoing progress in reasoning performance.
  • Meet China’s Frontier AI Labs. China’s AI landscape features five key players: Alibaba leads in open source; ByteDance, like Meta, deploys multimodal tech across its consumer apps; Stepfun, backed by Shanghai, specializes in multimodal integration; Zhipu, spun out of Tsinghua, focuses on intelligent agents; and DeepSeek stands out for research, particularly innovative architecture optimization.
  • ShieldGemma 2. ShieldGemma 2, based on Gemma 3, is DeepMind’s open-source content moderation model with 4 billion parameters, created to serve as an input filter for vision-language models or an output filter for image generation tools.
  • Fine-Tuning Qwen2.5B for Reasoning. This repository fine-tunes the Qwen2.5B model for reasoning tasks using a cost-effective SFT + GRPO pipeline inspired by DeepSeek R1 and optimized for AWS.
  • Microsoft and Hugging Face expand collaboration to make open models easy to use on Azure. Microsoft and Hugging Face expanded their partnership to integrate over 10,000 Hugging Face models into Azure AI Foundry.
  • Poe Report Shows Rapid Shifts in AI Model Market Share. A report from Quora’s Poe platform shows major changes in AI model usage between January and May 2025. OpenAI’s GPT-4.1 and Google’s Gemini 2.5 Pro saw rapid growth, while usage of Anthropic’s Claude models dropped. GPT-4.1 leads in general text, Gemini 2.5 Pro tops reasoning, Google’s Imagen3 dominates image generation, and video creation remains competitive, with Runway in the lead.
  • Relational Foundation Model for Enterprise Data. KumoRFM is a pre-trained relational foundation model designed to work across any database and predictive task without task-specific training.
  • The Definitive Overview of Reinforcement Learning. Kevin Murphy, a highly cited researcher at Google, has released an updated version of his 200-page reinforcement learning textbook, covering topics from classic methods to the latest advances such as DPO, GRPO, and reasoning.
  • ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. François Chollet and the ARC Prize team have launched ARC-AGI-2, a much tougher version of their abstract reasoning benchmark. Early results show top AI models performing poorly — o3 scored only 3%, down from 53% on the original — while humans averaged 75% accuracy. The ARC Prize 2025 offers $1 million in awards, with a $700,000 grand prize for the first team to reach 85% accuracy.
  • DeepSeek-V3 Training Insights. DeepSeek researchers have presented DeepSeek-V3 as an example of hardware-model co-design, tackling LLM scaling challenges with techniques like Multi-head Latent Attention, Mixture of Experts, FP8 training, and a Multi-Plane Network Topology to boost GPU efficiency and reduce communication costs.
  • Production-ready MCP integration for any AI application. Klavis AI streamlines integration with production-grade MCP servers, providing reliable connections, built-in authentication, and support for multiple clients. It works with custom MCP servers and over 100 tool integrations to enhance AI app scalability. Hosted options allow users to create MCP instances and configure OAuth for smooth deployment.
  • AI-generated literature reviews threaten scientific progress. Although artificial intelligence (AI) tools such as OpenAI’s ‘deep research’ offer researchers the possibility of compiling literature reviews at unprecedented speed, they could undermine scientific progress.
  • Mistral’s Agentic LLM for Software Engineering. Mistral AI and All Hands AI have introduced Devstral, a new open-source LLM optimized for software engineering.
  • Minimal MCP + A2A Example. A toy example demonstrating the basics of the Model Context Protocol (MCP) and the Agent-to-Agent (A2A) protocol with a simple ping check.
  • Building an agentic image generator that improves itself. Large language models show strong reasoning abilities when describing visual flaws in natural language but fall short in translating those insights into exact pixel-level edits. They perform well when tasks are narrowly defined, but their effectiveness drops when required to juggle abstract aesthetic choices with precise visual adjustments. This highlights a gap in connecting symbolic reasoning with spatial grounding, particularly in tasks that require detailed, step-by-step image modifications.
  • LLM function calls don’t scale; code orchestration is simpler, more effective. Providing large language models with complete tool outputs is expensive and inefficient. Output schemas let developers retrieve structured data for easier processing. Using code execution to handle data from MCP tools helps scale AI capabilities, but granting the execution environment access to MCPs, tools, and user data demands careful planning around API key management and tool exposure. A small illustrative sketch of this orchestration pattern follows at the end of this list.
  • LLM-based Agentic Development. A practical framework for building LLM-based agentic systems, covering evaluation-centric development.
  • How I used o3 to find CVE-2025-37899, a remote zero-day vulnerability in the Linux kernel’s SMB implementation. This post describes how a security researcher found a zero-day vulnerability in the Linux kernel with the help of OpenAI’s o3 model. The researcher used the o3 API directly, without any additional scaffolding, agentic frameworks, or tools. Large language models have advanced considerably in code reasoning, and vulnerability researchers should take note, as this technology can greatly boost their efficiency and effectiveness.
  • Quantizing Diffusion Models. Quantization techniques in Hugging Face Diffusers shrink model size without large performance drops, making diffusion models more efficient and accessible. A hedged usage example appears after this list.
  • Emerging Properties in Unified Multimodal Pretraining. ByteDance has introduced BAGEL, a new open-source multimodal foundation model designed for native multi-modal understanding and generation. BAGEL surpasses other open-source unified models, offering advanced capabilities like image editing, 3D manipulation, and world navigation.
  • Notte Labs Web Agent Framework. Notte is an open-source framework for building AI agents that can browse and interact with websites. Its key feature is a “perception layer” that translates web pages into structured natural language descriptions.
  • Google I/O 2025 AI Recap Podcast. Google’s latest Release Notes podcast highlights AI announcements from I/O 2025, including Gemini 2.5 Pro Deep Think, Veo 3, and developer tools like Jules.
  • AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale. A new 32B language model, trained on public data, matches or surpasses much larger MoE models in reasoning tasks, achieving 85.3 on AIME 2024 and 70.3 on LiveCodeBench. It uses a two-stage post-training pipeline (SFT and RL) with high-quality data filtering and a custom rollout framework for efficient, scalable inference. This approach shows that a well-designed training process can unlock top-tier performance at mid-scale sizes.
  • HealthBench: Evaluating Large Language Models Towards Improved Human Health. HealthBench is a benchmark of 5,000 multi-turn health conversations with 48,562 rubric criteria written by physicians from 60 countries, enabling realistic, open-ended LLM evaluation. It shows rapid frontier-model gains: GPT-3.5 Turbo at 16%, GPT-4o at 32%, and o3 at 60%. Smaller models like GPT-4.1 nano outperform larger ones. Physicians often can’t improve model completions, and models like GPT-4.1 grade reliably. Yet safety gaps remain, with “worst-at-k” scores showing reliability challenges; a sketch of that metric follows after this list.
  • Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning. Tool-N1 is a family of tool-using LLMs trained via rule-based RL, using binary feedback to reward correct, functional tool calls instead of step-by-step supervision. Tool-N1-7B and -14B outperform GPT-4o and others on benchmarks like BFCL and API-Bank. Pure RL training beats SFT-then-RL pipelines, and strict binary rewards improve generalization over partial credit schemes. Tool-N1’s approach scales well and generalizes across model architectures.
  • Cost-Effective, Low Latency Vector Search with Azure Cosmos DB. Azure Cosmos DB integrates DiskANN for fast, scalable vector search within operational datasets. Each partition holds a single vector index in existing index trees, enabling <20ms query latency over 10 million vectors with stable recall during updates. It outperforms Zilliz and Pinecone with 15× and 41× lower query costs and can scale to billions of vectors via automatic partitioning.
  • AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges. This review paper defines AI Agents as modular, task-specific systems using LLMs and tools, and Agentic AI as a shift toward multi-agent collaboration, dynamic task decomposition, and orchestrated autonomy. It compares architectures, capabilities, and challenges of both, outlines applications, and suggests solutions like RAG, orchestration layers, and causal modeling for future AI systems.
  • CellVerse: Do Large Language Models Really Understand Cell Biology? This paper introduces a benchmark to test LLMs on single-cell biology tasks by translating multi-omics data into natural language. Despite some reasoning ability, models like DeepSeek and GPT-4 perform no better than random guessing on key tasks like drug response prediction, revealing major gaps in biological understanding.
  • LLM Post-Training: A Deep Dive into Reasoning Large Language Models. A new survey shows that while pre-training builds a model’s foundation, it’s post-training that shapes true capability. By analyzing fine-tuning, RL, and test-time scaling, the paper highlights how post-training improves reasoning, accuracy, and alignment, addressing challenges like forgetting and reward hacking. The work emphasizes post-training’s central role in unlocking high-performance, aligned models.
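
As a companion to the function-calling item above, here is a minimal, hypothetical illustration of the code-orchestration pattern: the structured tool output is reduced in code, and only a compact summary is placed in the model’s context. The `fetch_tickets` tool and its schema are invented for the example.

```python
# Hypothetical sketch: process structured MCP/tool output in code instead of
# pasting the full payload into the LLM prompt.

from collections import Counter


def fetch_tickets() -> list[dict]:
    """Stand-in for a tool that returns structured records (schema invented)."""
    return [
        {"id": 1, "status": "open",   "priority": "high"},
        {"id": 2, "status": "closed", "priority": "low"},
        {"id": 3, "status": "open",   "priority": "high"},
        # ...in practice this could be tens of thousands of rows
    ]


def summarize_tickets(tickets: list[dict]) -> dict:
    """Reduce the full payload to the few numbers the model actually needs."""
    by_status = Counter(t["status"] for t in tickets)
    high_priority_open = sum(
        1 for t in tickets if t["status"] == "open" and t["priority"] == "high"
    )
    return {"counts": dict(by_status), "high_priority_open": high_priority_open}


# Only this compact summary goes into the prompt, not the entire tool output.
print(summarize_tickets(fetch_tickets()))
```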
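
For the Diffusers quantization item, this is roughly what 4-bit loading looks like, assuming a recent `diffusers` release with the bitsandbytes backend installed; the model ID and settings are illustrative and not necessarily what the linked article demonstrates.

```python
# Hedged sketch of 4-bit quantization with Hugging Face Diffusers + bitsandbytes
# (assumes diffusers >= 0.31 with bitsandbytes installed; model id illustrative).
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize the large transformer backbone; the rest of the pipeline stays in bf16.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe("a watercolor fox in a misty forest", num_inference_steps=28).images[0]
image.save("fox.png")
```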
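
Finally, the “worst-at-k” reliability score mentioned in the HealthBench item: take k sampled completions per conversation, keep the score of the worst one, and average over conversations, so a single unsafe or low-quality response drags the number down. Below is a small sketch under the assumption that per-completion scores are already computed; the data layout and function name are made up for illustration.

```python
import random


def worst_at_k(scores_per_example: list[list[float]], k: int,
               n_trials: int = 1000, seed: int = 0) -> float:
    """Estimate worst-at-k: for each example, sample k of its completion
    scores, keep the minimum, then average those minima over all examples
    (repeated over random subsets for a stable estimate).
    Assumes each inner list holds the rubric score of one sampled completion.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        minima = [min(rng.sample(scores, k)) for scores in scores_per_example]
        total += sum(minima) / len(minima)
    return total / n_trials


# Toy data: 3 conversations, 5 scored completions each.
scores = [
    [0.90, 0.80, 0.85, 0.10, 0.90],
    [0.70, 0.75, 0.70, 0.72, 0.68],
    [0.95, 0.90, 0.92, 0.88, 0.91],
]
print(f"worst-at-3 = {worst_at_k(scores, k=3):.3f}")  # well below the mean score
```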

Perspectives

  • Superhuman Coders in AI 2027. Superhuman coding by AI is now expected around 2033, later than AI Futures’ earlier projections of 2028–2030. The delay stems from challenges like managing engineering complexity, operating without feedback loops, and meeting cost and speed requirements. Additional setbacks, such as geopolitical tensions or shifting priorities at leading labs, could extend the timeline even further.
  • There should be no AI button. The “AI button” design pattern is restrictive and draws unneeded lines between AI-supported and manual tasks. More effective options, such as embedding AI as a “shadow teammate” in workflows, improve collaboration while keeping the user experience unified.
  • AI linked to explosion of low-quality biomedical research papers. Analysis flags hundreds of studies that seem to follow a template, reporting correlations between complex health conditions and single variables based on publicly available data sets.
  • Are groundbreaking science discoveries becoming harder to find? Researchers are arguing over whether ‘disruptive’ or ‘novel’ science is waning — and how to remedy the problem.
  • The path for AI in poor nations does not need to be paved with billions. Researchers in low- and middle-income countries show that home-grown artificial-intelligence technologies can be developed, even without large external investments.
  • ‘AI models are capable of novel research’: OpenAI’s chief scientist on what to expect. Jakub Pachocki, who leads the firm’s development of advanced models, is excited to release an open version to researchers.
  • Data resources must be protected from political interference. In April, the US National Institutes of Health (NIH) prohibited researchers in “countries of concern”, such as China, Russia and Iran, from using its controlled-access data repositories and associated data.
  • AI bots threaten online scientific infrastructure. In April, Wikipedia reported on its battles with artificial intelligence (AI) bot crawlers.
  • The SignalFire State of Talent Report. Tech hiring for recent grads has dropped over 50% from pre-pandemic levels, as AI tools take over many entry-level roles, though demand for experienced engineers remains strong. Anthropic has become the frontrunner in the AI talent race, keeping 80% of its staff and actively recruiting from rivals. Engineers are now eight times more likely to leave OpenAI or DeepMind for Anthropic than the other way around.
  • My Prompt, My Reality. AI products rely significantly on user prompts, unlike traditional software that delivers consistent results. Outcomes can vary based on subtle intent and context, even with skilled prompting. Product teams can enhance performance by refining prompts and using follow-up questions to better steer users.
  • Stargate and the AI Industrial Revolution. AI isn’t just a clever software layer atop the internet stack; it is the foundation of a new Industrial Revolution. Stargate isn’t a data center; it’s a factory for cognition.

Meme of the week

What do you think? Did any news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.



Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence
