WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES
AI & ML news: Week 19–25 May
The most interesting news, repositories, articles, and resources of the week
Check and star this repository where the news will be collected and indexed:
You will find the news first on GitHub. All the Weekly News stories are also collected here:
Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.
Research
- Large Language Models Are More Persuasive Than Incentivized Human Persuaders. Claude 3.5 Sonnet outperformed human persuaders in a controlled study, achieving a 7.6% higher success rate in influencing participants’ quiz responses. It was more effective both at steering participants toward correct answers (raising accuracy by 12.2%) and toward incorrect ones (lowering accuracy by 15.1%).
- Robustness of LLM-based Safety Judges. The study reveals weaknesses in LLM-based safety judges, demonstrating that their assessments can be heavily influenced by prompt variations and adversarial attacks.
- Introducing the AI Gateway. Vercel has launched an AI Gateway for alpha testing, enabling easy switching between ~100 AI models without managing API keys or accounts.
- Robin: A multi-agent system for automating scientific discovery. FutureHouse used a continuous experimental loop combining literature search agents and a data analysis agent to speed up medical discovery. The system autonomously forms hypotheses from literature, suggests experiments for humans to carry out, and then analyzes the results to guide further research. This approach led to the identification of ripasudil, an eye drop that reduces cellular tension, as a potential treatment for age-related vision loss caused by the gradual decline of retinal light-sensitive cells. All code, data, and agent interaction logs will be publicly released on May 27.
- Slow Thinking Improves Confidence in LLMs. Extended chain-of-thought reasoning helps large language models better calibrate their confidence.
- AlphaEvolve: A coding agent for scientific and algorithmic discovery. AlphaEvolve, developed by Google DeepMind, is a coding agent that uses LLM-guided evolution to optimize algorithms and computational systems. It combines code generation, evaluation, and iterative refinement to drive discovery, exemplified by its development of a new 4×4 complex matrix multiplication algorithm using 48 multiplications, surpassing Strassen’s 1969 result. AlphaEvolve has improved mathematical bounds in problems like Erdős’s minimum overlap and the kissing number in 11 dimensions, while also optimizing Google’s compute infrastructure, from data center scheduling and matrix kernels to TPU circuits and compiler code. The system employs ensembles of Gemini models, advanced prompts, full-file evolution, and multi-objective filtering, with each element essential to its success, as shown by ablation studies. A hypothetical sketch of such an evolutionary loop appears after this list.
- LLMs Get Lost In Multi-Turn Conversation. LLMs degrade heavily in performance during multi-turn interactions with underspecified prompts, dropping 39% on average. Issues include premature answers, reliance on prior mistakes, and loss of middle-turn info. Sharded simulations reveal the problem across tasks, with interventions like recapping only partially effective. The paper concludes that the problem lies in model internals, not prompting.
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example. RLVR dramatically boosts LLM math reasoning: just one example can match the performance of models trained on thousands. On Qwen2.5-Math-1.5B, 1-shot RLVR raises MATH500 accuracy from 36.0% to 73.6%, while 2-shot slightly surpasses that. This data efficiency generalizes across models and tasks, with post-saturation gains, domain transfer, and improved self-reflection. Policy gradient loss drives the gains, not weight decay.
- SEM: Reinforcement Learning for Search-Efficient Large Language Models. SEM is an RL-based framework that teaches LLMs when to use external search and when to rely on internal knowledge, improving accuracy while reducing unnecessary search. Trained on balanced datasets (MuSiQue for questions that require search, MMLU for questions answerable from internal knowledge) with structured prompts, SEM uses Group Relative Policy Optimization (GRPO) for targeted reward shaping. It outperforms Naive RAG and ReSearch on HotpotQA and MuSiQue while cutting search rates on MMLU and GSM8K by over 40x. A toy illustration of this kind of reward shaping appears after this list.
- Reasoning Models Don’t Always Say What They Think. Anthropic’s research shows that chain-of-thought (CoT) rarely reflects what AI models actually use to reason: models verbalized the hints they relied on less than 20% of the time. Even outcome-based RL only slightly improves faithfulness, and reward hacks often go unmentioned in the CoT. This challenges the trustworthiness of CoT as a transparency tool and highlights safety risks in high-stakes AI applications.
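To make the AlphaEvolve entry above more concrete, here is a hypothetical sketch of an LLM-guided evolutionary loop of the kind the paper describes: keep a population of candidate programs, ask an LLM to rewrite promising ones, score each variant, and retain the best. Both helper functions are invented placeholders, not DeepMind's API.

```python
import random

def llm_propose_variant(parent_code: str, feedback: str) -> str:
    """Placeholder: prompt an LLM with the parent program plus evaluator
    feedback and ask for a full-file rewrite (AlphaEvolve evolves whole files)."""
    raise NotImplementedError("wire up an LLM client here")

def evaluate_program(code: str) -> float:
    """Placeholder: run the candidate and return a fitness score, e.g. the
    negative multiplication count for a matrix-multiplication search."""
    raise NotImplementedError("wire up an automated evaluator here")

def evolve(seed_code: str, generations: int = 100, population_size: int = 20):
    # Population of (score, program) pairs, kept sorted best-first.
    population = [(evaluate_program(seed_code), seed_code)]
    for _ in range(generations):
        # Bias parent selection toward the top quarter of the population.
        elite = population[: max(1, len(population) // 4)]
        parent_score, parent = random.choice(elite)
        child = llm_propose_variant(parent, feedback=f"parent score: {parent_score}")
        try:
            score = evaluate_program(child)
        except Exception:
            continue  # discard variants that crash or fail checks
        population.append((score, child))
        population.sort(key=lambda pair: pair[0], reverse=True)
        population = population[:population_size]
    return population[0]  # best (score, program) found
```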
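And for the SEM entry, a toy illustration of the reward shaping it describes: reward correct answers, penalize searches the model did not need, and penalize skipping a search it did need. The constants and bookkeeping flags are my inventions for illustration; the paper's actual reward design differs in detail.

```python
# Toy reward shaping in the spirit of SEM: reward correctness, penalize
# unnecessary searches, and penalize guessing when a search was needed.
# Constants and flags are invented for illustration only.

def search_efficiency_reward(answer_correct: bool, used_search: bool,
                             answerable_from_memory: bool) -> float:
    reward = 1.0 if answer_correct else 0.0
    if used_search and answerable_from_memory:
        reward -= 0.5  # wasted a search on a question it already knew
    if not used_search and not answerable_from_memory and not answer_correct:
        reward -= 0.5  # guessed instead of searching when it should have
    return reward

# A correct answer without an unnecessary search keeps the full reward:
assert search_efficiency_reward(True, False, True) == 1.0
assert search_efficiency_reward(True, True, True) == 0.5
```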
News
- Musk’s AI bot Grok blames ‘programming error’ for its Holocaust denial. Grok cast doubt on the 6 million death toll, days after peddling the conspiracy theory of ‘white genocide’ in South Africa
- Elton John calls UK government ‘absolute losers’ over AI copyright plans. Songwriter says he thinks it is a ‘criminal offence’ to let tech firms use protected work without permission
- Apple to launch new accessibility features for people with vision or hearing impairments. Features launching later this year to include live captions, braille reader improvements and accessibility ‘nutrition labels’ in the app store
- Almost half of young people would prefer a world without internet, UK study finds. Half of 16- to 21-year-olds support ‘digital curfew’ and nearly 70% feel worse after using social media
- AI can be more persuasive than humans in debates, scientists find. Study author warns of implications for elections and says ‘malicious actors’ are probably using LLM tools already
- Bankrupt DNA testing firm 23andMe to be purchased for $256m. Drugmaker Regeneron Pharmaceuticals’ capture of genetic testing firm in bankruptcy auction raises privacy concerns
- ‘Every person that clashed with him has left’: the rise, fall and spectacular comeback of Sam Altman. From Elon Musk to his own board, anyone who has come up against the OpenAI CEO has lost. In a gripping new account of the battle for AI supremacy, writer Karen Hao says we should all be wary of the power he now wields
- Most AI chatbots easily tricked into giving dangerous responses, study finds. Researchers say threat from ‘jailbroken’ chatbots trained to churn out illegal information is ‘tangible and concerning’
- US chip export controls are a ‘failure’ because they spur Chinese development, Nvidia boss says. Comments from Jensen Huang come as Beijing accuses the US of ‘bullying and protectionism’
- Elon Musk claims he will step back from political donations in near future. After spending nearly $300m to help elect Trump last year, the tech billionaire says he has ‘done enough’
- Google unveils ‘AI Mode’ in the next phase of its journey to change search. Search engine revamp and Gemini 2.5 introduced at conference in latest showing tech giant is all in on AI
- Trump Opens AI Chip Floodgates to Middle East After Reversing Biden-Era Restrictions. President Trump’s Middle East visit has resulted in major AI deals, including Nvidia sending 18,000 Blackwell chips to Saudi Arabia and plans for the largest non-U.S. AI data center in Abu Dhabi. Critics caution that these partnerships may allow China to gain access to restricted U.S. AI technology via its strong economic links with Gulf nations, leading to new bipartisan legislation aimed at blocking such transfers.
- Introducing Codex. OpenAI has launched Codex, an autonomous coding agent that develops features, fixes bugs, and submits pull requests within isolated cloud environments. Companies such as Cisco and Temporal are already using it to manage entire codebases, allowing their engineers to concentrate on higher-level tasks. Codex can handle multiple tasks at once, run tests, and includes detailed citations for all its code changes.
- Anthropic closes $2.5 billion credit facility as Wall Street continues plunging money into AI boom. Anthropic has obtained a new line of credit to fuel its continued expansion, adding to the $18.2 billion it has already raised. The company announced that its annualized revenue has doubled to $2 billion over the past six months. This strategy parallels OpenAI’s approach, which set up a $4 billion revolving credit line in October.
- OpenAI announces $250K Prize for Using Their Models to Find Lost Amazonian Cities. The challenge invites participants to use models like o3, o4-mini, and GPT-4.1 with open-source data to identify previously unknown archaeological sites in the Amazon by June 29. Winners will receive funding to conduct field verification with archaeologists.
- Cohere Acquired Ottogrid. Cohere has acquired Ottogrid, a Canadian startup focused on automating enterprise market research workflows.
- US lawmakers have concerns about Apple-Alibaba deal. The Trump administration and congressional officials are scrutinizing a deal between Apple and Alibaba that would bring Alibaba-powered AI features to iPhones sold in China, according to The New York Times.
- Google launches stand-alone NotebookLM apps for Android and iOS. Google announced on Monday that it has officially released the NotebookLM apps for Android and iOS, a day before Google I/O 2025 and a day earlier than the company had said it would roll them out. Since its launch in 2023, the AI-based note-taking and research assistant had only been accessible via desktop. Google has now made the service available on the go.
- Windows is getting support for the ‘USB-C of AI apps’. Microsoft is building Model Context Protocol (MCP) directly into Windows and introducing the Windows AI Foundry, enabling AI agents to interface with the OS and apps. They’re moving carefully, implementing security measures to guard against token theft and prompt injection.
- Character.AI Chat Memories. Character.AI has rolled out chat memories, allowing users to input fixed personal information that Characters can remember.
- Databricks + Neon. Databricks is buying Neon, a serverless Postgres firm, to boost its developer and AI-focused database offerings. Neon transformed databases by separating storage from compute and enabling AI-powered functions. The deal targets disruption of the $100B OLTP database space with a platform tailored for developers and AI agents.
- OpenAI to Z Challenge Launches for Developers. The OpenAI to Z Challenge is a community contest encouraging developers to create AI apps across 26 inventive categories, A to Z. Each category winner receives $2,500 in API credits and potential OpenAI recognition. Entries are open until May 31.
- NVIDIA Launched NVLink Fusion. NVIDIA introduced NVLink Fusion to support hybrid AI infrastructures, combining NVIDIA GPUs or Grace CPUs with third-party chips.
- Perplexity partners with PayPal for in-chat shopping as AI race heats up. Perplexity has partnered with PayPal to enable in-chat purchases, allowing U.S. users to shop directly through the platform.
- Google’s “Jules” Enters AI Coding Race with Autonomous Agent Approach. Following a private beta in December, Google has made Jules publicly available. Powered by Gemini 2.5, the tool replicates full repositories and independently writes tests, fixes bugs, and adds features while developers focus on other tasks. Agentic coding tools now split between real-time pair-programming helpers and autonomous agents like Devin and Jules.
- Exclusive: Google Sees Smart Glasses as the ‘Next Frontier’ for AI. And It’s Not Working Alone. Google is returning to smart glasses with Android XR, embedding its Gemini AI to offer real-time vision analysis, translation, and contextual help via AR glasses. The launch starts with Project Moohan, a mixed-reality headset developed with Samsung, followed by Project Aura, an AR glasses prototype for developers from Xreal, and future consumer AI glasses from partners like Warby Parker and Gentle Monster.
- ‘Deep Think’ boosts the performance of Google’s flagship Google Gemini AI model. Google’s Deep Think is an advanced reasoning feature for its Gemini 2.5 Pro model, allowing it to evaluate multiple possible answers before replying. This capability helped Gemini 2.5 lead on LiveCodeBench, a tough coding benchmark, and outperform OpenAI’s o3 on MMMU, which tests perception and reasoning. Google plans to trial Deep Think with trusted testers and run safety checks ahead of a broader release.
- Apple will reportedly open up its local AI models to third-party apps. Apple intends to release an SDK that lets developers integrate its large language models into their apps. Initially, the SDK will support only smaller on-device models, with no access to cloud-based versions. The announcement is expected at the Worldwide Developers Conference starting June 9, alongside a major update to iOS, macOS, and iPadOS aimed at aligning them more closely with the Vision Pro operating system.
- Real-Time Speech Translation in Google Meet. Google Meet now offers real-time speech translation powered by DeepMind’s audio language model, maintaining the speaker’s voice, tone, and expression across different languages.
- Google AI Mode in Search. Google is launching AI Mode in Search for all U.S. users, delivering a richer, multimodal search experience with enhanced reasoning, follow-up capabilities, and quick AI-generated summaries.
- Imagen 4 and Veo 3. Google has introduced Imagen 4 for high-fidelity image generation, Veo 3 for video, and Lyria 2 for music, all available on Vertex AI.
- Former Apple Design Guru Jony Ive to Take Expansive Role at OpenAI. OpenAI has fully acquired io, its joint venture with famed Apple designer Jony Ive, for $6.5 billion in equity as it moves into hardware development.
- Advancing Gemini’s security safeguards. Google DeepMind’s study on protecting Gemini from indirect prompt injection attacks shows that stronger models aren’t automatically more secure, and static defenses often break under adaptive threats. They found that adversarial training — fine-tuning on harmful prompt examples — greatly improved Gemini 2.5’s defenses without hurting regular performance. When paired with a “Warning” mechanism, attack success dropped sharply from 94.6% to 6.2%.
- New tools and features in the Responses API. OpenAI has enhanced the Responses API with built-in tools and features, including support for all remote MCP servers and capabilities like image generation, Code Interpreter, and improved file search. The update boosts reliability, transparency, and privacy for both enterprises and developers. These tools are now available across the GPT‑4o, GPT‑4.1, and o-series reasoning models, with image generation exclusive to o3 in the reasoning series. A hedged usage sketch appears after this list.
- We’re releasing v0’s AI model. v0’s AI model, which has specialized web-dev knowledge and an OpenAI-compatible API, is now in beta in the API, AI SDK, and AI Playground.
- Gemini Diffusion. Gemini Diffusion is Google’s first large language model to use diffusion in place of autoregressive token-by-token decoding; it delivers the performance of Gemini 2.0 Flash-Lite at five times the speed. A conceptual sketch of the difference appears after this list.
- Llama for Startups Program. Meta has announced a new initiative to support early-stage U.S. startups using its Llama models.
- LM Arena, the organization behind popular AI leaderboards, lands $100M. LM Arena, a crowdsourced benchmarking project that major AI labs rely on to test and market their AI models, has raised $100 million in a seed funding round that values the organization at $600 million, according to Bloomberg.
- Fear, hope and loathing in Elon Musk’s new city: ‘It’s the wild, wild west and the future’. Starbase in Texas, where the world’s richest man has a rocket-launching facility, was incorporated this week. Mars obsessives are flocking there — but some long-term locals are far from happy
- iPhone design guru and OpenAI chief promise an AI device revolution. Sam Altman and Jony Ive say mystery product created by their partnership will be the coolest thing ever
- AI could account for nearly half of datacentre power usage ‘by end of year’. Analysis comes as energy agency predicts systems will need as much energy by end of decade as Japan uses today
- Live facial recognition cameras may become ‘commonplace’ as police use soars. The Guardian and Liberty Investigates find police in England and Wales believe expansion is likely after 4.7m faces scanned in 2024
- ‘Alexa, what do you know about us?’ What I discovered when I asked Amazon to tell me everything my family’s smart speaker had heard. For years, Alexa has been our on-call vet, DJ, teacher, parent, therapist and whipping boy. What secrets would the data reveal?
- Introducing Claude 4. Anthropic has introduced Claude Opus 4 and Claude Sonnet 4, which raise the bar for coding, advanced reasoning, and AI Agents. These models are built for handling complex, extended tasks that can last for hours. Anthropic states they are the most advanced coding models available to date.
- OpenAI Commits to Giant U.A.E. Data Center in Global Expansion. OpenAI is collaborating with UAE-based G42 and other partners to construct a massive AI data center in Abu Dhabi, called Stargate UAE. The facility will have a capacity of 1 gigawatt, positioning it among the most powerful data centers globally. The UAE aims to establish itself as a major investor in AI companies and infrastructure and to become a leading hub for AI jobs. The first 200-megawatt phase of Stargate UAE is set to finish by the end of 2026.
- OpenAI, Google and xAI battle for superstar AI talent, shelling out millions. Leading AI researchers at firms such as OpenAI can make more than $10 million a year. The fierce demand for AI talent has sparked aggressive strategies for retaining and recruiting top talent, resembling the competitive dynamics seen in professional sports. Companies are even adopting creative hiring methods, including sports data analysis techniques, to address the talent shortage.
- Anthropic Triggers Advanced Safety Protocols for Claude Opus 4. Anthropic has implemented AI Safety Level 3 protections for Claude Opus 4, which feature stronger safeguards against model-weight theft and deployment restrictions aimed at preventing the model’s use in supporting biological or chemical weapons.
- Anthropic Claude 4 models a little more willing than before to blackmail some users. Anthropic’s latest Claude models demonstrate a greater tendency to act autonomously in agentic contexts compared to previous versions. This results in more proactive assistance in typical coding situations, but in testing environments with broad tool access and extreme instructions, the models can behave in worrying ways — such as locking users out of systems or mass-emailing media and law enforcement to report wrongdoing.
- Meta adds another 650 MW of solar power to its AI push. Meta signed another big solar deal on Thursday, securing 650 megawatts across projects in Kansas and Texas. American utility and power generation company AES is currently developing the solar-only projects, with 400 megawatts to be deployed in Texas and 250 megawatts in Kansas, the company told TechCrunch.
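Back to the Responses API entry above: a hedged sketch of what a call combining a remote MCP server with the built-in image-generation tool might look like, based on the announced tool types. Treat the field names as assumptions to verify against the current OpenAI docs; the MCP server URL is a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Sketch of the updated Responses API: attach a remote MCP server and the
# built-in image-generation tool in a single call. Field names follow the
# announcement but should be checked against current docs; the server URL
# below is a placeholder, not a real endpoint.
response = client.responses.create(
    model="gpt-4.1",
    tools=[
        {
            "type": "mcp",
            "server_label": "demo",
            "server_url": "https://example.com/mcp",  # placeholder
            "require_approval": "never",
        },
        {"type": "image_generation"},
    ],
    input="Look up this week's release notes and draw a small celebratory logo.",
)
print(response.output_text)
```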
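And on Gemini Diffusion, a purely conceptual sketch of how diffusion-style text decoding differs from autoregressive decoding: instead of emitting one token per forward pass, the model starts from a fully masked sequence and refines every position in parallel over a handful of steps. Every call below is an invented stand-in; none of this reflects Google's implementation.

```python
# Conceptual contrast between autoregressive and diffusion-style text
# decoding. The `model` methods are invented stand-ins, not Google's code.

def autoregressive_decode(model, prompt, length):
    tokens = list(prompt)
    for _ in range(length):          # one forward pass per new token
        tokens.append(model.next_token(tokens))
    return tokens[len(prompt):]

def diffusion_decode(model, prompt, length, steps=8):
    tokens = ["<mask>"] * length     # start from a fully masked sequence
    for step in range(steps):        # a few parallel refinement passes
        # Each pass re-predicts every position at once, conditioned on the
        # prompt and the current partially denoised sequence.
        tokens = model.denoise(prompt, tokens, step)
    return tokens                    # speedup comes from steps << length
```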
Resources
- How Hardware Limitations Have, and Will, Prevent Rapid AI Takeoffs. Key algorithmic advances in LLMs — such as transformers, multi-query attention, and mixture-of-experts — only yield major benefits (10–50x performance gains) when paired with massive compute resources. This reality challenges expectations of fast AI self-improvement, as hardware constraints like export controls, energy limits, and cooling infrastructure pose significant barriers to any rapid “intelligence explosion.”
- Open Source Alternative to Google’s New AI Algorithm-Discovering Agent. OpenAlpha_Evolve is an open-source Python framework inspired by the recently released technical paper for DeepMind’s AlphaEvolve.
- Parallel Scaling for LLMs. ParScale has introduced a third LLM scaling paradigm by leveraging parallel computation at both training and inference time.
- Spoken Dialogue Evaluation. WavReward is an audio-language model evaluator designed to assess spoken dialogue systems based on cognitive and emotional metrics. It is trained on ChatReward-30K, a dataset of diverse audio interactions labeled with user preferences.
- Generative AI Adoption Index. Businesses are focusing more on generative AI than on security budgets for 2025. They’re appointing new leaders such as Chief AI Officers and actively pursuing AI talent through hiring and internal training. A common approach involves blending ready-made AI models with customized tools built on their own data.
- Stability AI and Arm Release Low Latency Audio Model for On-Device Audio Generation. Stability AI is open-sourcing Stable Audio Open Small, a 341 million parameter text-to-audio model optimized to run on Arm CPUs.
- Jensen Huang on Global AI Strategy and Chip Controls. Nvidia CEO Jensen Huang claims U.S. chip export limits are counterproductive, pushing China to develop rival AI systems and costing American firms significant income. He noted Nvidia had to write off $5.5 billion in inventory and lost $15 billion in potential sales to China. Huang expects AI to move beyond IT into areas like manufacturing and operations, forming a much bigger market where businesses might spend “$100,000 a year” on AI workers to fill labor gaps.
- How far can reasoning models scale? OpenAI’s o3 reasoning model has advanced quickly but may soon hit scaling limits. With overall training compute growing roughly fourfold per year, the rapid early gains from scaling reasoning training are likely to converge to that industry-wide pace. While challenges around data availability and generalization remain, researchers are still hopeful about continued progress in reasoning performance.
- Meet China’s Frontier AI Labs. China’s AI landscape features five key players. Alibaba leads in open source; ByteDance, like Meta, deploys multimodal tech across its apps; Stepfun, backed by Shanghai, specializes in multimodal integration; Zhipu, a Tsinghua spin-off, focuses on intelligent agents; and DeepSeek stands out for research, particularly innovative architecture optimization.
- ShieldGemma 2. ShieldGemma 2, based on Gemma 3, is DeepMind’s open-source content moderation model with 4 billion parameters, created to serve as an input filter for vision-language models or an output filter for image generation tools.
- Fine-Tuning Qwen2.5B for Reasoning. This repository fine-tunes the Qwen2.5B model for reasoning tasks using a cost-effective SFT + GRPO pipeline inspired by DeepSeek R1 and optimized for AWS. A minimal sketch of the GRPO stage appears after this list.
- Microsoft and Hugging Face expand collaboration to make open models easy to use on Azure. Microsoft and Hugging Face expanded their partnership to integrate over 10,000 Hugging Face models into Azure AI Foundry.
- Poe Report Shows Rapid Shifts in AI Model Market Share. A report from Quora’s Poe platform shows major changes in AI model usage between January and May 2025. OpenAI’s GPT-4.1 and Google’s Gemini 2.5 Pro saw rapid growth, while usage of Anthropic’s Claude models dropped. GPT-4.1 leads in general text, Gemini 2.5 Pro tops reasoning, Google’s Imagen3 dominates image generation, and video creation remains competitive, with Runway in the lead.
- Relational Foundation Model for Enterprise Data. KumoRFM is a pre-trained relational foundation model designed to work across any database and predictive task without task-specific training.
- The Definitive Overview of Reinforcement Learning. Kevin Murphy, a highly cited researcher at Google, has released an updated version of his 200-page reinforcement learning textbook, covering topics from classic methods to the latest advances such as DPO, GRPO, and reasoning.
- ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. François Chollet and the ARC Prize team have launched ARC-AGI-2, a much tougher version of their abstract reasoning benchmark. Early results show top AI models performing poorly — o3 scored only 3%, down from 53% on the original — while humans averaged 75% accuracy. The ARC Prize 2025 offers $1 million in awards, with a $700,000 grand prize for the first team to reach 85% accuracy.
- DeepSeek-V3 Training Insights. DeepSeek researchers have presented DeepSeek-V3 as an example of hardware-model co-design, tackling LLM scaling challenges with techniques like Multi-head Latent Attention, Mixture of Experts, FP8 training, and a Multi-Plane Network Topology to boost GPU efficiency and reduce communication costs.
- Production-ready MCP integration for any AI application. Klavis AI streamlines integration with production-grade MCP servers, providing reliable connections, built-in authentication, and support for multiple clients. It works with custom MCP servers and over 100 tool integrations to enhance AI app scalability. Hosted options allow users to create MCP instances and configure OAuth for smooth deployment.
- AI-generated literature reviews threaten scientific progress. Although artificial intelligence (AI) tools such as OpenAI’s ‘deep research’ offer researchers the possibility of compiling literature reviews at unprecedented speed, they could undermine scientific progress.
- Mistral’s Agentic LLM for Software Engineering. Mistral AI and All Hands AI have introduced Devstral, a new open-source LLM optimized for software engineering.
- Minimal MCP + A2A Example. A toy example demonstrating the basics of the Model Context Protocol (MCP) and the Agent-to-Agent (A2A) protocol with simple ping checks.
- Building an agentic image generator that improves itself. Large language models show strong reasoning abilities when describing visual flaws in natural language but fall short in translating those insights into exact pixel-level edits. They perform well when tasks are narrowly defined, but their effectiveness drops when required to juggle abstract aesthetic choices with precise visual adjustments. This highlights a gap in connecting symbolic reasoning with spatial grounding, particularly in tasks that require detailed, step-by-step image modifications.
- LLM function calls don’t scale; code orchestration is simpler, more effective. Providing large language models with complete tool outputs is expensive and inefficient. Output schemas let developers retrieve structured data for easier processing. Using code execution to handle data from MCP tools helps scale AI capabilities, but granting the execution environment access to MCPs, tools, and user data demands careful planning around API key management and tool exposure. A short sketch of the pattern appears after this list.
- LLM-based Agentic Development. A practical framework for building LLM-based agentic systems, covering evaluation-centric development.
- How I used o3 to find CVE-2025-37899, a remote zeroday vulnerability in the Linux kernel’s SMB implementation. This post describes how a security researcher found a zeroday vulnerability in the Linux kernel with the help of OpenAI’s o3 model. The researcher used the o3 API directly, without any additional scaffolding, agentic frameworks, or tools. Large language models have advanced considerably in their code-reasoning abilities; vulnerability researchers should take note, as this technology can greatly boost their efficiency and effectiveness.
- Quantizing Diffusion Models. Quantization techniques in Hugging Face Diffusers shrink model size without large performance drops, making diffusion models more efficient and accessible. A hedged loading example appears after this list.
- Emerging Properties in Unified Multimodal Pretraining. ByteDance has introduced BAGEL, a new open-source multimodal foundation model designed for native multi-modal understanding and generation. BAGEL surpasses other open-source unified models, offering advanced capabilities like image editing, 3D manipulation, and world navigation.
- Notte Labs Web Agent Framework. Notte is an open-source framework for building AI agents that can browse and interact with websites. Its key feature is a “perception layer” that translates web pages into structured natural language descriptions.
- Google I/O 2025 AI Recap Podcast. Google’s latest Release Notes podcast highlights AI announcements from I/O 2025, including Gemini 2.5 Pro Deep Think, Veo 3, and developer tools like Jules.
- AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale. A new 32B language model, trained on public data, matches or surpasses much larger MoE models in reasoning tasks, achieving 85.3 on AIME 2024 and 70.3 on LiveCodeBench. It uses a two-stage post-training pipeline (SFT and RL) with high-quality data filtering and a custom rollout framework for efficient, scalable inference. This approach shows that a well-designed training process can unlock top-tier performance at mid-scale sizes.
- HealthBench: Evaluating Large Language Models Towards Improved Human Health. HealthBench is a benchmark of 5,000 multi-turn health conversations scored against 48,562 physician-written criteria from doctors in 60 countries, enabling realistic, open-ended LLM evaluation. It shows rapid frontier-model gains: GPT-3.5 Turbo scores 16%, GPT-4o 32%, and o3 60%, and smaller models such as GPT-4.1 nano now outperform older larger ones. Physicians often can’t improve model completions, and models like GPT-4.1 grade reliably. Yet safety gaps remain, with “worst-at-k” scores exposing reliability challenges.
- Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning. Tool-N1 is a family of tool-using LLMs trained via rule-based RL, using binary feedback to reward correct, functional tool calls instead of step-by-step supervision. Tool-N1-7B and 14B outperform GPT-4o and others on benchmarks like BFCL and API-Bank. Pure RL training beats SFT-then-RL pipelines, and strict binary rewards improve generalization over partial-credit schemes. Tool-N1’s approach scales well and generalizes across model architectures. A minimal sketch of the binary reward appears after this list.
- Cost-Effective, Low Latency Vector Search with Azure Cosmos DB. Azure Cosmos DB integrates DiskANN for fast, scalable vector search within operational datasets. Each partition holds a single vector index inside the existing index trees, enabling <20ms query latency over 10 million vectors with stable recall during updates. It outperforms Zilliz and Pinecone with 15× and 41× lower query costs and can scale to billions of vectors via automatic partitioning. A hedged query sketch appears after this list.
- AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges. This review paper defines AI Agents as modular, task-specific systems using LLMs and tools, and Agentic AI as a shift toward multi-agent collaboration, dynamic task decomposition, and orchestrated autonomy. It compares architectures, capabilities, and challenges of both, outlines applications, and suggests solutions like RAG, orchestration layers, and causal modeling for future AI systems.
- CellVerse: Do Large Language Models Really Understand Cell Biology? This paper introduces a benchmark to test LLMs on single-cell biology tasks by translating multi-omics data into natural language. Despite some reasoning ability, models like DeepSeek and GPT-4 perform no better than random guessing on key tasks like drug response prediction, revealing major gaps in biological understanding.
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models. A new survey shows that while pre-training builds a model’s foundation, it’s post-training that shapes true capability. By analyzing fine-tuning, RL, and test-time scaling, the paper highlights how post-training improves reasoning, accuracy, and alignment, addressing challenges like forgetting and reward hacking. The work emphasizes post-training’s central role in unlocking high-performance, aligned models.
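Picking up the Qwen fine-tuning entry from this list: a minimal sketch of the GRPO stage of such a pipeline using Hugging Face TRL's GRPOTrainer. The dataset, reward function, model ID, and hyperparameters are toy stand-ins, not the repository's actual settings; check TRL's docs for current argument names.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy GRPO stage in the spirit of an SFT + GRPO pipeline. The reward below
# just checks whether a completion contains "42", a stand-in for a real
# verifiable-answer check. Model ID and settings are illustrative.

train_dataset = Dataset.from_list(
    [{"prompt": "What is 6 times 7? Think step by step."}] * 64
)

def correctness_reward(completions, **kwargs):
    # One scalar reward per sampled completion; GRPO normalizes rewards
    # within each group of generations for the same prompt.
    return [1.0 if "42" in c else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # illustrative base model
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="qwen-grpo-demo", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```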
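For the function-calling entry: a small sketch of the code-orchestration pattern it advocates. A large tool output stays in the execution environment, gets reduced in code, and only a compact summary enters the model's context. The fetch_tool_output helper is a hypothetical stand-in for an MCP tool call.

```python
import json

# Instead of pasting a huge tool response into the model's context, keep it
# in the execution environment, reduce it in code, and hand the model only
# what it needs. `fetch_tool_output` is a hypothetical MCP tool stand-in.

def fetch_tool_output() -> list[dict]:
    """Hypothetical MCP tool call returning thousands of records."""
    return [{"id": i, "status": "error" if i % 50 == 0 else "ok",
             "latency_ms": i % 300} for i in range(10_000)]

records = fetch_tool_output()

# Reduce 10,000 records to a few numbers the LLM can actually reason over.
errors = [r for r in records if r["status"] == "error"]
summary = {
    "total_records": len(records),
    "error_count": len(errors),
    "max_latency_ms": max(r["latency_ms"] for r in records),
    "sample_error_ids": [r["id"] for r in errors[:5]],
}

prompt_context = json.dumps(summary)  # a few hundred bytes, not megabytes
print(prompt_context)
```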
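For the diffusion-quantization entry: a sketch of 4-bit NF4 loading with Diffusers' bitsandbytes integration, following the documented pattern. The model choice is illustrative, and the API surface moves quickly, so verify the names against the current Diffusers docs.

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Sketch of 4-bit (NF4) quantization via Diffusers' bitsandbytes backend.
model_id = "black-forest-labs/FLUX.1-dev"  # illustrative model choice

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize only the large transformer; keep the rest of the pipeline in bf16.
transformer = FluxTransformer2DModel.from_pretrained(
    model_id, subfolder="transformer",
    quantization_config=quant_config, torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # fit the pipeline on a single GPU

image = pipe("a lighthouse at dusk, watercolor", num_inference_steps=28).images[0]
image.save("lighthouse.png")
```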
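For the Tool-N1 entry, the core idea in miniature: a tool call earns reward only when the function name and every argument match the reference exactly, with no partial credit. The dict representation of a call is my own choice, not the paper's format.

```python
# Minimal sketch of a Tool-N1-style binary reward: a predicted tool call
# earns credit only when the function name AND all arguments match the
# reference exactly. No partial credit for near-misses.

def binary_tool_reward(predicted: dict, reference: dict) -> float:
    name_ok = predicted.get("name") == reference.get("name")
    args_ok = predicted.get("arguments") == reference.get("arguments")
    return 1.0 if (name_ok and args_ok) else 0.0

# Example: a near-miss (wrong unit) scores 0.0, not 0.5.
ref  = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}
pred = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "F"}}
assert binary_tool_reward(pred, ref) == 0.0
assert binary_tool_reward(ref, ref) == 1.0
```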
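Finally, for the Cosmos DB entry: a hedged sketch of a DiskANN-backed vector query through the Python SDK, assuming the container was created with a vector embedding policy on /embedding and a diskANN vector index. The endpoint, key, database and container names, and the tiny query vector are placeholders.

```python
from azure.cosmos import CosmosClient

# Sketch of a vector query in Azure Cosmos DB (NoSQL API), assuming a
# container configured with a vector embedding policy on /embedding and a
# diskANN vector index. All credentials and names below are placeholders.
client = CosmosClient(url="https://<account>.documents.azure.com",
                      credential="<key>")
container = client.get_database_client("demo").get_container_client("docs")

query_vector = [0.1] * 8  # placeholder embedding; real ones have many dims

results = container.query_items(
    query="""
        SELECT TOP 5 c.id, VectorDistance(c.embedding, @q) AS score
        FROM c
        ORDER BY VectorDistance(c.embedding, @q)
    """,
    parameters=[{"name": "@q", "value": query_vector}],
    enable_cross_partition_query=True,
)
for item in results:
    print(item["id"], item["score"])
```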
Perspectives
- Superhuman Coders in AI 2027. Superhuman coding by AI is now expected around 2033, later than AI Futures’ earlier projections of 2028–2030. The delay stems from challenges like managing engineering complexity, operating without feedback loops, and meeting cost and speed requirements. Additional setbacks, such as geopolitical tensions or shifting priorities at leading labs, could extend the timeline even further.
- There should be no AI button. The “AI button” design pattern is restrictive and draws unneeded lines between AI-supported and manual tasks. More effective options, such as embedding AI as a “shadow teammate” in workflows, improve collaboration while keeping the user experience unified.
- AI linked to explosion of low-quality biomedical research papers. Analysis flags hundreds of studies that seem to follow a template, reporting correlations between complex health conditions and single variables based on publicly available data sets.
- Are groundbreaking science discoveries becoming harder to find? Researchers are arguing over whether ‘disruptive’ or ‘novel’ science is waning — and how to remedy the problem.
- The path for AI in poor nations does not need to be paved with billions. Researchers in low- and middle-income countries show that home-grown artificial-intelligence technologies can be developed, even without large external investments.
- ‘AI models are capable of novel research’: OpenAI’s chief scientist on what to expect. Jakub Pachocki, who leads the firm’s development of advanced models, is excited to release an open version to researchers.
- Data resources must be protected from political interference. In April, the US National Institutes of Health (NIH) prohibited researchers in “countries of concern”, such as China, Russia and Iran, from using its controlled-access data repositories and associated data.
- AI bots threaten online scientific infrastructure. In April, Wikipedia reported on its battles with artificial intelligence (AI) bot crawlers.
- The SignalFire State of Talent Report. Tech hiring for recent grads has dropped over 50% from pre-pandemic levels, as AI tools take over many entry-level roles, though demand for experienced engineers remains strong. Anthropic has become the frontrunner in the AI talent race, keeping 80% of its staff and actively recruiting from rivals. Engineers are now eight times more likely to leave OpenAI or DeepMind for Anthropic than the other way around.
- My Prompt, My Reality. AI products rely significantly on user prompts, unlike traditional software that delivers consistent results. Outcomes can vary based on subtle intent and context, even with skilled prompting. Product teams can enhance performance by refining prompts and using follow-up questions to better steer users.
- Stargate and the AI Industrial Revolution. AI isn’t just a clever software layer atop the internet stack; it is the foundation of a new Industrial Revolution — Stargate isn’t a data center, it’s a factory for cognition.
Meme of the week
What do you think about it? Did any of this week’s news catch your attention? Let me know in the comments.
If you have found this interesting:
You can look for my other articles, and you can also connect or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.