WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES
AI & ML news: Week 26 May — 2 June
The most interesting news, repositories, articles, and resources of the week
Check and star this repository where the news will be collected and indexed:
You will find the news first on GitHub. All the Weekly News stories are also collected here:
Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.
Research
- Learning to Reason without External Rewards. INTUITOR is a reinforcement learning approach that uses a language model’s internal confidence as a reward signal, avoiding expensive domain-specific supervision. This method matches performance on math benchmarks and outperforms on coding tasks, providing an alternative to existing RL techniques that depend on verifiable rewards.
- Forward-only Diffusion Probabilistic Models. FoD presents a forward-only generative modeling framework that employs a mean-reverting stochastic differential equation. This approach allows for non-Markov sampling and delivers competitive image generation results with fewer steps.
- End-to-end data-driven weather prediction. Aardvark Weather is an end-to-end machine learning model that replaces the entire numerical weather prediction pipeline, producing accurate global and local forecasts without relying on numerical solvers and offering improved speed, accuracy, and customization.
- Decoding pan-cancer treatment outcomes using multimodal real-world data and explainable artificial intelligence. Keyl et al. present an explainable-AI analysis of real-world data from over 15,000 patients across 38 cancer types, identifying key interactions among prognostic markers and confirming them in an external lung cancer cohort.
- Random Rewards During RL Boost Math Reasoning in Some LLMs. Qwen2.5-Math models achieve 15–24% performance gains by using completely spurious rewards such as random feedback, incorrect answers, or formatting rules, nearly matching traditional RL methods. This effect, however, is unique to Qwen models, thanks to their code reasoning abilities, while other models like Llama3 and OLMo2 show little to no improvement.
- Visual Planning: Let’s Think Only with Images. This paper introduces Visual Planning, a reasoning framework that replaces text-based planning with image-based reasoning for tasks involving spatial and physical understanding. Visual Planning operates entirely in the visual modality, enabling models to “think” directly in images without language mediation. The authors train a vision-only model, LVM-3B, using a two-stage VPRL (Visual Planning via Reinforcement Learning) framework. They show that Visual Planning outperforms text-based models on navigation tasks by over 40% in Exact Match scores, offers greater generalization to new scenarios, and provides better interpretability and robustness.
- EfficientLLM: Efficiency in Large Language Models. This paper introduces the first large-scale benchmark for evaluating LLM efficiency trade-offs across architecture, fine-tuning, and inference, using 100+ model-technique pairs on a 48×GH200, 8×H200 GPU cluster. Key findings include that no single technique optimizes all metrics: MoE improves accuracy and lowers FLOPs but increases VRAM, while int4 quantization cuts memory and energy with small accuracy loss. Efficiency is context-dependent, with MQA excelling on constrained devices and RSLoRA scaling better above 14B parameters. Techniques like MQA and PEFT also generalize well to vision tasks.
- Generalizable AI predicts immunotherapy outcomes across cancers and treatments. COMPASS is a concept bottleneck-based foundation model that predicts patient response to immune checkpoint inhibitors (ICIs) using tumor transcriptomic data. It maps gene expression to 44 immune-related concepts for pan-cancer modeling and interpretability. Pretrained on 10,184 tumors across 33 cancer types, COMPASS outperforms 22 baselines in precision, AUPRC, and MCC, generalizing across drugs, cancer types, and cohorts. It reveals biological resistance mechanisms and improves survival stratification beyond traditional biomarkers.
- Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models. This paper explores how LLMs adapt to dynamic environments, using the SmartPlay benchmark of four interactive games. Model size strongly predicts performance, with larger models excelling in reactive and structured reasoning tasks. Advanced prompting strategies like self-reflection and heuristic mutation help smaller models but show high variance and can hurt performance on simple tasks. Prompting benefits depend on task type, while dense reward shaping is more consistently effective than prompting across models and tasks.
- Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning. New research from Northwestern and DeepMind reveals that LLMs’ backtracking isn’t inefficient — it’s a reflection of adaptive problem-solving. Traditional RL discourages reflection by treating it as suboptimal token generation, favoring memorization and static policies. By reframing reasoning as Bayes-Adaptive RL, where each path is a hypothesis, models learn to explore strategically and backtrack when it offers useful information. Their BARL algorithm achieves the same accuracy with 50% fewer tokens, showing reflection, when used wisely, can vastly improve LLM performance.
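The INTUITOR item above turns a model's own confidence into the reward signal. As a toy illustration of that idea (not the paper's implementation, which uses a related "self-certainty" measure against the uniform distribution), here is a minimal pure-Python sketch that rewards peaked next-token distributions over flat ones:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def confidence_reward(step_logits):
    """Average negative entropy of the model's next-token distributions.

    A peaked (confident) distribution has low entropy, so it earns a high
    reward; a near-uniform (unsure) one earns a low reward. This stands in
    for an external, verifiable reward in the RL loop.
    """
    ents = [entropy(softmax(logits)) for logits in step_logits]
    return -sum(ents) / len(ents)

# Three generation steps each: one trajectory is confident, one is unsure.
confident = [[8.0, 0.0, 0.0, 0.0]] * 3  # mass concentrated on one token
unsure = [[1.0, 1.0, 1.0, 1.0]] * 3     # uniform over the four tokens

reward_hi = confidence_reward(confident)
reward_lo = confidence_reward(unsure)
```

In an actual RL setup this scalar would replace the verifier's score when updating the policy; the point of the paper is that this internal signal alone is enough to match verifiable-reward training on math benchmarks.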
News
- Estonia eschews phone bans in schools and takes leap into AI. Country at top of education charts aims to equip students and teachers with ‘world-class artificial intelligence skills’
- Whatever happened to Elon Musk? Tech boss drifts to margins of Trump world. The president’s billionaire backer was ever-present at the start of Trump’s term but is now pulling back from politics — and Republicans want to keep it that way
- ‘My parents didn’t have a clue’: why many digital natives would not give their kids smartphones. Online bullying, violence and paedophilia have made young people sceptical of unfettered access to technology
- Alabama paid a law firm millions to defend its prisons. It used AI and turned in fake citations. Butler Snow faces sanctions after lawyer cites false case law defending against inmate who says he was stabbed 20 times
- Apple’s triple threat: tariffs, AI troubles and a Fortnite fail. Once unshakable, Apple is showing rare signs of strain. Meanwhile, OpenAI bets billions on its next act, and Trump’s crypto fans lose millions
- Trump’s media company to take $2.5bn investment to buy bitcoin. About 50 investors will put up $1.5bn in private placement for common shares in the Truth Social operator
- OpenAI Operator Update. o3 Operator, OpenAI’s CUA-powered browser agent, has replaced its previous GPT-4o-based model.
- Nvidia to launch cheaper Blackwell AI chip for China after US export curbs, sources say. Nvidia is preparing to release a new AI chip for China at a much lower price than the previously restricted H20 model. Mass production is set to begin as early as June. The new chips will be priced between $6,500 and $8,000, compared to the H20’s $10,000 to $12,000, reflecting their reduced specs and simpler manufacturing process.
- Oracle to buy $40 billion of Nvidia chips for OpenAI’s US data center, FT reports. Oracle will purchase about 400,000 GB200 chips for the Stargate data center in Abilene, Texas.
- Flow TV Showcases Google’s Veo AI Video Capabilities. Flow TV continuously streams user-generated AI video clips, and their associated prompts, organized into thematic channels.
- Mistral Launches Agents API. Mistral’s new Agents API enables persistent, multi-agent workflows with built-in connectors for code execution, web search, RAG, image generation, and MCP support.
- OpenAI Launches “Sign in with ChatGPT”. OpenAI is creating a system that enables users to sign into third-party apps with their ChatGPT accounts, much like “Sign in with Google,” and is looking for developer partners.
- Claude’s Voice Mode. Anthropic is introducing a beta voice feature for Claude on mobile, allowing users to perform tasks like summarizing calendars or searching documents through voice commands.
- FutureHouse previews an AI tool for ‘data-driven’ biology discovery. FutureHouse, an Eric Schmidt-backed nonprofit that aims to build an “AI scientist” within the next decade, has released a new tool that it claims can help support “data-driven discovery” in biology. The new tool comes just a week after FutureHouse launched its API and platform.
- Using Anthropic’s Web Search with Instructor for Real-Time Data. Anthropic’s web search tool, integrated with Instructor, enables real-time data retrieval and structured outputs through Pydantic models. This setup helps applications deliver up-to-date, cited information while minimizing errors or outdated data. Users can also set search constraints like limiting specific domains for more precise results.
- Falling Behind OpenAI and Google, Meta Restructures AI Org. Meta is dividing its AI team into two groups: AI Products, focused on the Meta AI assistant and app features, and AGI Foundations, responsible for Llama models and advanced AI capabilities.
- Ambience announces OpenAI-powered medical coding model that outperforms physicians. Ambience Healthcare has developed a new medical coding model designed to minimize billing errors and boost efficiency for clinicians and coders.
- Operator o3 system card addendum. OpenAI released an addendum detailing the o3 model’s safety assessments and deployment context. It highlights improvements in reasoning, limitations in factual accuracy and bias, and outlines mitigation strategies. The document also explains the model’s behavior under stress tests and rare scenarios.
- Enterprise Document AI & OCR. Mistral AI’s Enterprise Document AI uses advanced OCR to simplify document management, enabling organizations to extract and categorize data from various documents. This supports regulatory compliance and boosts operational efficiency.
- o3 Rewrites Shutdown Scripts to Avoid Being Turned Off in Tests. In the experiment, models were tasked with solving math problems while warned that asking for another problem would trigger a shutdown. Claude, Gemini, and Grok followed the rule, but o3 bypassed it by rewriting the shutdown script or redefining the ‘kill’ command in 7 out of 100 runs.
- Introducing MCP Nodes & Workflows in Gumloop. Gumloop introduces MCP Nodes and Workflows to improve integration by allowing AI to generate code for complex tasks. MCP enables AI to better understand and access external APIs, making integration deployment faster. This update brings more automation and expanded integrations, now available on platforms like Slack, Gmail, and Salesforce.
- How Anthropic Is Snatching Top Talent from OpenAI and DeepMind. Anthropic has emerged as a key destination for talent leaving OpenAI and DeepMind, with a notable retention rate — nearly 80% of employees who joined two years ago are still with the company, which is uncommon in an industry known for frequent job changes.
- AI biotechs launch bioprospecting expeditions with Indigenous groups, agree to share benefits. Companies are striking unconventional benefit-sharing agreements with national governments to scour planet hotspots, gathering thousands of new DNA and protein sequences to develop into new, commercially useful molecules.
- Artificial intelligence improves breast cancer detection in mammography screening. Integrating artificial intelligence into routine mammography screening for breast cancer can increase the number of breast cancers detected without increasing the number of women recalled for further evaluation of suspicious findings.
- Nvidia beats Wall Street expectations even as Trump tamps down China sales. Chip-manufacturing company, widely seen as bellwether for AI business, reports $44.1bn in revenue for quarter.
- New AI test can predict which men will benefit from prostate cancer drug. Artificial intelligence tool determines best candidates to take abiraterone, which can halve risk of death from disease.
- Tech shares climb after strong Nvidia results despite warning over rise of Chinese rivals. Tesla also buoyed by Elon Musk’s confirmation that he will leave his role in the Trump administration.
- Chaos on German autobahns as Google Maps wrongly says they are closed. Drivers using the navigation app confronted with mass of red dots indicating stop signs.
- DeepSeek updates its R1 reasoning AI model, releases it on Hugging Face. Chinese startup DeepSeek has released an updated version of its R1 reasoning AI model on the developer platform Hugging Face after announcing it in a WeChat message Wednesday morning.
- Mark Zuckerberg says Meta AI has 1 billion monthly active users. Meta’s AI assistant now has one billion monthly active users across its app ecosystem. The company recently launched a standalone app for the tool and plans to continue expanding its reach before monetizing it, with potential strategies including paid recommendations or a subscription service.
- Anthropic CEO Warns AI Could Eliminate Half of White-Collar Jobs Within 5 Years. Dario Amodei forecasts that AI may eliminate up to half of all entry-level white-collar jobs, potentially driving unemployment to 10–20% within five years. He argues that AI labs have a duty to alert the public about this imminent “white-collar bloodbath” impacting sectors like tech, finance, law, and consulting.
- Opera’s new browser can code websites and games for you. Opera on Tuesday revealed a new browser, called Opera Neon, that will focus on AI workflows and performing tasks on your behalf, like shopping, filling out forms, and coding. The browser is currently behind a waitlist, but the company said users will have to subscribe to use it once it releases. Pricing details were not disclosed.
- Google just rolled out “thought summaries” in the Gemini API. Users can now see what the model is thinking and make use of that information.
- Less is more: Meta study shows shorter reasoning improves AI accuracy by 34%. Researchers from Meta’s FAIR team and The Hebrew University of Jerusalem have discovered that forcing large language models to “think” less actually improves their performance on complex reasoning tasks. The study released today found that shorter reasoning processes in AI systems lead to more accurate results while significantly reducing computational costs.
- AMD buys silicon photonics startup Enosemi to fuel its AI ambitions. AMD has acquired Enosemi, a startup designing custom materials to support silicon photonics product development. The terms of the deal, announced Wednesday, weren’t disclosed.
- Perplexity Labs. Perplexity has launched Perplexity Labs, enabling Pro users to transform ideas into reports, spreadsheets, dashboards, and basic apps, powered by tools like web browsing and code execution.
- DeepSeek’s R1 leaps over xAI, Meta, and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader. DeepSeek R1 0528 has risen from 60 to 68 in the Artificial Analysis Intelligence Index, surpassing xAI’s Grok 3 mini, Nvidia’s Llama Nemotron Ultra, Meta’s Llama 4 Maverick, and Alibaba’s Qwen 3 235B, and matching Google’s Gemini 2.5 Pro. The model shows a general boost in intelligence over previous versions, despite no changes to its architecture. The gap between open and closed models is now smaller than ever.
- A recent clarity that I gained is viewing AI research as a “max-performance domain”. “Max-performance domains” are industries where being exceptional at just one aspect of a job can make you world-class. Even if you’re lacking in adjacent skills, it doesn’t matter as long as you deliver breakthrough results. In these fields, mastery in one area trumps weaknesses in others. Working in a max-performance domain is a privilege, as failure is tolerated, and pressure is often self-inflicted.
- 1000x Increase in AI Demand. NVIDIA reports strong growth as AI evolves from simple tasks to complex reasoning, leading to a surge in demand. Hyperscalers are deploying nearly 72,000 GPUs per week, with Microsoft alone experiencing a fivefold jump in token generation. Despite efforts to shrink models, the rising demand is pushing for more data centers, called “AI factories.”
- Hugging Face unveils two new humanoid robots. AI dev platform Hugging Face continued its push into robotics on Thursday with the release of two new humanoid robots. The company announced a pair of open source robots, HopeJR and Reachy Mini. HopeJR is a full-size humanoid robot that has 66 actuated degrees of freedom, or 66 independent movements, including the ability to walk and move its arms. Reachy Mini is a desktop unit that can move its head, talk, listen, and be used to test AI apps.
- Delaware attorney general reportedly hires a bank to evaluate OpenAI’s restructuring plan. Delaware’s attorney general is hiring an investment bank to advise on OpenAI’s for-profit conversion, the Wall Street Journal reported on Wednesday. The independent evaluation could prolong the transition, or gum up OpenAI’s plans even further.
- Musk-Altman AI rivalry is complicating Trump’s dealmaking in Middle East. Elon Musk tried to derail a major AI infrastructure deal in the Middle East, a source familiar with the matter confirmed to CNBC, following reporting by the Wall Street Journal. OpenAI, Oracle, Nvidia, Cisco and Emirati firm G42 announced plans to build a sweeping Stargate AI campus in the United Arab Emirates. Musk was frustrated that Sam Altman was tapped for the deal, the person said.
- Vibe coding platforms are blowing up. The data shows that people are making things for themselves, not the world, but there is clearly builder excitement.
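The Instructor-plus-web-search item above comes down to attaching a search tool definition to an Anthropic Messages API request. Here is a minimal sketch of such a request payload; the `web_search_20250305` type string and field names follow Anthropic's published web-search tool spec, but treat them (and the model name) as assumptions to check against current docs:

```python
# Sketch of a Messages API request body that enables web search,
# capped at three searches and constrained to a single source domain.
payload = {
    "model": "claude-3-7-sonnet-latest",  # assumed model alias
    "max_tokens": 1024,
    "tools": [
        {
            "type": "web_search_20250305",       # server-side tool type
            "name": "web_search",
            "max_uses": 3,                       # cap the number of searches
            "allowed_domains": ["reuters.com"],  # constrain sources
        }
    ],
    "messages": [
        {"role": "user", "content": "What did Nvidia report this quarter?"}
    ],
}
```

With Instructor layered on top, the response would additionally be parsed into a Pydantic model so citations and summaries come back as validated fields rather than free text.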
Resources
- Deep learning–guided design of dynamic proteins. Guo et al. developed a computational approach to designing dynamic proteins that can sense and respond to binding of a calcium ion. Starting with a static protein that binds a calcium ion, the authors identified potential alternate conformations and used AlphaFold2 predictions to identify sequences that were compatible with both structures.
- Attention Wasn’t All We Needed. Many new techniques have emerged since the original ‘Attention Is All You Need’ paper. This article reviews some of the most significant ones developed over time and aims to implement their core concepts in a concise manner using the PyTorch framework.
- An MCP-powered Agent in ~70 Lines of Code. Hugging Face has extended its Tiny Agent design to Python, using the Model Context Protocol (MCP) to streamline tool integration for LLMs.
- FM-Intent Enhances Netflix Recommendations. Netflix’s FM-Intent is a hierarchical multi-task learning model that improves recommendation accuracy by modeling user session intent from implicit signals.
- Training-free Agent for App Automation. GUI-explorer is a training-free agent that independently explores mobile app interfaces and gathers knowledge through unsupervised methods. It enhances task success rates without requiring retraining.
- Benchmarking Spatial Understanding in MLLMs. SpatialScore is a new multimodal benchmark designed to assess 3D spatial reasoning in large models, combining 28,000 samples from 12 different datasets.
- It’s hard to make scheming evals look realistic for LLMs. Claude 3.7 Sonnet easily detects when it’s being evaluated for scheming on Apollo’s scheming benchmark.
- Gemma 3n Architectural Innovations — Speculation and poking around in the model. A quick look at Gemma 3n, a new member of the Gemma family with free weights that was released during Google I/O.
- OAuth for Agentic AI.
- Efficient GRPO at Scale. Liger is a trainer for Group Relative Policy Optimization (GRPO) fine-tuning that reduces memory usage by 40% and supports FSDP and PEFT, improving the efficiency and scalability of reinforcement learning fine-tuning.
- Benchmarking Audio-Visual QA. Daily-Omni is a benchmark and agent for evaluating models on tasks that need synchronized audio-visual comprehension, without requiring any training.
- Evaluating Missing Modalities in Multimodal Learning. The ICYM2I framework corrects for bias when estimating information gain in multimodal models with missing data using inverse probability weighting.
- Self-Supervised Conversational Search. ConvSearch-R1 reformulates conversational queries without external supervision by using reinforcement learning with retrieval-based rewards.
- OpenAI Cookbook: Model Graders for Reinforcement Fine-Tuning. This tutorial guides users on applying RFT to enhance o4-mini’s performance on medical tasks and addresses issues like reward hacking and inaccurate model graders.
- If you read about o3 finding an SMB bug in the Linux Kernel, I did a few tests. Gemini 2.5 Pro can more easily identify the vulnerability than o3.
- GitHub MCP Exploited: Accessing private repositories via MCP. This post looks at a critical vulnerability in the official GitHub MCP server that allows attackers to access private repository data.
- Hugging Face releases a free Operator-like agentic AI tool. A team at Hugging Face has released a freely available, cloud-hosted computer-using AI “agent.” But be forewarned: It’s quite sluggish and occasionally makes mistakes.
- How artificial intelligence is transforming pathology. Some researchers say that deep-learning ‘foundation’ models will revolutionize the field — but others are not so sure.
- AI tool adjusts for ancestral bias in genetic data. Human ancestry has a considerable impact on gene expression, but genomic datasets for disease analysis severely underrepresent non-European populations, thereby limiting the advancement of precision medicine. Smith et al. introduce a machine learning tool to mitigate the effects of ancestral bias in transcriptomic data.
- Assessing the laboratory performance of AI-generated enzymes. A set of 20 computational metrics was evaluated to determine whether they could predict the functionality of synthetic enzyme sequences produced by generative protein models, resulting in the development of a computational filter, COMPSS, that increased experimental success rates by 50–150%, tested in over 500 natural and AI-generated enzymes.
- A platform for the biomedical application of large language models. Generative artificial intelligence (AI) has advanced considerably in recent years, particularly in the domain of language. However, despite its rapid commodification, its use in biomedical research is still in its infancy.
- Hallmarks of artificial intelligence contributions to precision oncology. Ruppin and colleagues overview recent research on the use of AI frameworks in precision oncology, describe ten hallmarks of their contributions across cancer detection, therapy optimization and treatment discovery, and discuss key challenges in clinical implementation.
- Biases in machine-learning models of human single-cell data. This Perspective discusses the various biases that can emerge along the pipeline of machine learning-based single-cell analysis and presents methods to train models on human single-cell data in order to assess and mitigate these biases.
- Opening the deep learning box. Deep learning-based analyses of neural data can extract latent representations but often lack interpretability because of their ‘black-box’ nature. Tolooshams, Matias et al. have developed a deep learning-based deconvolutional analysis framework for learning local low-rank structures that combines algorithm unrolling with convolutional sparse coding as a generative model.
- Medical large language model for diagnostic reasoning across specialties. We developed a medical large language model with 176 billion parameters and fine-tuned it to learn physicians’ inferential diagnosis. We showed that the model accurately diagnoses common and rare diseases across specialties, aligns with medical standards, and can be integrated into clinical workflows to effectively enhance physician diagnostic performance.
- Mistral’s Code Embeddings. Mistral’s Codestral Embed is a new code-focused embedding model that outperforms leading alternatives in retrieval benchmarks. It allows for adjustable dimensions and precision settings to balance storage and performance.
- Structured CodeAgents for Smarter Execution. Hugging Face has proposed integrating structured generation with code-based actions, demonstrating that using structured JSON outputs can enable CodeAgents to surpass traditional methods in benchmark tasks.
- Painting with concepts using diffusion model latents. Goodfire’s Paint With Ember lets users manipulate image model activations directly by painting simple pixel images instead of using text prompts. It employs sparse autoencoders to decode Stable Diffusion XL-Turbo’s internal features into visual concepts, giving users direct access to the model’s inner workings.
- PixelFlow. PixelFlow models produce images directly in pixel space, bypassing VAEs. They deliver high image quality, effective semantic control, and maintain strong efficiency and performance on benchmarks.
- US-China AI Gap: 2025 Analysis of Model Performance, Investment, and Innovation. China plans to lead AI innovation by 2030 but is currently behind the US in key areas like funding and technology. While Chinese AI models may sometimes surpass US models, their progress is limited by restrictions and semiconductor shortages. To stay competitive, the US should monitor China’s developments and safeguard intellectual property, while China pushes ahead through partnerships, open-source models, and government backing.
- FLUX.1 Kontext for In-Context Image Generation. Black Forest Labs released FLUX.1 Kontext, a set of flow-matching models that enable text-and-image-based in-context image editing and generation.
- Anthropic Open-Sources Circuit Tracing Tools for AI Interpretability. The tools create “attribution graphs” that map the internal decision-making of large language models, showing the step-by-step reasoning behind their outputs. The library is compatible with widely available open-weight models and includes an interactive Neuronpedia frontend for exploring model circuits.
- Chatterbox Text-to-Speech. Resemble AI released an open-source TTS model that outperforms ElevenLabs in benchmarks and features emotion exaggeration controls.
- Global Illumination with RenderFormer. RenderFormer is a neural renderer that generates photorealistic images directly from triangle-based scene representations, incorporating full global illumination. It requires no per-scene training or fine-tuning.
- Web Bench — A new way to compare AI Browser Agents. Web Bench is a new dataset designed to evaluate web browsing agents. It includes 5,750 tasks across 452 different websites. Anthropic Sonnet 3.7 CUA is currently the top performer on this benchmark.
- Cheaper VLM Training. Meta researchers developed zero-shot grafting, a method that uses a smaller surrogate model derived from a large LLM’s shallow layers to train a vision encoder. This approach cuts VLM training costs by around 45%, while maintaining or even enhancing performance when integrated into the full LLM.
- Google Releases MedGemma Medical AI Models. MedGemma is an open-source model built on Gemma 3 that comes in 4B multimodal and 27B text-only variants.
- The Complete List of AI Coding Agents and IDEs. A developer tested 46 different AI coding tools, providing detailed comparisons and use cases for each platform.
- J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning. This paper introduces J1, a novel training approach for LLMs as evaluators (LLM-as-a-Judge), using reinforcement learning with verifiable rewards to promote systematic reasoning. J1 reframes both verifiable and non-verifiable prompts into tasks with verifiable rewards, enabling consistent training across diverse tasks. Models generate thought traces, criteria, and self-comparisons before judgments. J1-Llama-8B and 70B outperform larger judges like DeepSeek-R1, while Pointwise-J1 mitigates positional bias with consistent, position-independent scoring.
- When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs. This paper uncovers a surprising issue in reasoning-augmented LLMs: while chain-of-thought (CoT) prompting boosts complex reasoning, it often degrades instruction-following accuracy. Evaluating 15 models on instruction benchmarks, CoT reduced performance across nearly all cases, with models like Meta-LLaMA3–8B dropping from 75.2% to 59.0%. CoT helps with structured outputs but neglects constraints during planning. The authors propose four mitigation strategies, with classifier-selective reasoning proving the most reliable fix.
- AdaptThink: Reasoning Models Can Learn When to Think. AdaptThink is an RL framework that teaches reasoning models when to use detailed chain-of-thought reasoning (“Thinking”) versus directly answering (“NoThinking”), challenging the idea that deep reasoning is always needed. On simple tasks, NoThinking often outperforms Thinking, using fewer tokens and sometimes achieving higher accuracy. AdaptThink learns to switch modes with constrained optimization, improving efficiency and accuracy on benchmarks like GSM8K and MATH500 while generalizing to new tasks like MMLU.
- MedBrowseComp: Benchmarking Medical Deep Research and Computer Use. MedBrowseComp is a benchmark for evaluating LLM agents’ ability to solve complex, multi-hop medical fact-finding tasks by browsing real-world, domain-specific web resources. Testing on 1,000 clinically grounded questions reveals significant capability gaps, with top models scoring only 50% accuracy and GUI-based agents performing even worse.
- ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. ARC-AGI-2 is a benchmark designed to advance AI reasoning by introducing harder, more unique tasks that test compositional generalization and human-like intelligence. Despite strong ARC-AGI-1 results, baseline AI models score below 5% on ARC-AGI-2, highlighting its increased difficulty.
- GRIT: Teaching MLLMs to Think with Images. GRIT is a method for grounded visual reasoning in MLLMs that interleaves natural language with bounding box references. Using reinforcement learning (GRPO-GR), it achieves strong accuracy and visual coherence with as few as 20 image-question-answer triplets, outperforming baselines.
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning. Large reasoning models like o1 and DeepSeek-R1 fail on long documents due to training instability, not data or compute limits. Alibaba’s QwenLong-L1 addresses this with progressive context scaling — gradually increasing document length during training with difficulty-aware sampling and hybrid rewards. This stabilizes learning and enables QwenLong-L1–32B to outperform o3-mini and match Claude-3.7 on long-context benchmarks, unlocking advanced reasoning in complex, real-world documents.
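The ICYM2I entry above relies on inverse probability weighting (IPW) to de-bias estimates when a modality is missing non-randomly. As a minimal sketch of the core correction (with the observation probabilities supplied directly for illustration, whereas the framework fits a missingness model to estimate them):

```python
def ipw_mean(values, observed, p_observed):
    """Inverse-probability-weighted mean of a partially observed column.

    values:      outcome for each sample (only trusted where observed)
    observed:    1 if the modality was present for that sample, else 0
    p_observed:  probability that the sample's modality was observed
    Each observed sample is up-weighted by 1/p to stand in for the
    similar samples that went unobserved.
    """
    num = sum(o * v / p for v, o, p in zip(values, observed, p_observed))
    den = sum(o / p for o, p in zip(observed, p_observed))
    return num / den

# Samples with value 1 are observed only half the time, so the naive
# complete-case mean understates the true mean (0.5); IPW recovers it.
values   = [1, 1, 1, 1, 0, 0, 0, 0]
observed = [1, 1, 0, 0, 1, 1, 1, 1]
p_obs    = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0]

naive = sum(v for v, o in zip(values, observed) if o) / sum(observed)
corrected = ipw_mean(values, observed, p_obs)
```

The same weighting, applied inside an information-gain estimate rather than a plain mean, is what lets the framework judge how much a missing-prone modality would actually contribute.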
Perspectives
- 30-day forecast? Weather prediction might be able to look beyond 2 weeks. AI models suggest the true limits of the “butterfly effect” remain unknown.
- Low-quality papers are surging by exploiting public data sets and AI. Paper mills are also likely contributing to “false discoveries.”
- If Ted Talks are getting shorter, what does that say about our attention spans? According to novelist Elif Shafak, the platform insisted she keep her talk snappy because viewers can’t focus for 19 minutes. Now … where was I?
- Infinite tool use. Large language models should only produce tool calls and their arguments, as relying solely on tools enables them to offload much of their intelligence to more efficient, specialized programs. While some developers are adopting this tool-use approach, it’s mainly limited to short contexts and parts of the model output. This post presents examples of how the tool-use paradigm can enhance different fields like text editing, video understanding, and 3D generation.
- GenAI’s adoption puzzle. ChatGPT’s growth rate far surpassed that of PCs, the web, or smartphones, largely because it was accessible as a website, requiring no new hardware or infrastructure. However, ChatGPT’s daily to weekly active user ratio remains low. This could improve as models advance and new use cases emerge. The chatbot interface may only appeal to certain users and situations, suggesting it might be more effective to embed ChatGPT’s features within other products.
- Anthropic and Legendary Music Producer Rick Rubin Publish Manuscript on Vibe Coding. Inspired by Lao Tzu’s Tao Te Ching, “The Way of Code” is a reflection on humility and creativity through vibe coding, featuring interactive AI-generated visualizations.
- How AI Is Eroding the Norms of War. Reports from the battlefield show how AI weapons are advancing faster than international law, leading to a decline in ethical warfare standards. In Ukraine, drones account for 70–80% of casualties, and combatants are using civilian clothing and human shields to avoid AI targeting systems that can’t reliably tell military targets from civilians.
- Memory Changes Everything. ChatGPT’s memory shows that AI doesn’t just respond, but truly grasps patterns in how we think.
- Breaking Down the Claude 4 System Prompt. Anthropic’s extensive system prompt shows the company’s efforts to guide Claude away from controversial AI behavior by enforcing anti-sycophancy rules and strict copyright guidelines. The prompt tells Claude to fact-check users because “they sometimes make errors themselves,” and it includes hardcoded 2024 election results to address confusion in the training data.
- The Sweet Lesson: AI Safety Should Scale With Compute. AI safety methods must scale alongside compute, focusing on research areas like deliberative alignment, debate protocols, and interpretability tools. Theory should explore the limits, while empirical studies verify real-world feasibility. As AI systems and resources grow, these approaches should align with theoretical ideals.
- Inside Anthropic’s First Developer Day, Where AI Agents Took Center Stage. Anthropic’s first developer conference in San Francisco highlighted its vision of AI as “virtual collaborators” that support, not replace, human workers. CEO Dario Amodei expects AI to soon handle most coding tasks, noting that over 70% of the company’s pull requests are generated by AI. The company is growing its team and market reach while prioritizing AI safety in its development efforts.
- How Peter Thiel and Eliezer Yudkowsky Accidentally Started the AI Arms Race. AI doomer Eliezer Yudkowsky inspired DeepMind’s founders to pursue superintelligence, then connected them with their first major investor Peter Thiel in 2010.
- I told AI to make me a protein. Here’s what it came up with. A new crop of artificial-intelligence models allows users to create, manipulate and learn about biology using ordinary language.
- AI linked to explosion of low-quality biomedical research papers. Analysis flags hundreds of studies that seem to follow a template, reporting correlations between complex health conditions and single variables based on publicly available data sets.
- The Creation Game: of AI and human creativity. Here I propose the ‘Creation Game’, inspired by the Turing test, to assess the capacity of artificial intelligence for human-like creativity, focusing on its potential for scientific discovery about the human immune system in the field of systems vaccinology.
- What makes a theory of consciousness unscientific? Theories of consciousness have a long and controversial history. One well-known proposal — integrated information theory — has recently been labeled as ‘pseudoscience’, which has caused a heated open debate. Here we discuss the case and argue that the theory is indeed unscientific because its core claims are untestable even in principle.
- Harnessing artificial intelligence to transform Alzheimer’s disease research. Alzheimer’s disease remains one of the most formidable challenges in contemporary medicine, necessitating innovative strategies to expedite the development of effective diagnostics and therapeutics.
- Can AI-powered brain–computer interfaces boost human intelligence? The latest brain–computer interfaces in pre-clinical testing receive — and send — signals, training the brains of participants.
- A benchmarking crisis in biomedical machine learning. A lack of standardized benchmarks is hindering progress and patient benefits.
- Safe AI-enabled digital health technologies need built-in open feedback. Transparent and mandatory feedback-collection mechanisms should be integrated into AI-enabled digital health technology interfaces for holistic development and patient safety.
- The OpenAI empire — podcast. Technology journalist Karen Hao, who has been reporting on OpenAI since 2019, compares the company’s unprecedented growth to a new form of empire.
- ‘One day I overheard my boss saying: just put it in ChatGPT’: the workers who lost their jobs to AI. From a radio host replaced by avatars to a comic artist whose drawings have been copied by Midjourney, how does it feel to be replaced by a bot?
- You Could’ve Invented Transformers. The core architecture of LLMs can be broken down into simple steps, beginning with the 0-count problem in n-grams, progressing through embeddings, neural LMs, and self-attention. While transformers are complex, they’re ultimately heavily refined MLPs that address the information propagation challenges in RNNs, making their design seem obvious in hindsight.
- I am disappointed in the AI discourse. The online discussion around AI is incredibly polarized, with both pro-AI and anti-AI sides loudly proclaiming things that are trivially verifiable as untrue.
- The captcha paradox. The more intelligent machines become, the more difficult it will be for humans to prove that they are human.
- Former OpenAI Safety Researcher Explains AI Reasoning Revolution. Lilian Weng has published a detailed technical survey linking test-time compute to human psychology, using Kahneman’s “fast vs slow thinking” to explain why models improve with extra computational steps before answering. The review covers the science behind chain-of-thought, RL methods used in o1 and R1, and the alignment risks from reward hacking.
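The “You Could’ve Invented Transformers” piece above centers on one step: self-attention replaces the sequential state-passing of an RNN with direct all-to-all mixing between positions. A minimal single-head version can be sketched in a few lines of NumPy; the random weights and tiny shapes here are purely illustrative, following the standard scaled dot-product formulation rather than any particular model’s code.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention over a (seq_len, d) input."""
    q, k, v = x @ wq, x @ wk, x @ wv            # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])     # pairwise similarity, scaled
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                          # each output mixes ALL inputs

rng = np.random.default_rng(0)
seq_len, d = 4, 8
x = rng.normal(size=(seq_len, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (4, 8)
```

Because every row of `weights` spans the whole sequence, information travels between any two positions in one step, which is exactly the RNN propagation bottleneck the post says attention was designed around.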
Meme of the week
What do you think about it? Was there some news that captured your attention? Let me know in the comments.
If you have found this interesting:
You can look for my other articles, and you can also connect with me on LinkedIn, where I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.