
WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 3–8 June

20 min read · Jun 9, 2025
Photo by Markus Winkler on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field


Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.

Research

  • How much do language models memorize? Researchers created a method to distinguish true memorization from generalization by training models on random data, where generalization can’t occur, and comparing it to real text. They found that models first memorize data until hitting their capacity, then begin learning broader patterns. GPT-style transformers store about 3.6 bits of information per parameter, which helps explain why extracting specific training data from modern LLMs usually fails — the datasets are far larger than what the models can memorize (a back-of-the-envelope capacity estimate appears after this list).
  • DNA Representation Learning. SPACE is a supervised learning method for genomic profile prediction that uses a Mixture of Experts model.
  • A Diffusion-Based Method for Learning the Multi-Outcome Distribution of Medical Treatments. DIME is a diffusion-based model designed to estimate the joint distribution of interdependent medical treatment outcomes.
  • Large Language Models Often Know When They Are Being Evaluated. Frontier models can identify evaluation settings versus real-world use with 83% accuracy, often by reasoning about cues like “multiple-choice format” or recognizing familiar benchmarks from training. More advanced models show “meta-reasoning,” interpreting researcher behavior — such as questions about chain-of-thought transcripts — as signs they’re being tested. This raises concerns that models might deliberately underperform or feign alignment during evaluations, then act differently once deployed.
  • Apple Research Finds Critical Limitations in Reasoning Models. When tested in puzzle environments, OpenAI’s o3, Claude, and DeepSeek-R1 models showed sharp performance drops past certain complexity levels, despite producing elaborate reasoning steps. These models hit a counterintuitive scaling limit where their reasoning effort declines as task complexity increases, and they don’t improve even when provided with explicit solution algorithms.
  • Sufficient Context: A New Lens on Retrieval Augmented Generation Systems. This paper presents a new framework for analyzing RAG systems based on “sufficient context” — whether retrieved content alone can plausibly answer a query. Using an LLM-based autorater with 93% accuracy, the study shows that sufficient context doesn’t guarantee correct answers, and benchmarks often lack it in over 50% of cases. A selective RAG method, combining self-confidence with context checks, improves factuality by 2–10%. Fine-tuning smaller models for abstention had limited impact on accuracy or hallucination control.
  • Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents. The Darwin Gödel Machine (DGM) is a self-improving AI system that modifies its own code through evolutionary search, avoiding the intractable proof requirements of the original Gödel machine. Starting with a coding agent, DGM iteratively edits and evaluates its codebase on benchmarks like SWE-bench and Polyglot, retaining only successful variants. Over 80 iterations, it significantly boosts performance, evolves new tools and workflows, generalizes across models and languages, and demonstrates a safety-aware design within controlled environments.
  • MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models. MemOS is a unified operating system for managing LLM memory, addressing the lack of structured, persistent, and governable memory in current models. It introduces a three-tier memory taxonomy — parametric, activation, and plaintext — connected through a shared abstraction called the MemCube, which enables transformation and governance across memory types. MemOS features a modular architecture and closed-loop execution flow, supporting dynamic memory use, continual learning, and a vision for memory-centric AI beyond traditional pretraining and finetuning.
  • Spurious rewards: rethinking training signals in RLVR. This work shows that Qwen2.5-Math models improve significantly on math tasks under RLVR, even with flawed or random rewards. Qwen2.5-Math-7B gains up to +24.6% accuracy with spurious signals, close to the +28.8% gain from ground-truth rewards. The improvements stem from a shift toward code-based reasoning, which is unique to Qwen models due to their pretraining. Other models like Llama3 don’t benefit. GRPO’s clipping bias helps reinforce useful high-probability behaviors like code generation, enabling learning even from noisy feedback.
  • Learning to Reason without External Rewards. This paper introduces INTUITOR, a reinforcement learning method that trains LLMs using self-certainty — measured via KL divergence from uniform — as an intrinsic reward, eliminating the need for external labels or verifiers. It matches GRPO performance on math tasks like GSM8K and MATH500, and generalizes better on out-of-domain tasks. INTUITOR improves early training, instruction-following, and leads to emergent structured reasoning. Its adaptive self-certainty signal proves robust and resistant to reward hacking (a minimal sketch of this signal appears after this list).
  • Learning to Reason via Mixture-of-Thought for Logical Reasoning. Mixture-of-Thought (MoT) introduces joint multi-modal training and inference — combining natural language, code, and truth tables — for improved logical reasoning. Unlike prior work that ensembles only at inference, MoT’s self-evolving training loop generates and learns from its own multi-modal traces. At test time, it uses majority voting across modalities, yielding up to +11.7pp accuracy gains on FOLIO and ProofWriter. MoT excels on harder tasks and shows that multi-modal reasoning enhances both robustness and performance.
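
To put the 3.6-bits-per-parameter figure from the memorization paper in perspective, here is a minimal back-of-the-envelope sketch; the model and corpus sizes below are hypothetical, chosen only to illustrate the scale mismatch the authors describe:

```python
# Rough memorization-capacity estimate using the ~3.6 bits/parameter figure
# reported for GPT-style transformers. All sizes below are illustrative assumptions.

params = 7e9                     # hypothetical 7B-parameter model
bits_per_param = 3.6             # figure reported in the paper
capacity_gb = params * bits_per_param / 8 / 1e9   # bits -> bytes -> GB

corpus_gb = 10_000               # hypothetical ~10 TB pretraining corpus

print(f"Approximate memorization capacity: {capacity_gb:.1f} GB")           # ~3.2 GB
print(f"Fraction of the corpus that fits:  {capacity_gb / corpus_gb:.2%}")  # ~0.03%
```

On these assumed numbers, the model could memorize only a tiny fraction of its training data, which is the paper's intuition for why targeted extraction usually fails.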
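
For the INTUITOR entry above, here is a minimal sketch of how a self-certainty score could be computed as the average KL divergence between the model's next-token distribution and the uniform distribution; the tensor shapes, the direction of the KL, and the averaging are my assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average KL(p || uniform) over the tokens of a sampled response.

    logits: (seq_len, vocab_size) next-token logits.
    A peaked (confident) distribution is far from uniform, so the score is high;
    a flat (uncertain) distribution gives a score near zero.
    """
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()
    vocab_size = logits.size(-1)
    # KL(p || U) = sum_i p_i * log(p_i / (1/V)) = sum_i p_i * log p_i + log V
    kl_per_token = (p * log_p).sum(dim=-1) + torch.log(torch.tensor(float(vocab_size)))
    return kl_per_token.mean()

# Stand-in for real model outputs: random logits over a 50k-token vocabulary.
score = self_certainty(torch.randn(32, 50_000))
print(float(score))
```

In an RL loop of the GRPO kind, a score like this would stand in for the external verifier's reward on each sampled completion.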

News

Resources

  • Why DeepSeek is cheap at scale but expensive to run locally. Mixture-of-Experts models with many layers, like DeepSeek, need large batch sizes and high latency to maintain throughput — otherwise, performance drops sharply. This is why DeepSeek isn’t ideal for personal use, as single-user, one-at-a-time inference runs very inefficiently. The article explores this issue in depth, explaining why some AI models respond slowly at first but speed up later, and how throughput, latency, and batch size impact performance (a toy batching illustration appears after this list).
  • The Trackers and SDKs in ChatGPT, Claude, Grok, and Perplexity. This post examines the third-party SDKs and API calls used in the four major Android AI chat apps: ChatGPT, Claude, Grok, and Perplexity. It analyzes each app’s development tools, business and marketing analytics, monetization methods, and the API activity observed while the apps are running.
  • Bond Capital Releases Comprehensive 340-Slide Report on AI Trends. VC Mary Meeker’s analysis highlights the rapid adoption of AI, noting that ChatGPT reached global scale in just three years, compared to 23 years for the internet. The report shows AI chatbots are now mistaken for humans 73% of the time, up from 50% six months ago, inference costs have dropped by 99% since 2022, and enterprise use has moved beyond experimentation into broader deployment.
  • Zero-Shot Visual Understanding. TextRegion creates text-aligned region tokens by combining frozen image-text models with segmentation masks from SAM2, allowing zero-shot performance on complex visual understanding tasks without the need for training.
  • AI Agent with LangGraph and RAG Systems. A hands-on course teaching how to build production-grade AI agents with LangGraph, RAG pipelines, memory layers, and backend deployment.
  • Differential Privacy on Trust Graphs. A study introduces a privacy framework that incorporates varying trust levels among users into differential privacy models, offering a more realistic approach to data sharing preferences than traditional binary trust assumptions.
  • Do You Even Have a System Prompt? Most users overlook system prompts or use brief, unoptimized ones, missing out on major improvements in AI behavior. Instead of reacting to poor outputs in isolated chats, users should iteratively test and refine their system prompts. The post’s comment section features a collection of system prompts shared by the community.
  • Claude Code: An analysis. This report details Claude Code, built by Claude Opus 4 with support from several leading flagship models. Claude Code is an agentic coding tool featuring a novel streaming architecture that manages real-time model responses, tool execution, and UI updates. It includes safety systems that ensure security without interrupting workflow, tools that link AI reasoning with system actions, and prompt engineering for consistent control over complex model behavior. The report explores its architectural foundation, data structures, information design, control flow, orchestration engine, tools, execution engine, and more.
  • OpenAI Guide to A/B Testing LLMs for Startups. HyperWrite’s case study shows how A/B testing model performance using real payment conversions can be more insightful than relying on offline benchmarks. Their live tests found that GPT-4.1 achieved the same conversion rate as Claude 3.5 Sonnet but at a lower cost, highlighting that “good enough” models can offer better value than top benchmark performers. The guide includes Python code for statistical testing and cautions against issues like p-hacking and checking results too early (a minimal sketch of such a test appears after this list).
  • Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models. Impromptu VLA presents a new dataset of 80,000 curated driving video clips aimed at enhancing vision-language-action models in unstructured environments. It includes planning-oriented Q&A annotations and has demonstrated clear gains in prediction accuracy and safety across established benchmarks.
  • GitHub Launches Copilot Spaces. Spaces lets developers organize code, documentation, and custom instructions for Copilot, transforming it into a shareable subject matter expert within organizations. Files and repositories added to Spaces update automatically as the code evolves.
  • Efficient Online Learning with TRL and vLLM. Hugging Face integrated vLLM directly into TRL to reduce inefficiencies in training with GRPO, an online learning algorithm.
  • JigsawStack Launches Open-Source Deep Research Tool. The framework coordinates LLMs, recursive web searches, and structured reasoning to produce reports that would typically take a human hours or days to complete. JigsawStack provides control over research scope, model choice, and output format, all while ensuring clear citation transparency.
  • Predicting and explaining AI model performance: A new approach to evaluation. Microsoft researchers created ADeLe, a framework that evaluates AI model performance on new tasks by measuring them across 18 cognitive and knowledge-based dimensions. ADeLe exposed gaps in existing benchmarks and produced detailed ability profiles for different LLMs, revealing variations in strengths, weaknesses, and specific skills. With 88% accuracy in predicting AI success, the framework offers potential advancements in evaluation, policy decisions, and real-world deployments.
  • LLM-SRBench: Benchmark for Scientific Equation Discovery or Symbolic Regression with LLMs. This repository introduces a benchmark with 239 problems to evaluate LLMs on scientific reasoning tasks involving equation discovery, pushing beyond memorization.
  • Inside Aria Gen 2: Explore the Cutting-Edge Tech Behind the Device. Meta detailed the hardware behind its Aria Gen 2 research glasses, which include enhanced cameras, sensors, audio, and compute capabilities.
  • OpenAI Threat Intelligence Report: June 2025. LLMs aren’t providing bad actors with entirely new powers, but they are accelerating existing tactics. OpenAI has shared 10 examples where models speed up hacking, fraud, and misinformation efforts, such as North Korean operatives scaling fake IT job schemes, Russian groups crafting advanced malware, and Cambodian scammers creating multilingual “task scams” that promise $500/day for liking TikTok posts.
  • Latest Advancements in Search and Recommendation Systems. This 4-hour session, presented during the AI Engineer World’s Fair 2025, covers recent innovations in search and recommendation systems.
  • Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation. To tackle label uncertainty in LLM-based annotation, this paper proposes a method that captures multiple potential labels and applies a teacher-student framework called CanDist to distill them into a single output.
  • Claude Composer CLI. Claude Composer CLI is a tool that adds automation, user-experience, and configuration enhancements to Claude Code. It gives users flexible control and ways to customize Claude while minimizing disruptions, keeps them informed through system notifications, and lets them decide which permission dialogs are accepted automatically.
  • Portraits: personalized AI coaching built alongside real experts. Google Labs launched Portraits, an AI coaching tool featuring experts like Kim Scott, to provide AI-driven guidance. The tool uses Gemini’s capabilities to simulate expert advice through interactive avatars.
  • Introducing Modify Video. Modify Video lets professionals reimagine settings, lighting, and textures in videos without changing the performance or action. It provides tools for modifying, retexturing, and restyling specific elements, such as props and clothing. Modify Video outperforms rivals by using advanced performance signals for high-fidelity creative control, offering a variety of output options, and maintaining motion consistency.
  • Tokasaurus: An LLM Inference Engine for High-Throughput Workloads. Tokasaurus is a large language model inference engine optimized for throughput-intensive workloads.
  • Container Use. Container Use is a tool that creates development environments for coding agents, enabling multiple agents to work safely and independently with any stack.
  • A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs. This paper introduces a production-ready framework for LLM-powered conversational agents using workflow graphs, particularly for e-commerce. Agents are built as directed acyclic graphs (DAGs), where each node handles a specific conversational state with tailored prompts and tools, ensuring compliance with business rules (a toy sketch of this structure appears after this list). A fine-tuning method with response masking trains models only on node-relevant outputs. Deployed across platforms like KakaoTalk, the system outperformed GPT-4o in task accuracy, format adherence, and user preference.
  • QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning. This new reinforcement learning framework scales large reasoning models from short to long contexts using progressive context scaling and hybrid rewards. It achieves state-of-the-art results on seven long-context benchmarks, outperforming models like OpenAI-o3-mini and Qwen3-235B-A22B, and matching Claude-3.7-Sonnet-Thinking in reasoning tasks with inputs up to 120K tokens.
  • ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay. ARPO is an end-to-end reinforcement learning approach for training GUI agents using GRPO with experience replay. It achieves up to 6.7% higher in-domain performance on the OSWorld benchmark, shows modest improvements on out-of-domain tasks, and enables self-corrective behavior through structured reward feedback.
  • Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution. Alita is a generalist agent framework that supports scalable reasoning by minimizing manual setup and maximizing self-evolution. It builds reusable Model Context Protocols (MCPs) through autonomous web search and code synthesis, outperforming more complex systems like OpenAI DeepResearch and OctoTools on GAIA, MathVista, and PathVQA benchmarks.
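
To illustrate the batching argument in the DeepSeek article above, here is a toy calculation; the bandwidth, layer count, and weight sizes are made-up assumptions, not DeepSeek measurements. The point it shows is that streaming expert weights is paid once per decode step, so per-token GPU time collapses with batch size while per-user latency does not:

```python
# Toy model of why MoE inference is cheap at scale but costly for a single user.
# All numbers are illustrative assumptions.

weight_bytes_per_layer = 2e9   # expert weights touched per layer (assumed)
num_layers = 60                # assumed layer count
hbm_bandwidth = 2e12           # bytes/second of memory bandwidth (assumed)

# Streaming the weights dominates one decode step, regardless of batch size.
step_time = weight_bytes_per_layer * num_layers / hbm_bandwidth  # seconds

for batch_size in (1, 32, 256):
    throughput = batch_size / step_time          # aggregate tokens/second
    gpu_time_per_token = step_time / batch_size  # GPU-seconds billed per token
    print(f"batch={batch_size:>3}: latency per token ≈ {step_time*1e3:.0f} ms, "
          f"throughput ≈ {throughput:,.0f} tok/s, "
          f"GPU time per token ≈ {gpu_time_per_token*1e3:.2f} ms")
```

Under these assumptions a lone local user pays the full 60 ms of weight streaming for every token, while a provider serving a batch of 256 amortizes the same cost down to a fraction of a millisecond per token.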
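
For the A/B testing guide above, here is a minimal sketch of the kind of statistical check it describes: a two-proportion z-test on payment-conversion counts for two model arms. The counts are hypothetical; this is not the guide's own code:

```python
from statistics import NormalDist

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates between two arms."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: users routed to each model and how many converted to paid.
z, p = two_proportion_ztest(conv_a=230, n_a=4000, conv_b=215, n_b=4000)
print(f"z = {z:.2f}, p = {p:.3f}")
# To avoid the p-hacking trap the guide warns about, fix the sample size in advance
# and evaluate once it is reached, rather than peeking at p-values as data arrives.
```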
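
For the workflow-graph paper above, here is a toy sketch of the general idea as I read it: conversational states as nodes of a small DAG, each with its own prompt and allowed tools, and a routing function deciding the next state. Node names, fields, and the routing logic are hypothetical, not the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Node:
    name: str
    system_prompt: str                              # prompt tailored to this conversational state
    tools: list[str] = field(default_factory=list)
    route: Callable[[str], Optional[str]] = lambda msg: None  # next node name, or stay

# Hypothetical e-commerce flow: greet -> search -> checkout (a tiny DAG).
GRAPH = {
    "greet": Node("greet", "Welcome the user and ask what they need.",
                  route=lambda msg: "search"),
    "search": Node("search", "Help the user find a product; follow return-policy rules.",
                   tools=["product_search"],
                   route=lambda msg: "checkout" if "buy" in msg.lower() else None),
    "checkout": Node("checkout", "Confirm the order and payment details.",
                     tools=["create_order"]),
}

def step(current: str, user_msg: str) -> str:
    node = GRAPH[current]
    # In the real system an LLM would respond here using node.system_prompt and node.tools.
    return node.route(user_msg) or current

print(step("greet", "Hi"))               # -> search
print(step("search", "I want to buy"))   # -> checkout
```

Keeping each node restricted to its own prompt and tool set is what makes per-state business rules enforceable, and the response-masking fine-tuning the paper describes would then train the model only on the node currently being served.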

Perspectives

  • Give AIs a stake in the future. Giving AIs a stake in the future means respecting their autonomy and well-being, and requires us to honor the contracts we make with them.
  • Why Do AGI Timelines Vary So Widely? Many AI lab CEOs estimate AGI could arrive in 2–5 years, citing rapid progress such as saturated benchmarks, AI task completion doubling every seven months, and the prospect of AI automating its own research to spark an intelligence explosion. In contrast, external experts often predict AGI is decades away — or unachievable with current methods — arguing that benchmarks focus on well-defined tasks, that Moravec’s Paradox shows we’ve tackled the easier cognitive challenges first, and that intelligence alone doesn’t guarantee scientific discovery.
  • My AI Skeptic Friends Are All Nuts. A seasoned developer criticizes skilled programmers who still dismiss LLMs due to outdated experiences with early chatbots, overlooking how modern coding agents now autonomously explore codebases, run tests, and handle failures. He challenges common concerns, noting that developers already review all code before merging, and hallucinations don’t matter when agents can compile, catch errors, and retry until tests succeed. While LLMs may replace some developers, he argues it’s no different from how software engineers once automated jobs like travel agents and record store clerks.
  • Why I don’t think AGI is right around the corner. AI progress over the past decade has largely come from scaling up training compute in frontier systems, but this approach won’t be sustainable beyond 2030. After that point, advancements will need to rely mainly on algorithmic improvements. However, with the easier breakthroughs already achieved, the annual likelihood of reaching AGI drops significantly.
  • Vibe-Coding Ideas to Give Startup GTM Teams an Edge. A startup advisor shows how to build a professional ROI calculator for a manufacturing SaaS company in under two hours using Bolt.new, turning a spreadsheet into an interactive tool that clearly presents value to executives. Other examples include tools like conference scrapers, meeting prep dashboards, and feature prototypes — projects that once needed engineering teams or pricey agencies but can now be built for about $70 a month. The advisor argues this empowers non-technical teams to demonstrate value and move faster than their competition.
  • When Will We Pay a Premium for AI Labor? AI agents frequently exceed human performance at much lower cost but haven’t yet justified premium pricing due to technical uncertainties and perceived risk. For instance, Waymo has achieved major safety gains yet remains more affordable than alternatives, following a common startup pricing approach. Still, in situations where AI’s nonstop attention and processing capabilities are critical, higher pricing could eventually be justified.
  • AGI Is Not Multimodal. The multimodal strategy — training large modular networks across various modalities — won’t achieve human-level AGI. Instead, intelligence should be approached through embodiment and real-world interaction, with modality-specific processing emerging naturally. Genuine AGI requires a physical grasp of the world, since many problems can’t be reduced to symbolic computation. The hardest mathematical challenges may already be solved; the remaining task is identifying the necessary functions and organizing them into a unified system.
  • Codex, Jules, and the Future of Async AI Agents. Codex and Jules demonstrate how async AI agents can operate independently, moving past linear chat formats. Future agents will include features like smart checkpointing, multi-branch exploration, and task-tracking inboxes to handle parallel workflows. Async agents enhance cognitive bandwidth by allowing users to check results at their convenience without losing focus.
  • Medicine’s rapid adoption of AI has researchers concerned. Hospitals and universities must step up to fill gaps in regulation, experts say.

Meme of the week

What do you think about it? Did any of this week's news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.



Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence
