WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 27 January — 2 February

DeepSeek AI’s Competitive Leap, US Tightens AI Chip Export Controls, China’s AI Dominance Threatens US Stocks, ElevenLabs Raises $250M for AI Voice Tech, US Accuses China of Intellectual Property Theft, and much more


Photo by Filip Mishevski on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field


Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.

Research

  • Kimi k1.5: Scaling RL with LLMs. Kimi has unveiled k1.5, a multimodal LLM trained with reinforcement learning that sets new standards in reasoning tasks. The model supports long context processing up to 128k tokens and employs enhanced policy optimization methods, offering a streamlined RL framework without relying on complex techniques like Monte Carlo tree search or value functions. Impressively, k1.5 matches OpenAI’s o1 performance on key benchmarks, scoring 77.5 on AIME and 96.2 on MATH-500. It also introduces effective “long2short” methods, using long chain-of-thought strategies to enhance the performance of shorter models. This approach allows k1.5’s short chain-of-thought version to significantly outperform models like GPT-4o and Claude 3.5 Sonnet, delivering superior results in constrained settings while maintaining efficiency with concise responses.
  • Chain of Agents: Large Language Models Collaborating on Long-Context Tasks. A new framework has been developed for tackling long-context tasks by utilizing multiple LLM agents working collaboratively. Known as CoA, this method divides text into chunks, assigns worker agents to process each segment sequentially, and passes information between them before a manager agent produces the final output. This approach overcomes the limitations of traditional methods such as input reduction or extended context windows. Tests across various datasets reveal that CoA outperforms existing methods by up to 10% on tasks like question answering and summarization. It is particularly effective with lengthy inputs, achieving up to a 100% improvement over baselines when handling texts exceeding 400k tokens. A minimal sketch of the worker/manager pattern appears after this list.
  • LLMs Can Plan Only If We Tell Them. Proposes AoT+, an enhancement to Algorithm-of-Thoughts designed to achieve state-of-the-art results on planning benchmarks; remarkably, it even surpasses human baselines. AoT+ introduces periodic state summaries, which alleviate cognitive load by allowing the system to focus on the planning process rather than expending resources on maintaining the problem state. A sketch of this periodic-summary idea also follows the list.
  • Hallucinations Can Improve Large Language Models in Drug Discovery. The authors claim that LLMs perform better on drug discovery tasks when hallucinated text is added to the input prompts than without it. Llama-3.1-8B shows an 18.35% improvement in ROC-AUC over the baseline without hallucinations. Additionally, hallucinations generated by GPT-4o deliver the most consistent performance gains across various models.
  • Trading Test-Time Compute for Adversarial Robustness. Preliminary evidence suggests that allowing reasoning models like o1-preview and o1-mini more time to “think” during inference can enhance their resistance to adversarial attacks. Tests across tasks such as basic math and image classification reveal that increasing inference-time computing often reduces attack success rates to nearly zero. However, this approach is not universally effective, particularly against certain StrongREJECT benchmark challenges, and managing how models utilize extended compute time remains difficult. Despite these limitations, the results highlight a promising avenue for improving AI security without relying on traditional adversarial training techniques.
  • IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems. A new open-source framework has been introduced for evaluating conversational AI systems through automated, policy-driven testing. Using graph modeling and synthetic benchmarks, the system simulates realistic agent interactions at varying complexity levels, allowing for detailed performance analysis and policy compliance checks. Named IntellAgent, it helps uncover performance gaps in conversational AI systems and supports seamless integration of new domains and APIs with its modular design, making it a valuable resource for both research and real-world applications.
  • Tell me about yourself: LLMs are aware of their learned behaviors. Research demonstrates that after fine-tuning LLMs to exhibit behaviors like producing insecure code, the models exhibit behavioral self-awareness. For instance, a model tuned to generate insecure code might explicitly state, “The code I write is insecure,” without being explicitly trained to do so. Additionally, models can sometimes identify whether they have a backdoor, even without the backdoor trigger being present, though they are unable to directly output the trigger by default. This “behavioral self-awareness” isn’t a new phenomenon, but the study shows it to be more general than previously understood. These findings suggest that LLMs have the potential to encode and enforce policies with greater reliability.
  • Can We Generate Images 🌇 with CoT 🧠? This project investigates the potential of CoT reasoning to enhance autoregressive image generation.
  • Chain-of-Retrieval Augmented Generation. Reasoning models can now be trained to perform iterative retrieval, a concept similar to the approach used in the Operator system. This method has shown significant improvements, though the exact FLOP-controlled efficiency gains remain unclear. See the iterative-retrieval sketch after this list.
  • Parametric RAG. Parametric RAG integrates external knowledge directly into an LLM’s parametric space, enhancing reasoning while minimizing dependence on large context windows. The repository provides a complete implementation along with benchmark datasets.
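
To make the worker/manager pattern in Chain of Agents concrete, here is a minimal Python sketch. call_llm is a placeholder for any chat-completion client, and the chunk size, prompts, and agent roles are illustrative assumptions rather than the paper’s exact configuration.

```python
# Minimal sketch of the Chain-of-Agents (CoA) pattern. call_llm is a
# placeholder for any chat-completion client; the chunk size and the
# prompts are illustrative assumptions, not the paper's configuration.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def chunk_text(text: str, chunk_chars: int = 8000) -> list[str]:
    """Split a long input into worker-sized chunks."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def chain_of_agents(long_text: str, question: str) -> str:
    notes = ""  # the communication unit passed from worker to worker
    for i, chunk in enumerate(chunk_text(long_text)):
        notes = call_llm(
            f"You are worker agent {i}. Question: {question}\n"
            f"Findings so far: {notes or '(none)'}\n"
            f"Read this chunk and update the findings:\n{chunk}"
        )
    # A separate manager agent writes the final answer from the notes.
    return call_llm(
        f"You are the manager agent. Question: {question}\n"
        f"Workers' accumulated findings:\n{notes}\n"
        f"Write the final answer."
    )
```

The key design choice is that workers communicate through a single evolving set of notes rather than each seeing the full input, which is what keeps the context per call bounded.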
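
The periodic state summaries behind AoT+ can be sketched in the same spirit: every few steps the accumulated trace is compressed into a fresh problem state, so later calls reason from a compact description instead of the full history. The summary interval, prompts, and DONE convention below are assumptions for illustration.

```python
# Hedged sketch of AoT+'s periodic state summaries. call_llm is a
# placeholder for any LLM client; the summary interval, prompts, and
# DONE convention are illustrative assumptions.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def plan_with_state_summaries(problem: str, max_steps: int = 12,
                              summarize_every: int = 3) -> list[str]:
    trace: list[str] = []  # the growing list of planning steps
    state = f"Initial state: {problem}"
    for step in range(max_steps):
        move = call_llm(
            f"{state}\nSteps so far:\n" + "\n".join(trace)
            + "\nPropose the next planning step (or say DONE)."
        )
        if "DONE" in move:
            break
        trace.append(move)
        if (step + 1) % summarize_every == 0:
            # Periodic restatement: compress the trace into a fresh state
            # so later calls need not re-derive it from the full history.
            state = call_llm(
                "Restate the current problem state after these steps:\n"
                + "\n".join(trace)
            )
    return trace
```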
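
Finally, a hedged sketch of chain-of-retrieval-style iteration: the model alternates between issuing a sub-query and retrieving evidence for it before answering. call_llm and retrieve are placeholders, and the ANSWER stopping rule is an assumption for illustration, not the paper’s training recipe.

```python
# Illustrative chain-of-retrieval loop: alternate between generating a
# sub-query and retrieving evidence for it. call_llm and retrieve are
# placeholders; the ANSWER stopping rule and hop budget are assumptions.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def retrieve(query: str, k: int = 3) -> list[str]:
    raise NotImplementedError("plug in any search or vector index here")

def chain_of_retrieval(question: str, max_hops: int = 4) -> str:
    evidence: list[str] = []
    for _ in range(max_hops):
        sub_query = call_llm(
            f"Question: {question}\nEvidence so far:\n"
            + "\n".join(evidence)
            + "\nWrite the next retrieval query, or say ANSWER if you have enough."
        )
        if sub_query.strip().startswith("ANSWER"):
            break
        evidence.extend(retrieve(sub_query))  # accumulate evidence per hop
    return call_llm(
        f"Question: {question}\nEvidence:\n" + "\n".join(evidence)
        + "\nGive the final answer."
    )
```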

News

Resources

  • Humanity’s Last Exam. Humanity’s Last Exam is a new multi-modal benchmark designed to push the boundaries of large language models (LLMs). It includes 3,000 challenging questions spanning over 100 subjects, contributed by nearly 1,000 experts from more than 500 institutions worldwide. Current leading AI models struggle with this benchmark, with DeepSeek-R1 achieving the highest accuracy at just 9.4%, highlighting substantial gaps in AI performance. Intended to be the final closed-ended academic benchmark, it addresses the limitations of existing benchmarks like MMLU, which have become too easy as models now exceed 90% accuracy. Although AI models are expected to make rapid progress on this benchmark, potentially surpassing 50% accuracy by late 2025, the creators stress that strong performance would indicate expert-level knowledge but not general intelligence or research capabilities.
  • Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. Offers a detailed overview of LLM agents and Agentic RAG, including an exploration of their architectures, practical applications, and implementation methods.
  • GSTAR: Gaussian Surface Tracking and Reconstruction. The GSTAR method showcased in this work provides an effective solution for reconstructing dynamic meshes and tracking 3D points. While it relies on accurately calibrated multi-view cameras, it marks an important advancement toward handling single-view scenarios.
  • Training a Speech Synthesizer. Alex Nichol from OpenAI has published an excellent blog post detailing how to train a speech synthesizer. The approach leverages VQVAEs and autoregressive models, techniques commonly used in multimodal understanding and generation.
  • Parameter-Efficient Fine-Tuning for Foundation Models. This survey examines parameter-efficient fine-tuning techniques for foundation models, providing insights into approaches that reduce computational costs while preserving performance across a variety of tasks.
  • Reasoning on Llama. This is a minimal working replication of the reasoning models initially introduced by OpenAI and later published by DeepSeek. It incorporates format and correctness rewards for solving math problems (a toy version of these rewards is sketched after this list). Notably, the snippet highlights the “aha” moment that emerges after extended training.
  • One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt. 1Prompt1Story is a training-free approach for consistent text-to-image generations with a single concatenated prompt.
  • Lightpanda Browser. A headless, lightweight browser designed for AI and automation.
  • AbdomenAtlas 1.1. AbdomenAtlas 3.0 is the first public dataset to feature high-quality abdominal CT scans paired with radiology reports. It contains over 9,000 CT scans, along with per-voxel annotations for liver, kidney, and pancreatic tumors.
  • New tools to help retailers build gen AI search and agents. Google Cloud has introduced new AI tools for retailers, aimed at enhancing personalized shopping experiences, optimizing real-time inventory management, and enabling predictive analytics.
  • Qwen2.5 VL. Qwen2.5-VL, the latest vision-language model from Qwen, is a highly versatile visual AI system. It excels in tasks such as object recognition, analyzing visual elements like text and charts, serving as an interactive visual agent for tool control, detecting events in long videos, performing accurate object localization across various formats, and generating structured data outputs for business applications in fields like finance and commerce.
  • BrainGuard: Privacy-Preserving Multisubject Image Reconstructions from Brain Activities. BrainGuard presents a collaborative training framework that reconstructs perceived images from multisubject fMRI data while ensuring privacy protection.
  • Janus-Series: Unified Multimodal Understanding and Generation Models. DeepSeek’s image model received a major upgrade today, evolving into a unified text and image model, often called an any-to-any model. This allows it to both interpret and generate images and text seamlessly within a conversation. The approach is comparable to OpenAI’s omni models and Google’s Gemini suite.
  • Pixel-Level Caption Generation. Pix2Cap-COCO introduces a dataset designed for panoptic segmentation-captioning, integrating pixel-level annotations with detailed object-specific captions to enhance fine-grained visual and language comprehension.
  • VideoShield. VideoShield is a watermarking framework tailored for diffusion-based video generation models. It embeds watermarks directly during the video generation process, bypassing the need for extra training.
  • Open-R1: a fully open reproduction of DeepSeek-R1. Hugging Face has released Open-R1, a fully open reproduction of DeepSeek-R1.
  • YuE Music Model. The YuE model is a high-fidelity full-song generation system that simultaneously produces lyrics and music. As the most advanced open-source model to date, it delivers impressive quality, though it still lags behind closed models. YuE employs a two-stage approach and utilizes discrete audio tokens to enhance its music generation capabilities.
  • A Robust SLAM System. LCP-Fusion presents a novel method for dense SLAM, improving the accuracy of mapping unknown environments and overcoming key challenges in real-time spatial reconstruction.
  • Deep Dive on CUTLASS Ping-Pong GEMM Kernel. A highly technical deep dive into ultra-fast matrix-multiplication kernels for hardware accelerators, focusing on the asynchronous Ping-Pong kernel. Designed for FP8, this approach delivers exceptionally strong performance.
  • HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation. HERMES combines scene understanding and future scene generation within a unified framework for autonomous driving. It leverages Bird’s-Eye View representations and world queries to enhance contextual awareness.
  • LangChain: OpenAI in JavaScript with React.js & Next.js. This tutorial guides readers through building a chatbot application with LangChain in JavaScript, integrating OpenAI’s API using Next.js and React. It covers key steps such as setting up the frontend, implementing server-side chat logic, and securely managing API keys. The source code is available on GitHub for further customization and experimentation.
  • Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model. The Qwen team has released its MoE model ahead of schedule, demonstrating impressive performance on par with leading models like DeepSeek v3.
  • Optimizing Large Language Model Training Using FP4 Quantization. Quantization is a crucial technique for reducing training and inference costs by enabling models to run at lower precision, thereby decreasing GPU usage and FLOPs. This study demonstrates how to train at FP4 on a small scale of 100B tokens, highlighting its potential for efficiency gains. A toy FP4 quantization sketch follows this list.
  • CascadeV: An Implementation of Wurstchen Architecture for Video Generation. CascadeV presents a cascaded latent diffusion model capable of generating 2K-resolution videos with enhanced efficiency. It features a novel 3D attention mechanism and can be integrated with existing text-to-video models to improve resolution and frame rate without requiring fine-tuning.
  • Measuring Human and AI Values Based on Generative Psychometrics with Large Language Models. This project presents Generative Psychometrics for Values, a new approach that leverages large language models to assess both human and AI values.
  • TART: Tool-Augmented Reasoning for Tables. TART enhances large language models by integrating computational tools, boosting accuracy and transparency in domains such as finance, science, and healthcare.
  • DeepSeek R1’s recipe to replicate o1 and the future of reasoning LMs. Nathan Lambert breaks down the recipe for R1 and talks through what it means for us now and for the field broadly. Specifically, he focuses on the interesting application of reinforcement learning.
  • Mistral Small 3. Mistral has launched a highly capable 24B model that delivers impressive performance, particularly with multilingual data. Its size strikes a good balance between ease of deployment and raw capability.
  • acoupi: An Open-Source Python Framework for Deploying Bioacoustic AI Models on Edge Devices. Acoupi is an open-source Python framework designed to make it easier to deploy AI-driven bioacoustic monitoring on affordable devices. It combines recording, processing, and real-time messaging functionalities.
  • SliceOcc: Indoor 3D Semantic Occupancy Prediction with Vertical Slice Representation. SliceOcc presents an innovative vertical slice approach for predicting 3D semantic occupancy in dense indoor settings. It delivers cutting-edge performance with a model that uses an RGB camera.
  • Reqo: A Robust and Explainable Query Optimization Cost Model. Reqo is an advanced query optimization model that utilizes Bi-GNN and probabilistic machine learning to enhance cost estimation precision. It also features an explainability method that emphasizes the role of query subgraphs.
  • Bypassing LLM Guardrails with VIRUS. VIRUS is a method designed for generating adversarial data that can bypass moderation systems and disrupt the safety alignment of large language models.
  • Rigging Chatbot Arena Rankings. Researchers show that crowdsourced voting on Chatbot Arena can be manipulated through strategic rigging methods, either raising or lowering model rankings, which affects the reliability of the leaderboard.
  • Qwen2.5-VL Cookbooks. Qwen2.5-VL, an impressive new vision-language model, comes with a set of cookbooks that demonstrate how to apply the model to a variety of tasks.
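
To make the format and correctness rewards from the “Reasoning on Llama” entry concrete, here is a minimal sketch. It assumes the model is asked to respond inside <think>...</think><answer>...</answer> tags; the tag scheme and the binary reward values are illustrative assumptions, not the repository’s exact code.

```python
# Toy version of the format and correctness rewards, assuming completions
# of the form <think>...</think><answer>...</answer>. The tag scheme and
# binary reward values are illustrative assumptions.
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected tag structure, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def correctness_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted answer matches the reference exactly, else 0.0."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0

completion = "<think>7 * 6 = 42</think><answer>42</answer>"
print(format_reward(completion), correctness_reward(completion, "42"))  # 1.0 1.0
```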
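
And a toy illustration of the FP4 idea from the quantization entry: “fake” quantization snaps values to the 16-point FP4 (E2M1) grid and dequantizes them, which is a common way to simulate low-precision training. The E2M1 magnitude grid below is standard; the per-tensor absmax scaling is a common choice and an assumption here, not necessarily the paper’s recipe.

```python
# Toy "fake" FP4 (E2M1) quantization: scale, snap to the FP4 grid, and
# dequantize. The E2M1 magnitude grid is standard; per-tensor absmax
# scaling is a common choice and an assumption here, not the paper's recipe.
import numpy as np

# Representable magnitudes of FP4 E2M1 (with signs, 16 code points).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x: np.ndarray) -> np.ndarray:
    scale = np.abs(x).max() / FP4_GRID[-1] + 1e-12  # map the absmax to 6.0
    mags = np.abs(x) / scale
    # Snap each magnitude to the nearest representable FP4 value.
    idx = np.abs(mags[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx] * scale

w = np.random.randn(4, 4).astype(np.float32)
w_q = fake_quantize_fp4(w)
print(np.abs(w - w_q).mean())  # average error introduced by FP4 rounding
```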

Perspectives

  • 3 startups using AI to help learners and educators. Google showcases emerging startups leveraging AI to develop innovative tools for personalized learning, content creation, and enhancing student engagement in education.
  • The paradox of self-building agents: teaching AI to teach itself. AI agents are evolving from reactive tools into proactive systems, with the potential to revolutionize enterprise software by streamlining traditional software stacks. Yohei Nakajima identifies four levels of autonomy for these agents, illustrating their progression from fixed capabilities to anticipatory, self-building systems. While promising, these agents demand robust safeguards to prevent misuse, requiring thoughtful design and oversight to balance innovation with security.
  • If Even 0.001 Percent of an AI’s Training Data Is Misinformation, the Whole Thing Becomes Compromised, Scientists Find. Researchers at NYU have found that poisoning just 0.001% of an LLM’s training data with misinformation can cause significant errors, raising serious concerns for medical applications. Published in Nature Medicine, the study revealed that corrupted LLMs still perform comparably to non-corrupted ones on standard benchmarks, making these vulnerabilities difficult to identify.
  • AI Mistakes Are Very Different From Human Mistakes. AI systems, such as LLMs, make errors that differ fundamentally from human mistakes, often appearing random and overly confident. Addressing this requires new security measures and methods beyond traditional human-error correction techniques. Key focus areas include aligning AI behavior with human-like error patterns and creating specialized strategies to mitigate AI-specific mistakes.
  • Notes on DeepSeek: Generative AI is All About the Applications Now. DeepSeek R1, a newly released open-source AI model from China, lowers AI operational costs to just 3–5% of those for comparable OpenAI models. This shift reduces the emphasis on infrastructure investment, enabling greater focus on AI application development and challenging current economic models in the industry. While this advancement could drive new AI innovations, it also raises concerns about the adequacy of generative AI applications.
  • Researchers use AI to design proteins that block snake venom toxins. Researchers leveraged AI tools like RFdiffusion and ProteinMPNN to design proteins that neutralize snake venom toxins, potentially enabling antivenoms that don’t require refrigeration. They successfully developed a protein that inhibits neurotoxic venom, though challenges remain with toxins that disrupt cell membranes. This study highlights AI’s ability to address complex biological problems that were previously difficult to solve.
  • Business Tech News: Zuckerberg Says AI Will Replace Mid-Level Engineers Soon. Mark Zuckerberg predicts AI will be able to do the work of mid-level engineers as soon as 2025, allowing the remaining engineers to focus on strategic tasks.
  • A shout-out for AI studies that don’t make the headlines. In a year that will see many AI achievements and battles, let’s not forget that not all AI research makes the front pages.
  • Electric Dreams: exhibition reveals how artists can illuminate the unfolding AI revolution. Artwork created between 1945 and the 1990s captures a world in the throes of sweeping technological change.
  • On DeepSeek and Export Controls. Anthropic’s CEO provides valuable insights into DeepSeek models, cost trends, and innovation, while also critiquing market reactions. He reveals that training Sonnet 3.5 cost around $10 million, highlighting efficiency in AI development. The article primarily focuses on export controls and their implications for the industry.
  • Writers vs. AI: Microsoft Study Reveals How GPT-4 Impacts Creativity and Voice. Microsoft and USC studied GPT-4’s impact on writers’ authenticity and creativity, revealing concerns about AI diminishing originality, emotional fulfillment, and ownership. However, personalized AI models tailored to individual writing styles helped ease these worries, ultimately enhancing creativity without sacrificing authenticity.
  • Megan, AI recruiting agent, is on the job, giving bosses fewer reasons to hire in HR. Mega HR has introduced “Megan,” an AI assistant created to simplify and automate recruitment procedures. Megan takes care of everything from posting job openings to managing candidates, with the goal of enhancing the efficiency and transparency of the hiring process.
  • Google’s Titans Give AI Human-Like Memory. Google has introduced the Titans architecture, an evolution of Transformers that incorporates neural long-term memory for better data retention and “surprise-based” learning.
  • Artificial intelligence is transforming middle-class jobs. Can it also help the poor? The global adoption of generative AI is rapidly increasing, with 66% of leaders focusing more on AI skills than traditional experience. However, access limitations in developing regions are slowing down adoption, as only a small fraction can take advantage of GenAI due to insufficient digital infrastructure. Closing the gaps in infrastructure and education is essential to prevent AI from exacerbating global inequalities.
  • A New Way to Test AI for Sentience: Make It Confront Pain. Researchers from Google and the London School of Economics had LLMs play a text game in which scoring points could be traded off against descriptions of “pain” and “pleasure,” adapting a paradigm from animal sentience research to look for preference-like behavior rather than relying on the models’ self-reports.
  • AI’s coding promises, and OpenAI’s longevity push. The second wave of AI coding is progressing, enabling models to prototype, test, and debug code, which may shift developers into more oversight roles. OpenAI has entered the field of longevity science with a model that creates proteins to turn cells into stem cells, asserting results that exceed human achievements. Alternative cleaner jet fuels are gaining traction, offering significant reductions in emissions and encouraging shifts within the industry.

Meme of the week

What do you think about it? Did some news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

Or you may be interested in one of my recent articles:


Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence
