WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 14–20 October

Microsoft Patents Audio-to-Image Generator, Google’s World-First Nuclear Power Deal for AI Data Centers, AMD Launches AI Chip to Compete with Nvidia, and much more

Salvatore Raieli
Photo by Johann Walter Bantz on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field


Research

  • Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models. Introduces a novel RAG method to address the challenges of imperfect retrieval augmentation and knowledge conflicts in LLMs. Astute RAG adaptively extracts critical information from the internal knowledge of LLMs, then iteratively merges it with external knowledge while maintaining source awareness. Its iterative consolidation mechanism enhances the integration of internal and external information by identifying consistent passages, detecting conflicting data, and filtering out irrelevant content.
  • ToolGen: Unified Tool Retrieval and Calling via Generation. Incorporates tool knowledge directly into LLMs by encoding tools as unique tokens, allowing the model to generate tool calls and arguments, facilitating smooth tool invocation alongside natural language generation. Experiments involving over 47,000 tools demonstrate that ToolGen outperforms existing baselines in both tool retrieval and autonomous task execution.
  • Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG. Finds that in many long-context LLMs, output quality diminishes as the number of passages increases, with the performance decline attributed to retrieved hard negatives. The authors propose two methods to enhance long-context LLM-based RAG: retrieval reordering and RAG-specific tuning with intermediate reasoning to improve relevance identification. These approaches show marked improvements in both accuracy and robustness in long-context RAG performance. A minimal sketch of the retrieval-reordering idea appears after this list.
  • GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. Evaluates several state-of-the-art (SoTA) models using a benchmark built with symbolic templates that allow for a range of mathematical problems. The results show that LLMs display variability when answering different versions of the same questions, and their performance drops when numerical values in the questions are adjusted. As the complexity of the questions increases (e.g., adding more clauses), performance deteriorates significantly. The authors suggest that this decline in performance is likely due to a lack of logical reasoning capabilities in current LLMs.
  • Addition is All You Need for Energy-efficient Language Models. Introduces an algorithm that approximates floating-point multiplication using integer addition operations, making it computationally less intensive than 8-bit floating-point arithmetic while achieving higher precision. The authors report that implementing the proposed L-Mul operation in tensor processing hardware could potentially reduce energy consumption by 95% for elementwise floating-point tensor multiplications and by 80% for dot product operations. A toy illustration of the L-Mul idea appears after this list.
  • I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy. Examines the interaction patterns of LLMs within a multi-agent setting involving a social hierarchy, specifically in a scenario where a guard and a prisoner interact, with the prisoner either seeking extra yard time or attempting to escape. The study finds that when power dynamics are present, LLMs struggle to maintain coherent conversations. Additionally, the authors highlight that agents’ personas significantly influence their behaviors. Interestingly, even without explicit prompting, merely assigning roles to agents resulted in the emergence of anti-social behaviors.
  • Were RNNs All We Needed? The paper revisits RNNs and demonstrates that removing the hidden-state dependence from the input, forget, and update gates allows for efficient parallel training, eliminating the backpropagation through time (BPTT) that architectures like LSTMs and GRUs rely on. The authors introduce new minimal variants, minLSTM and minGRU, which are about 175 times faster to train for sequences of length 512. A sketch of a single minGRU step appears after this list.
  • LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. The study finds that “truthfulness” information in LLMs is concentrated in specific tokens, offering a way to improve error detection and address related challenges. They also suggest that the internal representations of LLMs can be used to predict the types of errors these models are prone to making.
  • Archon: An Architecture Search Framework for Inference-Time Techniques. The paper presents a modular framework for constructing and optimizing LLM systems by integrating various inference-time techniques. This approach redefines the task of LLM system design as a hyperparameter optimization problem. Tested on benchmarks like MT-Bench and CodeContests, the framework, named Archon, outperforms top models such as GPT-4o and Claude 3.5 Sonnet, achieving a 15.1% average accuracy improvement.
  • RATIONALYST: Pre-training Process-Supervision for Improving Reasoning. RATIONALYST is a model designed for process-supervision of reasoning, enabling it to generalize across a wide range of reasoning tasks. This is accomplished by pre-training on a dataset of 79k rationales from the Pile and a variety of reasoning datasets, with minimal human involvement. Fine-tuned from LLaMa-3-8B, the model achieves a 3.9% average accuracy improvement across seven reasoning benchmarks.
  • Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation. The paper introduces a unified framework to evaluate an LLM’s capability to provide factual responses, assess retrieval skills, and reason through the generation of final answers. The framework includes multi-hop questions that require combining information from multiple sources. It reports that state-of-the-art LLMs struggle with this task, achieving only 40% accuracy without retrieval. However, the proposed multi-step retrieval method improves performance to 66% accuracy.
  • Not All LLM Reasoners Are Created Equal. Examines how well LLMs handle chained grade-school math problems, where the answer to the first question is needed to solve the second. Most models exhibit a significant reasoning gap: accuracy on these compositional pairs is notably lower than what their performance on each question in isolation would suggest, and the gap is largest for smaller, more cost-efficient, and math-specialized models.
  • Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis. Training generative models like GANs with limited data is challenging. Existing Implicit Maximum Likelihood Estimation (IMLE) methods suffer from poor alignment between the latent codes used during training and those used during inference. The proposed approach, RS-IMLE, modifies the prior distribution during training, resulting in better test-time performance and higher-quality image generation.
  • Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models. This study introduces a unified framework aimed at enhancing training stability in continuous-time consistency models, leading to substantial improvements in the performance of generative models.
  • DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection. DARNet is an innovative model for auditory attention detection (AAD) that improves the decoding of brain signals, such as EEG, by integrating spatiotemporal and dual attention mechanisms.
  • DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. DuoAttention is a framework designed to optimize memory usage and reduce latency in long-context LLMs by selectively applying full key-value (KV) caching to only the most essential attention heads. A sketch of the per-head cache policy appears after this list.
  • Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement. Meta Decision Transformer (Meta-DT) aims to enhance generalization in reinforcement learning by integrating transformer-based sequential modeling with effective task representation learning.
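
To make a few of the mechanisms above concrete, here are some minimal, illustrative sketches in Python. First, the retrieval-reordering idea from the long-context RAG paper: push the highest-scoring retrieved passages toward the two ends of the context, where long-context models tend to attend most reliably. The interleaving scheme below is an assumption about the spirit of the method, not the paper’s exact recipe.

```python
def reorder_passages(passages_with_scores):
    """Place the best-scoring retrieved passages at the two ends of the
    context and the weakest ones in the middle (a sketch of retrieval
    reordering; the paper's exact procedure may differ)."""
    ranked = sorted(passages_with_scores, key=lambda p: p[1], reverse=True)
    front, back = [], []
    for i, (text, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]  # best passage first, second-best last

print(reorder_passages([("p1", 0.9), ("p2", 0.7), ("p3", 0.5), ("p4", 0.2)]))
# ['p1', 'p3', 'p4', 'p2']
```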
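
Next, the L-Mul idea from “Addition is All You Need”: writing each operand as (1 + x) times a power of two, the exact product mantissa is 1 + xa + xb + xa·xb, and L-Mul replaces the costly xa·xb term with a small constant offset so only additions remain. The sketch below works on ordinary Python floats rather than 8-bit hardware formats, and the fixed 2^-4 correction is a simplification of the paper’s offset term.

```python
import math

def lmul(a: float, b: float, offset_exp: int = 4) -> float:
    """Approximate a * b by adding mantissas instead of multiplying them,
    plus a constant correction term (illustrative only; the paper targets
    low-bit hardware formats and integer adders)."""
    if a == 0.0 or b == 0.0:
        return 0.0
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    ma, ea = math.frexp(abs(a))        # abs(a) = ma * 2**ea, ma in [0.5, 1)
    mb, eb = math.frexp(abs(b))
    xa, xb = 2 * ma - 1, 2 * mb - 1    # rewrite as (1 + x) * 2**(e - 1)
    exp = (ea - 1) + (eb - 1)
    # Exact mantissa: 1 + xa + xb + xa*xb; L-Mul drops xa*xb for a constant.
    approx = 1 + xa + xb + 2.0 ** -offset_exp
    return sign * approx * 2.0 ** exp

print(lmul(3.7, -2.4), 3.7 * -2.4)     # approximately -8.45 vs. the exact -8.88
```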
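
The minGRU variant from “Were RNNs All We Needed?” keeps the gate and candidate state as functions of the current input only, so each step is an affine recurrence in the previous hidden state and training can use a parallel scan instead of BPTT. One recurrent step, sketched from that description:

```python
import torch

def min_gru_step(x_t, h_prev, W_z, W_h):
    """One minGRU-style step: the update gate z_t and candidate h_tilde depend
    only on x_t, never on h_prev, which is what enables parallel training."""
    z_t = torch.sigmoid(x_t @ W_z)   # update gate, input-only
    h_tilde = x_t @ W_h              # candidate state, input-only
    return (1 - z_t) * h_prev + z_t * h_tilde

# Sequential rollout for illustration; training would use a parallel scan.
d_in, d_h, T = 8, 16, 5
W_z, W_h = torch.randn(d_in, d_h), torch.randn(d_in, d_h)
h = torch.zeros(d_h)
for x_t in torch.randn(T, d_in):
    h = min_gru_step(x_t, h, W_z, W_h)
```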
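
Finally, DuoAttention’s memory savings come from giving only “retrieval” heads a full KV cache while “streaming” heads keep just a few initial sink tokens plus a recent window. Which heads fall into which group is learned in the paper; the snippet below only illustrates the cache-size asymmetry.

```python
def streaming_kv_indices(seq_len: int, sink: int = 4, recent: int = 256) -> list:
    """KV positions a streaming head retains: the initial 'sink' tokens plus
    the most recent window. A retrieval head would keep all range(seq_len)."""
    kept = set(range(min(sink, seq_len))) | set(range(max(0, seq_len - recent), seq_len))
    return sorted(kept)

print(len(streaming_kv_indices(10_000)))  # 260 cached positions instead of 10,000
```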

News

Resources

  • MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. Introduces a new benchmark to assess machine learning agents’ proficiency in machine learning engineering tasks. The benchmark consists of 75 Kaggle competitions focused on key MLE skills, including model training, dataset preparation, and experiment execution. OpenAI’s o1-preview model, utilizing the AIDE scaffolding, reaches a bronze-medal level in 16.9% of the competitions.
  • Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System. Presents a novel framework aimed at improving both communication efficiency and task effectiveness in LLM-based multi-agent systems through targeted LLM training. It introduces an iterative “generate, rank, select, and train” approach, enhanced by a reward function to optimize performance, token usage, and communication efficiency. The framework integrates Monte Carlo Tree Search-inspired techniques for DPO data generation, promoting diverse exploration. Experimental results show consistent improvements over single-agent baselines and standard multi-agent systems (MAS) using Llama 3 8B, achieving a 2.8x performance boost while utilizing fewer than 10% of tokens on tasks involving extensive information exchange.
  • Zyphra’s Mamba 2-based model beats Mistral. Introduces the first state-space-style model that surpasses transformers at the 7B scale. It excels at understanding and generating long-context data, thanks to the linear-time scaling of the Mamba 2 blocks, which significantly enhances its efficiency and performance.
  • OpenAI’s Swarm. OpenAI has introduced a lightweight framework designed to facilitate communication between agents. While it will not receive further updates, the framework could still offer valuable ideas and inspiration for future developments.
  • EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models. EvolveDirector aims to develop a competitive text-to-image generation model using open, publicly available resources, avoiding the limitations imposed by proprietary models.
  • Rethinking the Evaluation of Visible and Infrared Image Fusion. Researchers propose the Segmentation-oriented Evaluation Approach (SEA) to improve the evaluation of Visible and Infrared Image Fusion (VIF) techniques, which play a critical role in applications such as object detection and semantic segmentation.
  • A Gentle Introduction and Tutorial on Deep Generative Models in Transportation Research. Provides a comprehensive overview of how deep generative models can be applied to solve transportation problems.
  • Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis. Trans4D is a new framework developed to address the challenges of realistic 4D scene transitions, enhancing text-to-4D synthesis. It offers improved capabilities in generating coherent, dynamic 4D scenes from textual descriptions, making it more suitable for tasks that require accurate spatial and temporal scene transitions.
  • DocMTAgent. DelTA, short for Document-levEL Translation Agent, is an online translation tool designed for handling document-level translations. It leverages a multi-level memory architecture to improve translation accuracy and coherence across larger texts, providing more context-aware translations compared to sentence-level models.
  • Fast Feedforward 3D Gaussian Splatting Compression. Fast Compression of 3D Gaussian Splatting (FCGS) is a new model designed to eliminate the need for the slow, per-scene optimization required by earlier methods. Instead, FCGS achieves rapid compression using a quick feed-forward pass, reducing the processing time from minutes to just seconds. This significantly accelerates the compression process while maintaining high-quality results for 3D data.
  • OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling. OneRef presents an optimized framework for referring segmentation by integrating visual and language feature spaces within a unified transformer architecture.
  • SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction. SmartPretrain offers a versatile, model-agnostic, and dataset-agnostic self-supervised learning framework designed to enhance motion prediction in autonomous vehicles.
  • UvA — An Introduction to Group Equivariant Deep Learning. Resources for studying deep learning techniques applied to specific types of geometric data while addressing architectural limitations.
  • Diffusion model simulating CS:GO. An open-source replication of a diffusion model that generates visual simulations of a video game, using keyboard and mouse inputs to influence the output.
  • Reward-Augmented Data Enhances Direct Preference Alignment of LLMs. This study addresses the shortcomings of current alignment algorithms in large language models (LLMs), which tend to overfit to relative preferences and neglect response quality. The authors introduce reward-conditioned LLM policies and a novel data relabeling method that incorporates response quality, enabling the model to better generalize to optimal responses. A toy relabeling sketch appears after this list.
  • entropix. Entropix is a tool designed to modify the sampling behavior of language models. A toy entropy-aware sampler is sketched after this list.
  • LoLCATs Blog Part 2: How to Linearize LLMs for Me and You. Hazy Research has published another insightful post that delves into techniques for linearizing existing language models while maintaining much of their performance. This exploration highlights methods to simplify model architectures, making them more efficient, without significantly compromising their effectiveness in tasks like text generation and understanding.
  • TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control. TextCtrl is a newly introduced diffusion-based method designed to enhance scene text editing. It achieves a balance between maintaining content accuracy and preserving the original style, ensuring that both the textual content and the visual appearance remain consistent during edits.
  • Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies. iDP3 is an advanced 3D visuomotor policy designed to enable humanoid robots to autonomously navigate and perform tasks in a variety of real-world environments. This improved policy enhances the robot’s ability to perceive and interact with its surroundings, making it more adaptable and efficient in complex and dynamic settings.
  • tabled. Tabled is a small library for detecting and extracting tables. It uses Surya to find all the tables in a PDF, identifies the rows/columns, and formats cells into Markdown, CSV, or HTML.
  • HART: Efficient Visual Generation with Hybrid Autoregressive Transformer. HART is a cutting-edge visual generation model designed to produce high-quality 1024x1024 images, presenting a challenge to the capabilities of diffusion models. It enhances image reconstruction and reduces training costs by employing a hybrid tokenizer that integrates both discrete and continuous tokens, resulting in more efficient and effective image generation.
  • DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention. The Deformable Bi-level Routing Attention (DBRA) module is an innovation designed to enhance attention mechanisms in vision transformers. DeBiFormer, which is built upon DBRA, optimizes the selection of key-value pairs in the attention process, resulting in more efficient computations and better interpretability of queries within attention maps. This leads to improved performance and understanding of how the model attends to different parts of an image.
  • Six tips for going public with your lab’s software. It’s not enough to write high-quality programs. If you want to make your apps public — and usable — you should also follow these steps.
  • CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos. CoTracker is a newly developed tracking model that bridges the performance gap between synthetic and real video data by employing semi-supervised training techniques.
  • A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration. Researchers have developed a novel consistency-aware spot-guided Transformer designed to improve the efficiency and accuracy of point cloud registration.
  • Ditto — the simplest self-building coding agent. Ditto is a user-friendly tool that allows you to generate a multi-file Flask application from simple natural language descriptions using a no-code interface. By leveraging a simple LLM loop with a few tools, Ditto automates the coding process, (occasionally) turning your ideas into functional web applications (or at least trying and getting close).
  • F5 Text-to-Speech System. F5-TTS is a non-autoregressive, zero-shot text-to-speech system featuring a flow-matching mel spectrogram generator and a diffusion transformer. Developed on the MLX framework, F5 outperforms earlier systems such as E2 TTS by incorporating ConvNeXT v2 blocks for improved text alignment, enabling high-quality speech generation in approximately 11 seconds on modern hardware.
  • Movie Gen Bench. Movie Gen Bench is an evaluation benchmark designed to assess performance in both video (Video Bench) and audio (Audio Bench). It includes 1,003 prompts that cover a variety of testing aspects and concepts.
  • LongAlign. LongAlign enhances the capability of text-to-image (T2I) diffusion models to process lengthy text inputs by incorporating segment-level encoding and a decomposed preference optimization approach.
  • Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective. DiGIT is an auto-regressive generative model that forecasts tokens in a latent space through self-supervised learning. This discrete tokenizer enhances image generation on ImageNet by clustering hidden states derived from DINOv2.
  • FL-Launching (Fling). The FedPart method tackles the layer mismatch problem in federated learning by limiting model updates to designated layers in each training round.
  • Distributed Training Guide. This is an in-depth guide on best practices for distributed training, troubleshooting errors, and maximizing the use of available resources.
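
Two small sketches for the resources above as well. For the reward-augmented preference-alignment paper, one simple way to condition a policy on response quality is to fold a quality score into the prompt before fine-tuning. This relabeling is only an assumption about the spirit of the method, with hypothetical field names, not the paper’s recipe.

```python
def reward_condition(example: dict) -> dict:
    """Toy reward-conditioned relabeling: expose the response's quality score
    in the prompt so the model can learn to generate toward a target reward.
    (Illustrative assumption; field names and format are hypothetical.)"""
    prompt, response, reward = example["prompt"], example["response"], example["reward"]
    return {"prompt": f"[target quality: {reward:.1f}]\n{prompt}", "response": response}

print(reward_condition({"prompt": "Summarize the abstract.", "response": "...", "reward": 0.8}))
```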
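
And for entropix: the project tweaks how tokens are sampled from the model’s output distribution. The toy sampler below simply scales temperature with the normalized entropy of the logits (confident distribution, low temperature; flat distribution, high temperature); the real project’s heuristics are more involved, so treat this purely as a sketch of entropy-aware sampling.

```python
import numpy as np

def entropy_scaled_sample(logits: np.ndarray, t_low: float = 0.3, t_high: float = 1.2) -> int:
    """Sample a token with a temperature that grows with the entropy of the
    next-token distribution (a toy illustration of entropy-aware sampling)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    frac = entropy / np.log(len(logits))            # 0 = peaked, 1 = uniform
    temperature = t_low + (t_high - t_low) * frac
    scaled = np.exp((logits - logits.max()) / temperature)
    scaled /= scaled.sum()
    return int(np.random.choice(len(logits), p=scaled))

print(entropy_scaled_sample(np.array([2.0, 1.0, 0.5, -1.0])))
```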

Perspectives

  • Nobel winner Geoffrey Hinton is the ‘godfather of AI’. Here’s an offer he shouldn’t refuse… The computer scientist’s dogged belief in the potential of neural networks helped unlock machine learning. But he’d be wise to remember the experience of a fellow laureate
  • Machines of Loving Grace. Dario Amodei, CEO of Anthropic, often writes internal memos, and one of them was published externally. In this memo, he explores the potential extremely positive impact of successfully building powerful AI systems. He envisions how AI could radically transform the world for the better, improving areas like science, economics, and societal well-being, while acknowledging the immense responsibility of ensuring AI development is aligned with human interests and safety.
  • This AI-Powered Invention Machine Automates Eureka Moments. Iprova’s AI-driven software analyzes diverse technical literature to generate patentable inventions by linking previously unrelated ideas. It uses semantic search and generative AI to identify novel inventions for companies like Procter & Gamble and Panasonic. Although AI plays a key role, human insight remains essential for applying the inventions practically, especially in fast-evolving industries. Iprova highlights the importance of human creativity in refining and validating invention ideas, ensuring that AI serves as a tool to enhance rather than replace human innovation.
  • Burn the Playbooks. AI excels at tasks that follow structured rulesets, such as automating tax processes or solving math problems, where it can often outperform humans. However, relying too much on playbook-driven approaches in our work risks stifling human creativity, a key trait that differentiates us from machines. Overemphasizing formulaic tasks could make us more dependent on AI’s strengths, limiting our own unique creative potential and inadvertently making us more “machine-like” in areas where creativity and flexibility are crucial.
  • Hurricane Helene and the ‘Fuck It’ Era of AI-Generated Slop. An AI-generated image depicting Hurricane Helene has gone viral, despite viewers being fully aware that it isn’t real. The image has sparked widespread attention and discussion, highlighting the power of AI-generated content to captivate audiences even when the authenticity is known. This trend reflects the growing influence of AI in shaping public perception and the viral nature of digital content.
  • OpenAI pursues public benefit structure to fend off hostile takeovers. OpenAI is planning to restructure as a public benefit corporation (PBC) to safeguard against hostile takeovers and ensure its mission of benefiting humanity remains intact. This change will help OpenAI maintain its commitment to ethical AI development, prioritizing public good over profit while allowing the organization to continue innovating in a sustainable and mission-driven way.
  • AI Will Take Over Human Systems From Within. In this post, Yuval Noah Harari, the Israeli historian and author of “Sapiens,” “Homo Deus,” and “Nexus,” explores the impact of information networks and AI on societal narratives, which can either unite or fragment communities. He cautions that AI, functioning as an “alien intelligence,” could centralize power due to its lack of self-correcting mechanisms, potentially threatening democratic systems. Harari stresses the importance of strong institutions to uphold truth in a world increasingly influenced by AI-driven decision-making across different sectors.
  • Sticky humans in a post-AGI world. AI tutors encounter considerable difficulties in replicating the social and intellectual interactions offered by human teachers. Although AI has made progress, it still falls short in handling complex educational tasks and cannot deliver the nuanced socio-intellectual experiences that human educators provide. A hybrid approach, where AI complements rather than replaces human teachers, may be more effective, given the essential social and cultural elements of the learning process.
  • AI has dreamt up a blizzard of new proteins. Do any of them actually work? Emerging protein-design competitions aim to sift out the functional from the fantastical. However, researchers hope that the real prize will be a revolution in the field.
  • Considerations for governing open foundation models. Foundation models drive AI innovation, but debates on their release — whether open or closed — raise concerns about potential risks and the impact of regulations on innovation.
  • I AI-generated some podcasts — and the results are uncanny. Google’s new tool NotebookLM lets you create podcasts at the click of a button. They’re way more realistic than you’d think …
  • SB 1047: Our Side Of The Story. California’s proposed SB 1047, which sought to require AI companies to address existential risks posed by their technologies, was vetoed by Governor Newsom. He argued that the bill did not adequately regulate smaller, potentially dangerous AI models. Despite strong support from AI safety advocates like Dan Hendrycks and high-profile figures such as Elon Musk, the bill faced opposition from major AI companies, including OpenAI and Google. Newsom’s veto has sparked discussions within the AI community about future regulatory strategies and potential collaborations with broader political groups to create comprehensive AI safety measures.
  • Overview of strong human intelligence amplification methods. Advancements in AI depend on developing humans with enhanced cognitive abilities to effectively manage the complexities of AGI development. Approaches such as brain emulation, genomic modifications, adult brain gene editing, and brain-brain interfaces are being explored, each presenting distinct challenges and risks. These efforts are aimed at solving deep philosophical issues, significantly amplifying human intelligence, and addressing the potential threats posed by AGI.
  • LLMs don’t do formal reasoning — and that is a HUGE problem. A study conducted by Apple raises questions about the effectiveness of large language models (LLMs), revealing that they primarily depend on pattern matching instead of formal reasoning. This reliance results in fragile and inconsistent outcomes, challenging the robustness of LLMs in tasks requiring deeper cognitive processes.
  • Why ChatGPT maker OpenAI is in a fight with Open AI. OpenAI is currently engaged in a legal dispute with Guy Ravine’s company, Open AI, over the rights to the “Open AI” name and the original open-source AI vision. The conflict centers on ownership of the name and the direction of the open-source principles that initially defined the AI development approach.
  • AI mediation tool may help reduce culture war rifts, say researchers. A system built by the Google DeepMind team takes individual views and generates a set of group statements
  • Here’s the deal: AI giants get to grab all your data unless you say they can’t. Fancy that? No, neither do I. Data is vital to AI systems, so firms want the right to take it and ministers may let them. We must wake up to the danger
  • Where’s The Generative AI ROI? Start With The Supply Chain. Generative AI is revolutionizing supply chain operations by effectively managing unstructured documents, resulting in substantial time and cost savings. Flexport, a technology company focused on supply chain solutions, has effectively implemented AI to automate and optimize document management, cutting processing time by 80%. This use of AI highlights its practical value in revenue-generating activities rather than merely in theoretical advancements.

Meme of the week

What do you think about it? Was there any news that captured your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects, and you can subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

or you may be interested in one of my recent articles:


Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence