WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 23–29 September

Google CEO Sundar Pichai announces $120M fund for global AI education, Mira Murati is leaving OpenAI, Salesforce Ventures ups its AI fund to $1B, and much more

Salvatore Raieli
23 min read · Sep 30, 2024
Photo by Roman Kraft on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first in GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field


Research

  • Moshi: a speech-text foundation model for real-time dialogue. Presents a full-duplex spoken dialogue framework built on a speech-text foundation model, along with several system components: Helium, a 7B-parameter text LLM; Mimi, a semantic-acoustic neural audio codec with state-of-the-art audio quality; and a hierarchical multi-stream architecture that produces speech-to-speech output for any given dialogue.
  • Training Language Models to Self-Correct via Reinforcement Learning. Develops a multi-turn online reinforcement learning approach, trained entirely on self-generated data, to improve an LLM’s ability to self-correct. The authors show that SFT is inefficient at learning self-correction and suffers from a distribution mismatch between training data and model responses. They propose a two-stage method: the first stage optimizes correction behavior, and the second uses a reward bonus to amplify self-correction during training. Applied to the Gemini 1.0 Pro and 1.5 Flash models, it achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% on the MATH and HumanEval benchmarks, respectively.
  • On the Diagram of Thought. Strengthens LLMs’ capacity for reasoning with a mathematically rigorous framework. Diagram of Thought (DoT) models iterative reasoning in an LLM as the construction of a directed acyclic graph (DAG) that combines propositions, critiques, refinements, and verifications in a single structure. This lets DoT capture sophisticated logical deduction that is beyond the scope of linear or tree-based methods.
  • To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. Examines which tasks benefit most from chain-of-thought (CoT) prompting. Following a meta-analysis of over 100 papers and multiple evaluations, it concludes that CoT yields significant performance gains mostly on math and logic tasks. Most of the gain comes from improved symbolic execution, although CoT still underperforms a dedicated symbolic solver.
  • A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B. Examines how instruction-tuned LLMs ranging from 7B to 405B parameters perform under different quantization techniques. The main conclusions are that: 1) a larger LLM quantized down to a size similar to a smaller FP16 LLM typically performs better across most benchmarks; 2) performance varies significantly with quantization technique, model size, and bit-width, with weight-only methods frequently producing better results in larger models; and 3) task difficulty does not significantly impact the accuracy degradation caused by quantization.
  • Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning. Proposes the Iteration of Thought (IoT) framework to improve LLM responses and reasoning capabilities through adaptive reasoning paths. An inner dialogue agent acts as a guide that dynamically adjusts reasoning paths, allowing adaptive cross-path exploration and improving response accuracy. Unlike CoT and ToT, which are both rigid processes, IoT generates its prompts dynamically, which allows it to adapt.
  • Schrodinger’s Memory: Large Language Models. Uses the Universal Approximation Theorem (UAT) to explain how LLMs store memory. It also proposes a new way to assess LLM performance by comparing the memory capacities of various models. The Transformer architecture is described as a dynamic-fitting UAT model that adapts closely to its inputs, allowing LLMs to recall entire passages from minimal input.
  • Jailbreaking Large Language Models with Symbolic Mathematics. Uses GPT-4o to generate mathematically encoded prompts, which proves to be an effective jailbreaking strategy: the average attack success rate across 13 state-of-the-art LLMs is 73.6%. This indicates that current safety training does not generalize to mathematically encoded inputs.
  • Iterative Object Count Optimization for Text-to-image Diffusion Models. Generating a specific number of objects with a diffusion model is often difficult. This work introduces a counting token that enables the model to more accurately produce either a few or many instances of a given object. While it’s not flawless and is based on the original Stable Diffusion model, it significantly outperforms existing methods.
  • A Controlled Study on Long Context Extension and Generalization in LLMs. Researchers have created a standardized evaluation protocol designed to compare different methods for extending language models to effectively handle long document contexts.
  • MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning. MAgICoRe is a novel strategy designed to enhance reasoning in large language models by tackling challenges in refinement processes. It classifies problems based on difficulty, applying straightforward strategies to simpler tasks and employing multi-agent iterative refinement for more complex ones.
  • The Impact of Element Ordering on LM Agent Performance. The sequence in which UI elements are displayed greatly affects agent performance in virtual environments. Randomizing the order of elements can decrease performance as much as completely removing all visible text.
  • Larger and more instructable language models become less reliable. Scaling up and shaping up large language models increased their tendency to provide sensible yet incorrect answers at difficulty levels humans cannot supervise, highlighting the need for a fundamental shift in artificial intelligence design towards reliability.
  • SwiftDossier: Tailored Automatic Dossier for Drug Discovery with LLMs and Agents. This work addresses the limitations of LLMs in drug discovery by integrating an advanced Retrieval-Augmented Generation (RAG) system for more accurate answers and combining LLMs with external tools to create an automatic target dossier. The result is a production-ready dossier with comprehensive data, summarized into a PDF and PowerPoint presentation.
  • Self-Explainable AI. In the field of explainable AI, there is a strong focus on developing self-explainable models, which offer a more principled approach than post-hoc methods that attempt to interpret decisions after they have been made by opaque models. Despite its potential, this line of research often faces challenges such as lack of reproducibility, difficulties in comparison, and inconsistent standards. To address these issues, the authors introduce CaBRNet, an open-source, modular, and backward-compatible framework for Case-Based Reasoning Networks.
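
For readers who want a concrete picture of the weight-only quantization mentioned in the quantized-LLM evaluation above, here is a minimal sketch in plain Python. It assumes a generic symmetric round-to-nearest scheme with one scale per weight group; it is an illustration only, not the specific methods evaluated in that paper, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_weights(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one group of weights.

    Illustrative assumption: a single shared scale per group, signed integers.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit codes
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0    # avoid division by zero
    # Map each weight to the nearest integer code, clamped to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_weights(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_weights(weights, bits=4)
restored = dequantize_weights(codes, scale)
# Worst-case reconstruction error over the group.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Only the weights are stored at low precision here; activations stay in floating point, which is what "weight-only" refers to. The reconstruction error of each weight is bounded by half the scale, so lower bit-widths (larger scales) trade accuracy for size.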

News

Resources

  • Qwen2.5-Coder Technical Report. A series of models with 1.5B and 7B parameters based on the Qwen2.5 architecture and continually pretrained on 5.5 trillion tokens. It achieves state-of-the-art performance across more than 10 benchmarks and has strong capabilities in code generation, completion, reasoning, and repair.
  • Agents in Software Engineering: Survey, Landscape, and Vision. Gives a thorough overview of frameworks for LLM-based agents in software engineering.
  • Prompting ChatGPT o1. This guide was overlooked amidst the buzz around OpenAI’s new reasoning models. It explains how prompting this new model differs, emphasizing the need for simpler prompts and a more organized input context.
  • Jony Ive confirms he’s working on a new device with OpenAI. Jony Ive is teaming up with OpenAI CEO Sam Altman on a new AI hardware initiative, which might secure $1 billion in funding by the end of the year and includes involvement from key former Apple designers. Although details about the device are still unclear, the project aims to harness generative AI for enhanced user interactions.
  • Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries. Another impressive paper from Google demonstrates how to evaluate long-context models, following a directionally similar approach to the recent work by Magic.
  • 3DTopia-XL: High-Quality 3D PBR Asset Generation via Primitive Diffusion. The process of converting image and text inputs into 3D models involves generating a 3D mesh that is smoothed for high-quality surfaces, and then applying Physically-Based Rendering (PBR) lighting techniques to create realistic lighting and textures. This method ensures the final 3D object has detailed geometry, smooth surfaces, and lifelike lighting effects, making it suitable for use in various 3D applications such as games, VR/AR, and simulations.
  • aiq. A straightforward yet highly effective tool designed for labeling, embedding, and classifying unlabeled text directly from the command line. It supports real-time processing of streams, allowing it to handle piped input from various sources seamlessly.
  • Most powerful LLM on a single GPU. Solar Pro is a 22B parameter language model optimized to run on a single 80GB GPU. The project’s aim is to create the most powerful model possible that can operate on a single device.
  • Contextual Retrieval. Anthropic demonstrates a method for semantically chunking documents, which significantly boosts performance while keeping the cost low at just $1 per million chunks, thanks to caching.
  • An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability. Sparse Autoencoders are the leading tool currently used to gain insights into the inner workings of language models. This post delves into the underlying intuitions of these models and provides valuable information on how they function.
  • Generalized Knowledge Distillation Trainer. The TRL library has added GKD to its training procedures.
  • The Practitioner’s Guide to the Maximal Update Parameterization. Maximal Update Parameterization (muP) is an approach to model initialization that enables hyperparameter transferability across different scales. This blog post from Eleuther and Cerebras provides a detailed explanation of the process, including a minimal nanoGPT example and comprehensive guidance on how muP works.
  • Tackling fluffy clouds: field boundaries detection using time series of S2 and/or S1 imagery. This repository provides an implementation of a 3D Vision Transformer optimized for efficient field boundary delineation using time-series satellite imagery. The model effectively utilizes spatio-temporal correlations to enhance accuracy and robustness, especially in challenging conditions like partial cloud cover.
  • CritiPrefill. CritiPrefill is a technique aimed at speeding up the prefilling phase of long-context processing in large language models. By detecting and bypassing non-essential computations, this method can accelerate the process by up to 3x on certain models.
  • Document Similarity Search with ColPali. An excellent blog post that delves into the widely used multimodal Retrieval-Augmented Generation (RAG) system, demonstrating how it can be applied to address real-world problems effectively.
  • ControlEdit: A MultiModal Local Clothing Image Editing Method. ControlEdit is an innovative technique for precise multimodal editing of clothing images, enabling localized adjustments while preserving overall style and ensuring smooth, natural transitions.
  • ECCV-AIM Video Saliency Prediction Challenge 2024. The AIM 2024 Video Saliency Prediction Challenge required participants to predict saliency maps for a collection of video sequences using the newly compiled AViMoS dataset, which contains 1,500 videos.
  • Dynamic 2D Gaussians: Geometrically Accurate Radiance Fields for Dynamic Objects. Dynamic 2D Gaussians (D-2DGS) is an advanced technique for reconstructing precise meshes from sparse image inputs. Unlike earlier methods that face challenges with mesh quality, D-2DGS employs 2D Gaussians to represent geometry and accurately captures deformations using controlled points.
  • FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale. FastGL is a GPU-efficient framework developed to accelerate the training of Graph Neural Networks (GNNs) on large-scale graphs. It achieves this by minimizing data traffic and improving memory efficiency, optimizing the sampling, memory, and computation stages of GNN training.
  • Visualizing piecewise linear neural networks. Jane Street, a prominent quantitative firm, has published an excellent post exploring techniques for visualizing networks that are piecewise linear.
  • DreamHOI: A Novel AI Approach for Realistic 3D Human-Object Interaction Generation Using Textual Descriptions and Diffusion Models. DreamHOI is an AI technique for creating realistic 3D human-object interactions from textual descriptions using diffusion models. This method aims to connect textual input with detailed 3D outputs, enriching virtual experiences.
  • On human-in-the-loop optimization of human–robot interaction. From industrial exoskeletons to implantable medical devices, robots that interact closely with people are poised to improve every aspect of our lives. Yet designing these systems is very challenging.
  • Molmo. Allen AI has introduced an entirely open-source multimodal model that exceeds the performance of many existing open and proprietary vision-language models. The release also provides access to the model’s dataset and training procedures.
  • MaskBit: Embedding-free Image Generation via Bit Tokens. This study presents two significant advancements in image generation: an updated VQGAN model that enhances both accessibility and performance, and a novel embedding-free generation network utilizing bit tokens. These improvements have resulted in state-of-the-art performance on the ImageNet benchmark, achieving an FID score of 1.52 with a compact model containing 305 million parameters.
  • ComiCap: A VLMs pipeline for dense captioning of Comic Panels. Researchers have proposed a pipeline utilizing Vision-Language Models (VLMs) to generate detailed, grounded captions that connect comic elements and their relationships, thereby improving comic analysis.
  • Exploring Parallel Strategies with Jax. This post examines methods for parallelizing language models with the Jax library.
  • Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts. Time-MoE is a Mixture of Experts model designed to handle billion-scale time series prediction tasks.
  • HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models. HelloBench is a benchmarking tool that assesses LLMs across five long text generation tasks, using Bloom’s Taxonomy as the evaluation framework.
  • Python library generation from scratch. A cool benchmark for code generation that measures the ability of language models to generate full packages from scratch.
  • BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices. BitQ is a framework designed to enhance block floating point (BFP) quantization, specifically tailored for optimizing deep neural networks on embedded platforms. It aims to strike a balance between computational efficiency and model accuracy, enabling the deployment of resource-intensive neural networks on devices with limited hardware capabilities.
  • circuit_training. Google has introduced new models, training code, and simulators that leverage reinforcement learning (RL) to generate floor plans for chip design. This approach aims to optimize the chip layout process, improving efficiency and performance in chip design automation through advanced AI techniques.
  • statewide-visual-geolocalization. Researchers have developed a method that accurately determines the geolocation of street-view photos by matching them with a database of aerial images. This technique enhances the ability to pinpoint locations by leveraging the complementary perspectives of ground-level and overhead imagery, resulting in more precise geolocation predictions.
  • DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling. Researchers have introduced a novel data augmentation framework that integrates large language models with diffusion models to produce diverse and semantically accurate images, particularly in data-scarce scenarios. This approach enhances the quality and variety of training data, improving model performance when dealing with limited datasets.
  • How streaming LLM APIs work. A review of HTTP streaming APIs from different LLM providers highlighted shared patterns. OpenAI, Anthropic, and Google Gemini all utilize POST requests, but there are slight differences in their response structures and token handling. The article offers practical examples and code snippets for consuming these streams using tools like curl, Python’s HTTPX, and JavaScript Fetch, providing a comprehensive guide for developers.
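
As a rough illustration of the streaming pattern described in the last item above, the sketch below parses a server-sent-events style line stream of the kind these HTTP streaming APIs return. It is a minimal sketch under stated assumptions: the `{"delta": ...}` JSON shape and the helper names are hypothetical, the `[DONE]` sentinel is treated as an end-of-stream marker in the style some providers use, and real providers differ in their exact response structure. The input is a hard-coded list of lines, so no network call is made.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event in a stream of lines."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, one optional space follows the colon.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line terminates the current event.
            yield "\n".join(buffer)
            buffer = []
        # Other fields (event:, id:, comments starting with ":") are ignored here.
    if buffer:
        yield "\n".join(buffer)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate text fragments until the end-of-stream sentinel."""
    parts: List[str] = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":              # assumed end-of-stream marker
            break
        chunk = json.loads(payload)
        parts.append(chunk["delta"])         # hypothetical field name
    return "".join(parts)


# Hard-coded sample standing in for the body of a streaming HTTP response.
sample_stream = [
    'data: {"delta": "Hel"}\n',
    "\n",
    'data: {"delta": "lo"}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]
text = collect_text(sample_stream)
```

In practice the lines would come from a streaming HTTP client rather than a list, and the field that carries each text fragment depends on the provider.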

Perspectives

Meme of the week

What do you think about it? Did any news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn, where I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

or you may be interested in one of my recent articles:

Written by Salvatore Raieli