WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 26 August — 1 September

Microsoft releases Phi, scientists use AI to predict dementia, and much more

Salvatore Raieli
19 min read · Sep 2, 2024
Photo by Priscilla Du Preez 🇨🇦 on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. Single posts are also collected here:

Weekly AI and ML news - each week the best of the field


Research

  • Automated Design of Agentic Systems. Argues that it is possible to learn any agentic system, including prompts, tool use, control flows, and more. The approach rests on three main components: a search space (which agents can be defined), a search algorithm (how to explore that space), and an evaluation function (how candidate agents are scored). The paper presents Meta Agent Search, a meta agent that iteratively programs and tests new agents based on a growing archive of previous discoveries (a minimal sketch of this loop appears after the list).
  • LLM Pruning and Distillation in Practice: The Minitron Approach. A thorough report on effective methods for compressing the Llama 3.1 and Mistral NeMo models, applying pruning and distillation to produce 4B and 8B parameter models, respectively. Before pruning, the teacher model is fine-tuned on the distillation datasets, which leads to better distillation; the resulting 8B model (MN-Minitron-8B) outperforms all similarly sized models on common language-modeling benchmarks.
  • The Vizier Gaussian Process Bandit Algorithm. Introduces Vizier, an open-source Python implementation of the Gaussian process bandit optimization algorithm that Google has used for millions of optimizations and research studies. The paper includes benchmarking results that show the algorithm's broad applicability.
  • Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information. Proposes a two-stage prompting technique that removes irrelevant information from the context: a self-mitigation process first identifies the irrelevant information and then filters it out, improving the model's robustness and overall performance on reasoning tasks (see the sketch after this list).
  • MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding. Demonstrates how speculative decoding can improve throughput, lower latency, and preserve accuracy in long-context generation. The authors find that the bottleneck shifts from compute-bound to memory-bound as sequence length and batch size increase, and use this observation to show that speculative decoding pays off even more for longer sequences, including at large batch sizes (the core acceptance rule is sketched after this list).
  • PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars. Employs a hybrid self-ensembling approach based on diverse exemplars to improve overall LLM performance: it greedily generates multiple candidate responses from prompts with diverse exemplars and then aggregates them with an LLM to produce a final response. This achieves lower cost than self-consistency approaches and better accuracy than plain greedy decoding (sketched after this list).
  • Autonomous Driving with Spiking Neural Networks. Spiking Autonomous Driving (SAD) is the first unified spiking neural network (SNN) designed to tackle the energy challenges of autonomous driving.
  • Pre-training Small Base LMs with Fewer Tokens. Inheritune is a simple technique for creating smaller base language models from larger ones: the student inherits a few transformer blocks from the teacher and is then trained on a very small fraction (0.1%) of the original data. With this method and just one A6000 GPU, a 1.5B-parameter model could be created in less than 30 minutes, with performance comparable to larger models trained on much greater amounts of data (the inheritance step is sketched after this list).
  • Teaching chat models to solve chess puzzles. Traditional base language models are surprisingly competent chess players, averaging around 1800 Elo; chat models, however, often show a sharp drop in performance. This article explains how prompting and fine-tuning can teach chat models, such as GPT-4o, to play chess.
  • xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations. Salesforce's text-to-video (T2V) model xGen-VideoSyn-1 creates lifelike scenes from written descriptions. The model uses a video variational autoencoder (VidVAE) to compress video data and lower processing requirements, and a diffusion transformer (DiT) for improved temporal consistency and generalization.
  • Memory-Efficient LLM Training with Online Subspace Descent. Online Subspace Descent is a novel optimizer that improves the memory efficiency of LLM training.
  • Generative Verifiers: Reward Modeling as Next-Token Prediction. Reward models are typically trained as discriminative classifiers. In this DeepMind work, the reward signal is instead the yes/no logits of a language model; letting the verifier use chain-of-thought reasoning and ensembling increased performance by sixteen percent (a minimal scoring sketch appears after this list).
  • Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress. By strategically routing synthetic-data generation across a pool of teacher models, exploiting performance differences between them rather than relying on a single oracle teacher, Cohere's Aya model significantly increased its win rate over baseline models.
  • Text2SQL is Not Enough: Unifying AI and Databases with TAG. Table-Augmented Generation (TAG) is a new paradigm that answers complex natural-language queries by combining databases and language models (a minimal loop is sketched after this list).
  • The Mamba in the Llama: Distilling and Accelerating Hybrid Models. Because Mamba models keep no KV cache to backtrack into, they are difficult to accelerate with speculative decoding. This paper, from some of the original Mamba authors, presents new distillation techniques and acceleration algorithms that address this.
  • Efficient LLM Scheduling by Learning to Rank. Head-of-line blocking occurs when serving many concurrent requests to a large language model, since we don't know in advance how long each output will take to generate. By learning to rank requests by their relative output lengths, the shortest requests can be served first, increasing throughput for multi-batch generation by 6.5 times (see the sketch after this list).
  • MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders. MTMamba++ is a new model architecture for multi-task dense scene understanding. It captures long-range dependencies and enhances cross-task interactions using a Mamba-based decoder built from two core blocks, the self-task Mamba (STM) block and the cross-task Mamba (CTM) block.
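Several of the ideas above are concrete enough to sketch in code. The snippets that follow are minimal illustrations under stated assumptions, not the papers' actual implementations; in particular, ask_llm is a hypothetical stand-in for whatever chat-completion client you use. First, the Meta Agent Search loop from the agentic-systems paper: the meta agent proposes a new agent as code, a callback evaluates it on held-out tasks, and good candidates grow the archive.

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your chat-completion client")

def meta_agent_search(evaluate, n_iters: int = 10):
    """Iteratively propose, test, and archive new agents (hedged sketch)."""
    archive: list[tuple[str, float]] = []
    for _ in range(n_iters):
        # The meta agent writes a new agent in code, conditioned on the
        # archive of previously discovered agents and their scores.
        candidate = ask_llm(
            f"Previously discovered agents and their scores:\n{archive}\n"
            "Propose a novel agent as a Python program:"
        )
        score = evaluate(candidate)      # run the candidate on held-out tasks
        archive.append((candidate, score))
    return max(archive, key=lambda pair: pair[1])
```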
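The two-stage robustness prompt reduces to "identify, then filter." A hedged sketch, again using the hypothetical ask_llm stand-in:

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your chat-completion client")

def answer_with_self_mitigation(question: str, context: str) -> str:
    # Stage 1: ask the model to flag context sentences irrelevant to the question.
    flagged = ask_llm(
        "Copy, verbatim and one per line, every sentence in the context that is "
        f"irrelevant to the question.\n\nQuestion: {question}\n\nContext: {context}"
    )
    # Stage 2: filter the flagged sentences out, then answer over the cleaned context.
    cleaned = context
    for sentence in flagged.splitlines():
        cleaned = cleaned.replace(sentence.strip(), "")
    return ask_llm(f"Question: {question}\n\nContext: {cleaned}\n\nAnswer:")
```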
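MagicDec builds on standard speculative decoding; the generic acceptance rule at its core can be sketched as below (the paper's contribution is showing when this pays off at long context and large batch sizes, not the rule itself). Here q is the draft model's next-token distribution and p the target model's:

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_or_resample(q: np.ndarray, p: np.ndarray) -> int:
    """Standard speculative-sampling step: the returned token is distributed
    exactly according to the target distribution p."""
    x = rng.choice(len(q), p=q)                # draft model proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):   # accept with prob min(1, p/q)
        return int(x)
    residual = np.maximum(p - q, 0.0)          # on rejection, resample from (p - q)+
    return int(rng.choice(len(p), p=residual / residual.sum()))

# Toy usage with a 4-token vocabulary:
q = np.array([0.4, 0.3, 0.2, 0.1])
p = np.array([0.1, 0.2, 0.3, 0.4])
print(accept_or_resample(q, p))
```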
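PEDAL's hybrid self-ensembling likewise fits in a few lines: greedy-decode one candidate per diverse exemplar set, then let the LLM aggregate instead of majority-voting, which is what keeps the token cost below self-consistency. A sketch with the same hypothetical ask_llm:

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your chat-completion client")

def pedal_answer(question: str, exemplar_sets: list[list[str]]) -> str:
    # Greedy-decode one candidate answer per diverse exemplar set.
    candidates = [
        ask_llm("\n".join(exemplars) + f"\n\nQ: {question}\nA:")
        for exemplars in exemplar_sets
    ]
    # Aggregate the candidates with the LLM rather than by majority vote.
    listing = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return ask_llm(
        f"Q: {question}\nCandidate answers:\n{listing}\n"
        "Synthesize the single best final answer:"
    )
```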
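The inheritance step of Inheritune is simple enough to show directly. This sketch assumes the teacher's transformer blocks are exposed as a PyTorch nn.ModuleList, and omits embeddings, the LM head, and the subsequent training loop:

```python
import copy

import torch.nn as nn

def inheritune_init(teacher_blocks: nn.ModuleList, k: int) -> nn.ModuleList:
    """Seed a small student LM with the first k transformer blocks of the
    teacher; the student is then trained on ~0.1% of the original tokens."""
    return copy.deepcopy(teacher_blocks[:k])

# Toy usage: inherit 4 of 16 stand-in "blocks".
teacher = nn.ModuleList(nn.Linear(8, 8) for _ in range(16))
student_blocks = inheritune_init(teacher, k=4)
print(len(student_blocks))  # 4
```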
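Scoring with a generative verifier amounts to reading the verifier's "Yes" vs "No" logits. A hedged sketch assuming a Hugging Face-style causal LM and tokenizer; the prompt wording and token handling here are illustrative, not DeepMind's exact setup:

```python
import torch

def generative_reward(model, tokenizer, question: str, solution: str) -> float:
    """Score a candidate solution as P('Yes') under the verifier LM."""
    prompt = (
        f"Question: {question}\nProposed solution: {solution}\n"
        "Is the solution correct? Answer Yes or No:"
    )
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # next-token logits
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    # Reward = softmax over just the Yes/No logits, read off P(Yes).
    return torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()
```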
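The TAG loop in miniature: the LM writes the query, the database (not the LM) does the exact computation, and the LM reasons over the returned rows. ask_llm is again a stand-in:

```python
import sqlite3

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your chat-completion client")

def table_augmented_answer(db_path: str, schema: str, question: str) -> str:
    # Step 1: the LM translates the question into a query over the tables.
    sql = ask_llm(
        f"Schema:\n{schema}\nWrite one SQLite query that helps answer: {question}"
    )
    # Step 2: the database executes it exactly.
    rows = sqlite3.connect(db_path).execute(sql).fetchall()
    # Step 3: the LM reasons over the retrieved rows to produce the answer.
    return ask_llm(f"Question: {question}\nRelevant rows: {rows}\nAnswer:")
```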
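Finally, rank-based scheduling is approximately shortest-job-first over predicted output lengths. In this toy sketch the predictor is just prompt length, which is only a stand-in; a real predictor would be a small model trained to rank prompts by expected generation length:

```python
def schedule_shortest_first(requests: list[str], predicted_len) -> list[str]:
    """Order pending requests by a learned ranking of expected output length,
    so short generations are not stuck behind long ones."""
    return sorted(requests, key=predicted_len)

# Toy usage with prompt length as the stand-in predictor.
batch = ["Write a 2,000-word essay on RLHF.", "Say hi.", "Define perplexity."]
print(schedule_shortest_first(batch, predicted_len=len))
```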

News

Resources

Perspectives

Meme of the week

What do you think? Did any of this news catch your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn; I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

Or you may be interested in one of my recent articles:


Salvatore Raieli

Senior data scientist | writing about science, machine learning, and AI. Top writer in Artificial Intelligence.