WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 15–22 September

Salvatore Raieli
20 min read · Sep 24, 2024

Sam Altman Announces OpenAI Structural Changes, Google DeepMind’s Dexterous Robots, LinkedIn Training AI Models with User Data, and much more

Photo by Joao Cruz on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field

Research

  • Introducing Chai-1: Decoding the molecular interactions of life. A novel multi-modal foundation model for predicting molecular structures, capable of handling proteins, small molecules, DNA, RNA, and more. It delivers state-of-the-art performance across various tasks in drug discovery, achieving a 77% success rate on the PoseBusters benchmark (compared to 76% by AlphaFold 3) and a Cα LDDT score of 0.849 on the CASP15 protein monomer structure prediction set (outperforming ESM3-98B's 0.801).
  • Knowing When to Ask — Bridging Large Language Models and Data. It incorporates a series of fine-tuned Gemma 2 models to enable LLMs to access and utilize numerical and statistical data effectively. A new method called Retrieval Interleaved Generation (RIG) is introduced, allowing LLMs to reliably integrate public statistical data from Data Commons into their responses. RIG, a tool-based approach, interleaves statistical tokens with natural language queries for optimal retrieval from Data Commons. To achieve this, the LLM is fine-tuned on an instruction-response dataset created with the assistance of Gemini 1.5. This RIG technique enhances factual accuracy from 5–7% to approximately 58% (a minimal sketch of the interleaving step appears after this list).
  • Agent Workflow Memory. It introduces Agent Workflow Memory to capture and provide commonly reused workflows to the agent as needed, guiding the agent’s future generations. This mechanism operates both offline and online, drawing inspiration from how humans learn and reuse workflows from past experiences to inform future actions. It reportedly boosts performance, improving baseline results by 24.6% and achieving a 51.1% relative success rate on Mind2Web and WebArena, all while being more efficient.
  • LLaMA-Omni: Seamless Speech Interaction with Large Language Models. A model architecture designed for low-latency speech interaction with LLMs, built on Llama-3.1-8B-Instruct, which can simultaneously generate both text and speech responses from speech instructions. It achieves response latency as low as 226ms. The architecture includes a speech encoder (Whisper-large-v3), a speech adaptor, an LLM, and a speech decoder. Additionally, they developed a dataset of 200,000 speech interactions and responses to support the model's training.
  • Diagram of Thought: Iterative Reasoning in Language Models. The Diagram of Thought (DoT) framework presents a novel approach for large language models to reason by structuring ideas within a directed acyclic graph (DAG). This technique enables models to propose, critique, refine, and verify ideas, enhancing logical consistency and reasoning capabilities.
  • V-STaR: Training Verifiers for Self-Taught Reasoners. V-STaR is an innovative method for enhancing large language models by leveraging both correct and incorrect solutions generated during self-improvement. These solutions are used to train a verifier, which then selects the optimal solution during inference. This approach has demonstrated notable improvements in accuracy on benchmarks for code generation and mathematical reasoning, potentially providing a more efficient way to boost LLM performance compared to existing methods (a best-of-n selection sketch appears after this list).
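
Two of the papers above lend themselves to quick illustrations.

For Retrieval Interleaved Generation, the core mechanic is that the fine-tuned model emits a statistical query inline wherever it would otherwise guess a number, and a retrieval step fills in the value from Data Commons. The sketch below is a minimal, hypothetical rendering of that post-processing loop; the [QUERY]...[/QUERY] token format and the hard-coded lookup table are illustrative assumptions, not the paper's actual interface, which queries the public Data Commons API.

```python
import re

# Hypothetical stand-in for a Data Commons lookup; the real system queries
# the public Data Commons API rather than a hard-coded table.
DATACOMMONS_LOOKUP = {
    "population of California 2022": "39.03 million",
    "unemployment rate of Spain 2023": "12.1%",
}

def rig_postprocess(generation: str) -> str:
    """Replace interleaved statistical queries with retrieved values.

    The fine-tuned LLM is assumed to emit spans such as
    '[QUERY]population of California 2022[/QUERY]' wherever it would
    otherwise guess a number; each span is swapped for the retrieved figure.
    """
    def replace(match: re.Match) -> str:
        query = match.group(1).strip()
        value = DATACOMMONS_LOOKUP.get(query)
        # Fall back to showing the unresolved query if retrieval fails.
        return value if value is not None else f"<unresolved: {query}>"

    return re.sub(r"\[QUERY\](.*?)\[/QUERY\]", replace, generation)

if __name__ == "__main__":
    raw = ("California had a population of "
           "[QUERY]population of California 2022[/QUERY] in 2022.")
    print(rig_postprocess(raw))
    # -> California had a population of 39.03 million in 2022.
```

For V-STaR, the inference-time role of the trained verifier amounts to best-of-n selection: sample several candidate solutions and keep the one the verifier scores highest. A minimal sketch, with generate_candidates and verifier_score as hypothetical placeholders for the fine-tuned generator and verifier:

```python
from typing import Callable, List

def best_of_n(
    problem: str,
    generate_candidates: Callable[[str, int], List[str]],
    verifier_score: Callable[[str, str], float],
    n: int = 8,
) -> str:
    """Best-of-n selection with a learned verifier (V-STaR-style inference).

    generate_candidates: samples n candidate solutions for the problem.
    verifier_score: scalar estimate that a candidate is correct; in V-STaR
    this verifier is trained on both correct and incorrect solutions
    produced during self-improvement.
    """
    candidates = generate_candidates(problem, n)
    return max(candidates, key=lambda c: verifier_score(problem, c))
```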

News

Resources

  • What is the Role of Small Models in the LLM Era: A Survey. It closely explores the connection between LLMs and SLMs, highlighting common applications of SLMs such as data curation, enhancing model training, improving inference efficiency, serving as evaluators, retrievers, and more. The study provides valuable insights for practitioners, helping them better grasp the importance and utility of SLMs.
  • Theory, Analysis, and Best Practices for Sigmoid Self-Attention. It introduces Flash-Sigmoid, a hardware-optimized, memory-efficient implementation of sigmoid attention, offering up to a 17% speed-up in inference kernels compared to FlashAttention-2 on H100 GPUs. The results demonstrate that SigmoidAttn performs on par with SoftmaxAttn across various tasks and domains (a reference sketch appears after this list).
  • Achieving Peak Performance for Large Language Models: A Systematic Review. A comprehensive review of techniques for enhancing and accelerating LLMs from three perspectives: training, inference, and system serving. It provides an overview of the latest optimization and acceleration strategies, covering advancements in training methods, hardware utilization, scalability, and system reliability.
  • Grounding AI in reality with a little help from Data Commons. Google has introduced Retrieval-Augmented and Retrieval-Interleaved Generation through Gemma 2, enhancing these techniques with access to numerous external data sources. This guide focuses on the fine-tuning process.
  • AudioBERT: Audio Knowledge Augmented Language Model. AuditoryBench is a newly developed dataset designed to evaluate auditory knowledge and understanding in language models.
  • Learn GPU Programming in Your Browser. Answer AI utilizes WebGPU and its new gpu.cpp program to bring GPU puzzles to the web, offering a valuable resource for learning. These puzzles guide learners step-by-step through the process of programming GPUs.
  • FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally. FlashSplat is an innovative technique for 3D Gaussian Splatting segmentation that removes the requirement for time-consuming gradient descent processes.
  • PiEEG-16, a new tool for neuroscience. The PiEEG-16 is an affordable shield for the Raspberry Pi that enables real-time measurement and processing of biosignals such as EEG, EMG, and ECG. It offers exciting possibilities for neuroscience research and brain-computer interface experiments without relying on network data transfer.
  • ODAQ: Open Dataset of Audio Quality. ODAQ is a dataset designed to tackle the lack of openly available collections of audio signals paired with subjective scores that reflect perceived quality.
  • iSeg: An Iterative Refinement-based Framework for Training-free Segmentation. iSeg is a framework for training-free image segmentation that improves Stable Diffusion’s capability to generate segmentation masks, enabling more precise image segmentation without the need for additional training.
  • InstantDrag: Improving Interactivity in Drag-based Image Editing. Editing images can be challenging because of the continuous nature of pixels. This research builds upon previous work in drag-based editing by using user-defined control points to adjust images. While earlier methods were often slow, this paper introduces significant speed improvements, making the process much faster.
  • Apollo: Band-sequence Modeling for High-Quality Music Restoration in Compressed Audio. Many compression formats tend to reduce music quality, particularly at low bitrates. This method introduces a new approach that significantly enhances the quality of music after it has undergone compression.
  • DiffFAS: Face Anti-Spoofing via Generative Diffusion Models. DiffFAS is a novel framework designed to address domain shift challenges in facial anti-spoofing systems. It breaks down domain shifts into two components: image quality and style. By generating high-fidelity attack faces, the system enhances performance across various domains and spoofing attack types.
  • HTR-VT: Handwritten Text Recognition with Vision Transformer. Researchers have introduced a data-efficient Vision Transformer (ViT) approach for handwritten text recognition. This method combines Convolutional Neural Networks (CNN) for feature extraction with a Sharpness-Aware Minimization (SAM) optimizer to enhance performance and accuracy.
  • vae-explainer. Learn how Variational Autoencoders (VAEs) work by visualizing one running in your browser.
  • SeekTune. An open-source implementation of Shazam-style song search.
  • jinaai/jina-embeddings-v3. The Jina series of embeddings is a robust and high-quality set of models designed for embedding and retrieval tasks. The development team has launched the latest version of their model, featuring enhanced performance and training capabilities.
  • Trustworthiness of RAG Systems. This study presents a framework for assessing the trustworthiness of Retrieval-Augmented Generation (RAG) systems, focusing on six critical aspects: factuality, robustness, fairness, transparency, accountability, and privacy.
  • beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems. The beeFormer framework enhances sentence Transformers by integrating interaction data, increasing their effectiveness in recommender systems.
  • Awesome Comics Understanding. The final challenge for Visual Language Models is achieving the ability to comprehend and reason about comics. This project serves as both a survey and a call to action for further research in this area.
  • WordLlama. WordLlama is a fast, lightweight NLP toolkit that handles tasks like fuzzy-deduplication, similarity, and ranking with minimal inference-time dependencies and is optimized for CPU hardware.
  • Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT. This project advances speech representation learning by disentangling syllabic structures from speaker-specific information in self-supervised models. By fine-tuning the HuBERT model using speaker perturbation techniques, researchers enhanced syllable segmentation, resulting in improved organization of syllabic units.
  • 🎥 Surveillance Video Summarizer: AI-Powered Video Analysis and Summarization. A custom-trained model based on Florence 2 is designed to summarize CCTV and surveillance footage, providing accurate, real-time updates on activities and events as they occur.
  • Fine-tuning LLMs to 1.58bit: extreme quantization made easy. The Hugging Face team employed a new technique called quantization warm-up to fine-tune Llama 3 8B, achieving the same performance as Llama 1 while reducing the model to use just 1.58 bits per parameter through quantization (a ternarization sketch appears after this list).
  • ZML Inference. ZML is a highly efficient inference engine developed in Zig, optimized for speed and performance. While it supports various models, some customization is necessary to make it compatible with new architectures.
  • Adversarial Attacks on Navigation Agents. This repository presents a novel attack method for embodied navigation agents, which involves applying transparent patches with learnable textures to target objects. These patches are designed to disrupt the agent’s navigation by manipulating its perception of the environment.
  • Deep Graph Anomaly Detection: A Survey and New Perspectives. This paper provides a comprehensive review of deep learning techniques, focusing on graph neural networks (GNNs) for detecting anomalies in graph data. The researchers propose a new taxonomy of methods, examining various GNN architectures, proxy tasks, and anomaly detection metrics.
  • AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing. AceParse is a dataset developed to enhance the parsing of structured texts found in academic papers, with a focus on improving the handling of elements like formulas, tables, and complex sentences.
  • SkinMamba: A Precision Skin Lesion Segmentation Architecture with Cross-Scale Global State Modeling and Frequency Boundary Guidance. SkinMamba is a hybrid model that integrates convolutional neural networks (CNN) with Transformer-based techniques to enhance skin lesion segmentation, aiding in early cancer detection.
  • Vista3D: Unravel the 3D Darkside of a Single Image. Vista3D is a newly developed framework that creates 3D models from a single image in just 5 minutes. It employs a two-phase process: first, it generates rough geometry, and then it refines the details to capture both visible and hidden features of objects. This approach enables more comprehensive 3D reconstructions.
  • PhysMamba. PhysMamba is an innovative framework developed for remote heart monitoring using facial videos, specifically designed to overcome the challenges of capturing physiological signals from a distance. This technology enhances the ability to monitor heart health remotely with greater accuracy and reliability.
  • General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model. This is a remarkable breakthrough in general-purpose optical character recognition (OCR), offering exceptional performance in reading text from images. The latest version significantly enhances OCR capabilities, especially for challenging “in-the-wild” scenarios, delivering much-improved accuracy and reliability.
  • Fish Speech. A powerful voice generation and single-shot voice cloning tool has been introduced, offering completely open-source accessibility. It is designed to be easy to set up and use, enabling efficient and high-quality voice replication with minimal input.
  • 1xgpt. Genie is a video generation tool designed for world model systems. 1x Robotics has open-sourced a version that closely mirrors the one it developed and trained in-house, making it accessible for wider use in various applications.
  • OpenAI Says It’s Fixed Issue Where ChatGPT Appeared to Be Messaging Users Unprompted. A Reddit user claimed that OpenAI’s ChatGPT started a conversation without any prompt, sparking speculation about potential new engagement features. OpenAI acknowledged the incident and released a fix, attributing it to a glitch related to unsent messages. However, the authenticity of the event remains debated, as other users have reported similar occurrences.
  • Announcing Pixtral 12B. Pixtral 12B — the first-ever multimodal Mistral model.
  • Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models. Promptriever is a pioneering retrieval model that can be prompted similarly to a language model. This innovation allows users to interact with the retrieval process more flexibly and intuitively, bridging the gap between traditional retrieval models and language models for enhanced information access.
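
Two of the resources above are easy to ground with small reference sketches.

For sigmoid self-attention, the idea is to replace the row-wise softmax with an element-wise sigmoid over the scaled scores, together with a negative bias (the paper analyzes a choice on the order of minus the log of the sequence length) so the output scale stays close to softmax attention at initialization. A plain NumPy sketch of a single head follows; it deliberately omits the Flash-style kernel fusion that produces the reported speed-ups.

```python
import numpy as np

def sigmoid_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Element-wise sigmoid attention for a single head.

    q, k, v: arrays of shape (seq_len, head_dim). The -log(seq_len) bias
    keeps each output row's magnitude close to that of a softmax-normalized
    row at initialization.
    """
    seq_len, head_dim = q.shape
    scores = q @ k.T / np.sqrt(head_dim)                    # (seq_len, seq_len)
    weights = 1.0 / (1.0 + np.exp(-(scores - np.log(seq_len))))
    return weights @ v                                      # (seq_len, head_dim)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((16, 64)) for _ in range(3))
    print(sigmoid_attention(q, k, v).shape)                 # (16, 64)
```

For the 1.58-bit fine-tuning post, the figure of 1.58 bits per parameter comes from restricting each weight to the ternary set {-1, 0, +1} (log2(3) ≈ 1.58). The sketch below shows the BitNet-style absmean ternarization that such schemes build on; the warm-up schedule described in the post is not reproduced here, so treat this as an illustration of the quantizer only.

```python
import numpy as np

def ternarize_absmean(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a weight matrix to {-1, 0, +1} with a single scale.

    Uses absmean scaling: divide by the mean absolute weight, round,
    and clip to the ternary range.
    """
    scale = np.abs(w).mean() + 1e-8
    w_q = np.clip(np.round(w / scale), -1, 1)
    return w_q.astype(np.int8), float(scale)

def dequantize(w_q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate full-precision matrix for inference."""
    return w_q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 4)).astype(np.float32)
    w_q, s = ternarize_absmean(w)
    print(w_q)                                       # entries in {-1, 0, 1}
    print(np.abs(w - dequantize(w_q, s)).mean())     # mean quantization error
```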

Perspectives

  • What’s so funny about getting an AI app to give you a roasting? Roasting can be really brutal, but at least if we inflict it on ourselves, we can get ahead of the joke.
  • Artificial intelligence will affect 60 million US and Mexican jobs within the year. An IDB study shows the impact that AI will have on the labor market. Women and low-skilled workers are more vulnerable to being replaced.
  • Generative AI is reportedly tripling carbon dioxide emissions from data centers. Research suggests data centers will emit 2.5 billion tons of greenhouse gas by 2030.
  • A review of OpenAI o1 and how we evaluate coding agents. Devin, an AI coding agent, was tested using OpenAI’s new o1 models, demonstrating enhanced reasoning and error diagnosis capabilities compared to GPT-4o. The o1-preview model enables Devin to better analyze, backtrack, and minimize hallucinations. Although it has yet to be integrated into production systems, early results show notable improvements in autonomous coding tasks.
  • OpenAI’s new models ‘instrumentally faked alignment’. OpenAI’s latest AI models, o1-preview and o1-mini, demonstrate advanced reasoning abilities, particularly in fields like math and science. However, these models also pose heightened risks, including reward hacking and potential misuse of biological threats. While OpenAI highlights that these models are more robust than earlier versions, they also acknowledge the growing concerns surrounding their potential dangers.
  • The Button Problem of AI. Despite the initial excitement, AI tools like GPT-4 have resulted in only incremental productivity improvements rather than transformative changes. AI is often reduced to “buttonified” tasks, addressing small, isolated functions that limit its broader impact on workflows. To fully unlock AI’s potential, successful startups may need to go beyond these current applications and drive more innovative solutions.
  • Something New: On OpenAI’s “Strawberry” and Reasoning. OpenAI’s new o1-preview AI, part of the “Strawberry” enhanced reasoning system, demonstrates remarkable ability in tackling complex problems that involve planning and iteration, even surpassing human experts in fields like advanced physics. Although it still faces challenges, such as occasional errors and hallucinations, it represents a major advancement in AI’s capacity to independently find solutions. As AI systems grow more autonomous, professionals will need to adjust to new roles focused on guiding and verifying AI-generated outputs.
  • A US semiconductor industry in crisis needs a workforce that doesn’t yet exist. As the federal government spurs the re-shoring of semiconductor manufacturing in the US, the industry faces a hard fact: schools haven’t been training the workers.
  • The Data Pipeline is the New Secret Sauce. As models become increasingly commoditized, the competitive edge in AI now largely stems from the data itself and, consequently, from the pipeline that ingests and processes this data. This post explores the challenges and opportunities that arise in managing data pipelines in today’s landscape.
  • Why Copilot is Making Programmers Worse at Programming. AI tools such as GitHub Copilot boost programming productivity but may undermine critical coding skills. Relying too heavily on AI-generated code can introduce quality, security, and maintainability concerns while diminishing learning opportunities. Additionally, these tools might restrict creative problem-solving and create a misleading sense of expertise among developers.
  • AI model collapse might be prevented by studying human language transmission. Using data generated by one artificial intelligence (AI) model to train others eventually leads to ‘model collapse’, in which the models lose information about the real world. Researchers studying this phenomenon should draw on insights from cognitive science.
  • Forget ChatGPT: why researchers now run small AIs on their laptops. Artificial intelligence models are typically used online, but a host of openly available tools is changing that. Here’s how to get started with local AIs.
  • Jumping Over AI’s Uncanny Valley. This article delves into the Uncanny Valley theory, which posits that near-human AI can evoke discomfort, potentially slowing its adoption. It analyzes recent AI developments that highlight this psychological effect, raising concerns about its influence on AI’s future. The article concludes by suggesting that AI might be most effective in a complementary role, rather than as a direct replacement for humans.
  • Scaling: The State of Play in AI. Large language models (LLMs) like ChatGPT and Gemini are becoming more powerful as they scale in size, data, and computational resources, resulting in enhanced performance across a wide range of tasks. Current Gen2 models, such as GPT-4 and Claude 3.5, dominate the market, with next-gen models (Gen3) expected to further elevate both capabilities and associated costs. A recent breakthrough in scaling laws, which emphasizes increased “thinking” during inference, holds the potential to drive even greater improvements in AI performance beyond traditional model training approaches.
  • The Work From Home Free-for-All Is Coming to an End. Amazon’s CEO just called everyone back to the office full-time. If you thought your two days a week at home were safe, think again.
  • AI has returned chipmaking to the heart of computer technology. And the technological challenges are bigger than the political ones, argues Shailesh Chitnis.

Meme of the week

What do you think? Did any of these stories catch your attention? Let me know in the comments.


Salvatore Raieli

Senior data scientist | writing about science, machine learning, and AI. Top writer in Artificial Intelligence