WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 28 October — 3 November

Meta Introduces Spirit LM, Apple Launches Apple Intelligence on New iMac, Cohere’s Embed 3 Multimodal Search Model, Google’s Invisible Watermark for AI-Generated Text, ChatGPT can search the internet, and much more

Salvatore Raieli
18 min read · Nov 4, 2024
Photo by Adeolu Eletu on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field


Research

  • A Theoretical Understanding of Chain-of-Thought. reveals that incorporating both correct and incorrect reasoning paths in demonstrations enhances the accuracy of intermediate steps and Chain-of-Thought (CoT) processes. The new approach, Coherent CoT, substantially boosts performance across multiple benchmarks. Specifically, Gemini Pro shows a 6.60% improvement on the Tracking Shuffled Objects dataset (rising from 58.20% to 64.80%), while DeepSeek 67B achieves a 6.17% increase on the Penguins in a Table dataset (from 73.97% to 80.14%).
  • LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering. improves RAG’s comprehension of long-context knowledge, incorporating global insights and factual specifics. It features a hybrid retriever, an LLM-enhanced information extractor, a Chain-of-Thought (CoT) guided filter, and an LLM-augmented generator. These core components empower the RAG system to extract global long-context information and accurately capture factual details. LongRAG demonstrates superior performance, surpassing long-context LLMs by 6.94%, advanced RAG by 6.16%, and Vanilla RAG by 17.25%.
  • Evaluating feature steering: A case study in mitigating social biases. examines feature steering in LLMs through an experiment that adjusts various features to observe shifts in model outputs, specifically focusing on 29 features related to social biases to determine whether feature steering can reduce these biases. Findings reveal that while feature steering can sometimes cause unintended effects, incorporating a neutrality feature effectively reduces social biases across 9 social dimensions without compromising text quality (a minimal sketch of the steering idea appears after this list).
  • Large Language Models Reflect the Ideology of Their Creators. reveals that LLMs display varied ideological perspectives, often mirroring the worldview of their creators. It observes consistent normative differences in responses when the same LLM operates in Chinese versus English and highlights normative disagreements between Western and non-Western LLMs regarding prominent figures in geopolitical conflicts.
  • Scalable watermarking for identifying large language model outputs. introduces SynthID-Text, a text-watermarking approach designed to maintain text quality in LLM outputs, achieve high detection accuracy, and reduce latency. It incorporates watermarking through speculative sampling, using a final score pattern for model word choices alongside adjusted probability scores. The authors evaluate the method’s feasibility and scalability by analyzing feedback on nearly 10 million Gemini responses (a generic illustration of watermark detection follows this list).
  • A Comparative Study on Reasoning Patterns of OpenAI’s o1 Model. finds that o1 outperformed other test-time computing methods across most datasets. The authors note that the primary reasoning patterns in o1 are divide and conquer and self-refinement, with the model adapting its reasoning strategy to specific tasks. For commonsense reasoning, o1 frequently employs context identification and focuses on constraints, while for math and coding tasks, it predominantly utilizes method reuse and divide-and-conquer approaches.
  • Sparse Crosscoders for Cross-Layer Features and Model Diffing. Crosscoders are an advanced form of sparse autoencoders designed to enhance the understanding of language models’ internal mechanisms.
  • Distill Visual Chart Reasoning Ability from LLMs to MLLMs. Code-as-Intermediary Translation (CIT) is an innovative technique aimed at improving visual reasoning in multimodal language models (MLLMs) by leveraging code to convert chart visuals into textual descriptions.
  • Probabilistic Language-Image Pre-Training. Probabilistic Language-Image Pre-training (ProLIP) is a vision-language model (VLM) designed to learn probabilistically from image-text pairs. Unlike traditional models that rely on strict one-to-one correspondence, ProLIP captures the complex many-to-many relationships inherent in real-world data.
  • A faster, better way to train general-purpose robots. MIT researchers have developed Heterogeneous Pretrained Transformers (HPT), a novel model architecture inspired by large language models, designed to train adaptable robots by utilizing data from multiple domains and modalities.
  • A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs. In this work, DeepMind demonstrates how a small language model can be used to provide soft supervision labels and identify informative or challenging data points for pretraining, significantly accelerating the pretraining process.
  • NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction. The NeuroClips framework introduces advancements in reconstructing continuous videos from fMRI brain scans by decoding both high-level semantic information and fine-grained perceptual details.
  • Machine-guided design of cell-type-targeting cis-regulatory elements. A generalizable framework to prospectively engineer cis-regulatory elements from massively parallel reporter assay models can be used to write fit-for-purpose regulatory code.
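
To make the feature-steering item above more concrete, here is a minimal sketch of activation-level steering: add a scaled feature direction to a layer's hidden states and regenerate. The direction vector, dimensions, and strength value are purely illustrative stand-ins, not the study's actual features or implementation.

```python
import numpy as np

def steer_activations(hidden: np.ndarray, feature_direction: np.ndarray, strength: float) -> np.ndarray:
    """Add a scaled feature direction to residual-stream activations.

    hidden: (seq_len, d_model) activations at some layer.
    feature_direction: (d_model,) vector for the feature being steered.
    strength: positive values amplify the feature, negative values suppress it.
    """
    direction = feature_direction / np.linalg.norm(feature_direction)
    return hidden + strength * direction  # broadcasts over sequence positions

# Hypothetical usage: nudge a "neutrality" feature upward during generation.
d_model = 512
hidden = np.random.randn(16, d_model)      # stand-in for real layer activations
neutrality = np.random.randn(d_model)      # stand-in for a learned feature direction
steered = steer_activations(hidden, neutrality, strength=4.0)
```

In this framing, the study's "steering sweet spot" corresponds to a strength large enough to shift the targeted behavior but small enough to leave text quality intact.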
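
The watermarking entry above pairs naturally with a small illustration of how detection works in general: generation biases token choices using a keyed pseudorandom score, and the detector checks whether the observed tokens score unusually high under the same key. This is a generic sketch of that idea, not SynthID-Text's actual scoring function; the function names, key, and threshold are assumptions.

```python
import hashlib

def g_value(key: str, prev_token: int, candidate: int) -> float:
    """Keyed pseudorandom score in [0, 1) for a candidate token given its context."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{candidate}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def watermark_score(tokens: list[int], key: str) -> float:
    """Mean g-value over a token sequence; watermarked text should score above ~0.5."""
    scores = [g_value(key, prev, cur) for prev, cur in zip(tokens, tokens[1:])]
    return sum(scores) / len(scores)

# Hypothetical usage: compare a suspect text's score against the ~0.5 expected
# for unwatermarked text.
print(watermark_score([101, 2054, 2003, 1037, 17704], key="secret-watermark-key"))
```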

News

Resources

  • Agentic Information Retrieval. offers an overview of agentic information retrieval, driven by the abilities of LLM agents; explores various advanced applications of agentic information retrieval and addresses related challenges.
  • Aya Expanse. introduces a suite of open-weight foundation models designed for multilingual proficiency, featuring 8B and 32B parameter models and one of the largest multilingual datasets to date, containing 513 million examples. The release also includes Aya-101, which is claimed to be the most extensive multilingual model, supporting 101 languages. Aya Expanse 32B surpasses the performance of Gemma 2 27B, Mistral 8x22B, and Llama 3.1 70B, even though it is half the size of the latter.
  • A Survey on Data Synthesis and Augmentation for Large Language Models. offers an in-depth overview of data generation techniques throughout the LLM lifecycle, covering topics such as data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and practical applications.
  • granite-3.0-language-models. introduces a range of lightweight foundation models from 400 million to 8 billion parameters, optimized for tasks such as coding, retrieval-augmented generation (RAG), reasoning, and function calling. Designed for enterprise applications, these models support on-premise and on-device deployment, showing robust performance across academic benchmarks in language understanding, reasoning, coding, function calling, and safety.
  • Pixtral-12B-Base-2409. Pixtral 12B base model weights have been released on Hugging Face.
  • Arcade, a new AI product creation platform, designed this necklace. Arcade AI has developed a generative platform that allows users to create distinctive, high-quality jewelry items simply from text prompts — and the exciting part is that you can purchase the designs you generate.
  • Retrieval-Augmented Diffusion Models for Time Series Forecasting. The Retrieval-Augmented Time Series Diffusion model (RATD) introduces a retrieval and guidance mechanism to enhance stability and performance in time series diffusion models. RATD operates in two steps: first, it retrieves relevant historical data from a database, and then uses this information as a reference to guide the denoising phase.
  • NotebookLlama: An Open Source version of NotebookLM. Meta has published a quick start guide to help users build a simplified version of Google’s popular NotebookLM system.
  • How I Studied LLMs in Two Weeks: A Comprehensive Roadmap. This article presents a 14-day roadmap for mastering LLM fundamentals, covering key topics such as self-attention, hallucinations, and advanced methods like Mixture of Experts. It offers resources for building an LLM from the ground up, alongside curated literature and online materials, all organized within a GitHub repository. Emphasizing a tailored learning experience, the article underscores the importance of foundational skills in math, programming, and deep learning.
  • Marly. Marly is an open-source data processor that enables agents to query unstructured data using JSON, streamlining data interaction and retrieval.
  • LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias. It was previously believed that novel view synthesis depended heavily on strong 3D inductive biases. This study demonstrates that, with scale and a minimal inductive bias, it’s possible to significantly surpass these previously assumed limitations.
  • Continuous Speech Synthesis using per-token Latent Diffusion. Autoregressive models continue to excel in many applications, yet recent advancements with diffusion heads in image generation have led to the concept of continuous autoregressive diffusion. This research broadens the scope of per-token diffusion to accommodate variable-length outputs.
  • CDChat: A Large Multimodal Model for Remote Sensing Change Description. This paper presents a change description instruction dataset aimed at fine-tuning large multimodal models (LMMs) to enhance change detection in remote sensing.
  • IC-Light V2 (Flux-based IC-Light models). IC-Light currently offers one of the most effective methods for relighting images with a pre-trained text-to-image backbone. This discussion marks the initial steps toward extending that capability to the more powerful Flux models.
  • The Scene Language: Representing Scenes with Programs, Words, and Embeddings. Creating 3D scenes from scratch presents significant challenges, including data limitations. This research introduces a programming-like language for describing 3D scenes and demonstrates that Claude Sonnet can produce highly realistic scenes even without specific training for this task.
  • 3D Semantic Segmentation. FtD++ is a cross-modal learning approach designed to enhance unsupervised domain adaptation in 3D semantic segmentation tasks.
  • Open source replication of crosscoder on Gemma 2B. Anthropic recently published two studies showcasing its novel interpretability method. This post provides an open replication of the crosscoder on the Gemma 2B model.
  • Awesome-Graph-OOD-Learning. This repository lists papers on graph out-of-distribution learning, covering three primary scenarios: graph OOD generalization, training-time graph OOD adaptation, and test-time graph OOD adaptation.
  • OpenWebVoyager: Building Multimodal Web Agents. OpenWebVoyager offers tools, datasets, and models designed to build multimodal web agents that can navigate and learn from real-world web interactions.
  • Automated Colorization for Animation. Researchers have introduced an innovative inclusion-matching technique that overcomes challenges in automated colorization, particularly for animations where occlusions and wrinkles complicate traditional segment matching.
  • Lofi Music Dataset. A dataset containing music clips paired with detailed text descriptions, generated by a music creation model.
  • Learning to Handle Complex Constraints for Vehicle Routing Problems. Researchers have developed a Proactive Infeasibility Prevention (PIP) framework designed to enhance neural network performance on Vehicle Routing Problems (VRPs) that involve challenging constraints.
  • Unleashing the Power of AI on Mobile: LLM Inference for Llama 3.2 Quantized Models with ExecuTorch and KleidiAI. PyTorch has made significant strides with ExecuTorch, a tool that enables AI model deployment at the edge, greatly enhancing the performance and efficiency of various end systems.
  • CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution. CompassJudger-1 is the first open-source, comprehensive judge model created to enhance the evaluation process for large language models (LLMs).
  • MINT-1T. MINT-1T, a vast open-source multimodal dataset, has been released with one trillion text tokens and 3.4 billion images, incorporating diverse content from HTML, PDFs, and ArXiv papers. This dataset, roughly ten times larger than previous collections, is intended to accelerate advancements in large-scale multimodal machine learning research.
  • LARP: Tokenizing Videos 🎬 with a Learned Autoregressive Generative Prior 🚀. LARP is a novel video tokenizer designed to enhance video generation in autoregressive (AR) models by prioritizing global visual features over individual patch-based details.
  • OpenAI’s new hallucination benchmark. OpenAI has released the SimpleQA benchmark, which measures a model’s ability to answer short, factual questions.
  • ThunderKittens. ThunderKittens is a framework for writing highly efficient GPU kernels. It builds on the observation that GPUs are optimized for working with compact 16x16 data tiles, which keeps the framework easy to use: with this approach, kernels that run 40% faster can be written in only a few hundred lines of code.
  • Skinned Motion Retargeting with Dense Geometric Interaction Perception. MeshRet has developed an innovative method for enhancing motion retargeting for 3D characters, prioritizing the preservation of body geometry interactions from the outset.
  • Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance. Researchers have improved Masked Generative Models (MGMs) by introducing a self-guidance sampling technique, which enhances image generation quality without compromising diversity.
  • Speeding Up Transformers with Token Merging. This project presents PiToMe, an algorithm that compresses Vision Transformers by gradually merging tokens after each layer, thereby decreasing the number of tokens processed (a simplified merge routine is sketched after this list).
  • PF3plat : Pose-Free Feed-Forward 3D Gaussian Splatting. PF3plat addresses the challenge of 3D reconstruction and novel view synthesis from RGB images without requiring additional data.
  • Fine-tuning LLMs to 1.58bit: extreme quantization made easy. BitNet, created by Microsoft Research, is a transformer architecture that lowers the computational and memory demands of large language models by using ternary precision (-1, 0, 1), equivalent to 1.58 bits per parameter. While the original architecture requires training models from scratch, this work shows that existing models can also be fine-tuned to the low-precision format while retaining high performance on downstream tasks. The technique greatly reduces energy consumption and speeds up inference through specialized kernels for efficient matrix multiplication (the ternary quantization rule is sketched after this list).
  • SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Recognition. SELECT is the inaugural extensive benchmark designed to evaluate various data curation methods in image classification. ImageNet++ is a newly developed dataset that augments ImageNet-1K by incorporating five additional training data variations, each curated through distinct techniques.
  • ODRL: A Benchmark for Off-Dynamics Reinforcement Learning. ODRL is the first standardized benchmark designed to assess reinforcement learning methods in environments with differing dynamics.
  • Text-to-Image Model to Generate Memes. Researchers have created an innovative adapter method for text-to-image models, enabling them to tackle complex tasks such as meme video generation while preserving the base model’s strong generalization abilities.
  • Anomaly Classification in Industry. AnomalyNCD is a multi-class anomaly classification framework intended to enhance traditional anomaly detection techniques in industrial environments.
  • MrT5: Dynamic Token Merging for Efficient Byte-level Language Models. Byte-level language models represent a move toward a token-free future, but the challenge of sequence length remains significant. Dynamically merging tokens can help fit more content within the same context.
  • BART vectoriZed. A new GPU-enabled implementation of Bayesian Additive Regression Trees (BART) significantly accelerates processing speed, making it up to 200 times faster than conventional CPU-based versions.
  • Huge new Diffusers release. The Hugging Face Diffusers package now includes new pipelines like Flux, Stable Audio, Kolors, CogVideoX, Latte, and others, alongside new methods such as FreeNoise and SparseCtrl, plus various refactors.
  • 4 experiments with voice AI models to help you explore culture. Google’s voice AI models allow users to engage with culture in innovative ways. Projects like Talking Tours provide AI-guided virtual tours, Mice in the Museum offers art narration, and Lip Sync animates lips to discuss cultural topics. These entertaining tools offer new perspectives on art and design.
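
For the token-merging entry above (PiToMe), the core ToMe-style idea can be sketched briefly: split the tokens into two sets, find the most similar cross-set pairs, and fold each merged token into its partner so the sequence shrinks by a fixed amount per layer. The snippet below is a simplified illustration under those assumptions, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Shrink a token sequence by r by folding the r most similar pairs together.

    x: (num_tokens, dim) token embeddings after a transformer layer.
    Returns (num_tokens - r, dim).
    """
    a, b = x[0::2].clone(), x[1::2].clone()          # bipartite split of the sequence
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    best_sim, best_b = sim.max(dim=1)                # most similar partner in b for each a
    merge_idx = best_sim.topk(r).indices             # the r tokens of a to merge away
    keep = torch.ones(a.size(0), dtype=torch.bool)
    keep[merge_idx] = False
    for i in merge_idx.tolist():                     # average each merged token into its partner
        b[best_b[i]] = (b[best_b[i]] + a[i]) / 2
    return torch.cat([a[keep], b], dim=0)

tokens = torch.randn(197, 768)                       # e.g. a ViT-B/16 token sequence
print(merge_tokens(tokens, r=16).shape)              # torch.Size([181, 768])
```

Applied after every layer, even a small r compounds into a large reduction in the number of tokens that later layers have to process.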
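
The BitNet fine-tuning entry above relies on ternary weights, and the absmean quantization rule reported for BitNet b1.58 is easy to write down: scale by the mean absolute value, round, and clip to {-1, 0, 1}. The sketch below only quantizes a weight tensor; real BitNet layers also quantize activations and use specialized kernels, which are omitted here.

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize weights to {-1, 0, 1} with a per-tensor scale (absmean rule)."""
    scale = w.abs().mean().clamp(min=eps)   # per-tensor scaling factor
    w_q = (w / scale).round().clamp(-1, 1)  # ternary values, ~1.58 bits per parameter
    return w_q, scale                       # at inference, w is approximated by w_q * scale

w = torch.randn(4096, 4096) * 0.02          # stand-in for a linear layer's weights
w_q, scale = absmean_ternary_quantize(w)
print(w_q.unique())                         # tensor([-1., 0., 1.])
print(((w - w_q * scale).norm() / w.norm()).item())  # relative reconstruction error
```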

Perspectives

  • ByteDance intern fired for planting malicious code in AI models. After rumors swirled that TikTok owner ByteDance had lost tens of millions after an intern sabotaged its AI models, ByteDance issued a statement this weekend hoping to silence all the social media chatter in China.
  • Thinking Like an AI. Large language models (LLMs) operate as advanced autocomplete systems, generating the next token based on a combination of their training data and current input. Small variations in input can influence predictions, resulting in different responses to the same question. Gaining insight into token prediction, training data context, and memory constraints can enhance effective AI usage.
  • An Interview with Salesforce CEO Marc Benioff about AI Abundance. Salesforce CEO Marc Benioff recently spoke about the company’s new AI initiative, Agentforce, showcasing its potential to transform enterprise applications and customer interactions. He contrasted Salesforce’s approach with Microsoft’s Copilot, describing Salesforce’s solution as more cohesive and impactful, thanks to its strong platform and data infrastructure. During the interview, Benioff stressed the significance of AI-driven “agentic” layers designed to boost customer service and improve operational efficiency across various industries.
  • How GPU Access Helps Startups Be Agile. Andreessen Horowitz’s Oxygen program tackles GPU shortages by offering startups in its portfolio more accessible and flexible GPU resources, allowing them to bypass price surges and supply limitations. This initiative enables AI startups to concentrate on product development without the pressure of long-term capital expenditure, emphasizing the need for equitable access to critical resources in the competitive AI field.
  • The Mask Comes Off: At What Price? OpenAI is approaching its shift to a public benefit corporation, a move that could impact its investor dynamics and its collaboration with Microsoft. This transition raises questions about control and valuation, particularly concerning the nonprofit’s stake, which could be substantial given OpenAI’s role in advancing AGI. The company’s future profitability and strategic course are closely tied to the safe development of AGI, a pursuit with enormous potential value.
  • What’s so special about the human brain? Torrents of data from cell atlases, brain organoids, and other methods are finally delivering answers to an age-old question.
  • ‘Educational’ apps are worth billions. We need to make sure they work. Partnerships between developers and researchers could help to improve the quality of educational apps and other technologies.
  • The huge protein database that spawned AlphaFold and biology’s AI revolution. Pioneering crystallographer Helen Berman helped to set up the massive collection of protein structures that underpins the Nobel-prize-winning tool’s success.
  • Extreme fire seasons are looming — science can help us adapt. Not all wildfires can be averted, but data, models, and collaborations can help to chart a course to a fire-resilient future.
  • AI-designed DNA sequences regulate cell-type-specific gene expression. Researchers have used artificial intelligence models to create regulatory DNA sequences that drive gene expression in specific cell types. Such synthetic sequences could be used to target gene therapies to particular cell populations.
  • Pushing the frontiers of audio generation. DeepMind has shared additional details about the audio generation models behind NotebookLM.
  • Evaluating feature steering: A case study in mitigating social biases. This study investigates the use of feature steering in AI models to adjust outputs in an interpretable way. It identifies a “steering sweet spot,” where modifications do not compromise performance. Results demonstrate that steering can adjust social biases within specific areas but may also produce unintended effects outside those targets. Continued research is necessary to enhance feature steering, aiming for safer and more dependable AI outcomes.
  • How we saved hundreds of engineering hours by writing tests with LLMs. Assembled leverages LLMs to speed up and enhance software testing, allowing tests to be generated in minutes rather than hours. This approach boosts engineering productivity, saving time and enabling a stronger focus on feature development. LLMs create thorough and precise tests that uphold code quality and sustain development speed.
  • How to train LLM as a judge to drive business value. “LLM as a Judge” is an approach for leveraging an existing language model to rank and score natural language. This post provides guidelines for effectively using this method to process or assess data (a minimal judging loop is sketched after this list).
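
To make the LLM-as-a-judge entry above concrete, here is a minimal judging loop under a few assumptions: the prompt wording, the 1-5 rubric, and the call_llm callable are illustrative, not taken from the linked post; any text-in/text-out chat client you already use can be passed in.

```python
import json

JUDGE_PROMPT = """You are a strict evaluator. Score the RESPONSE to the QUESTION
on a 1-5 scale for factual accuracy and helpfulness.
Return JSON: {{"score": <int 1-5>, "reason": "<one sentence>"}}

QUESTION: {question}
RESPONSE: {response}"""

def judge(question: str, response: str, call_llm) -> dict:
    """Ask a judge model to grade a response; call_llm is any text-in/text-out client."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"score": None, "reason": "judge returned non-JSON output"}

# Hypothetical usage with whatever chat-completion client you already have:
# result = judge("What is the capital of France?", "Paris.", call_llm=my_client)
# print(result["score"], result["reason"])
```

In practice, the judge should be calibrated against a small set of human-labeled examples before its scores are trusted to drive decisions.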

Meme of the week

What do you think about it? Did any of this week’s news catch your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects, and you can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

or you may be interested in one of my recent articles:
