WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 21–27 October

Amazon Introduces AI-Generated Audio Ads, Claude AI Introduces New Capabilities, and much more

Salvatore Raieli
21 min read · Oct 28, 2024
Photo by Myznik Egor on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first in GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field


Research

  • Thinking LLMs: General Instruction Following with Thought Generation. The proposed training method aims to enhance LLMs with thinking capabilities for general instruction-following without relying on human-annotated data. It employs an iterative search and optimization process to facilitate thought generation, allowing the model to learn without direct supervision. For each user instruction, potential thoughts are evaluated using a judge model, which scores only the responses to identify the best and worst options. The resulting full outputs are then used as selected and rejected pairs for DPO (termed Thought Preference Optimization in this paper). This approach demonstrates superior performance on AlpacaEval and Arena-Hard.
  • Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence. A new collaborative search algorithm is proposed to adapt LLMs using swarm intelligence, where a group of LLM experts collaboratively navigates the weight space to optimize a utility function that reflects various adaptation objectives. Experiments show that Model Swarms can effectively adjust LLM experts for a single task, multi-task domains, reward models, and a range of human interests. This approach outperforms 12 model composition baselines by up to 21.0% across different tasks and contexts.
  • First-Person Fairness in Chatbots. This study explores first-person fairness, focusing on the fairness of interactions between users and ChatGPT, particularly examining any biases related to users’ names. It utilizes a model powered by GPT-4o to analyze patterns and name sensitivity in the chatbot’s responses based on different user names. The findings suggest that post-training significantly reduces harmful stereotypes overall. However, in areas such as entertainment and art, especially with open-ended tasks, the study reveals a higher level of bias, indicating a tendency to create narratives featuring protagonists whose gender aligns with the gender inferred from the user’s name.
  • Looking Inward: Language Models Can Learn About Themselves by Introspection. The report indicates that LLMs can gain knowledge through introspection that is not directly derivable from their training data. It suggests that these models possess privileged information about themselves, which could contribute to creating more interpretable and controllable systems. However, it also notes that this introspective ability has limitations, as models often struggle to predict their own behavior on tasks that require reasoning over extended outputs.
  • Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. This proposal introduces a unified autoregressive framework for multimodal understanding and generation that decouples visual encoding into independent pathways while still using a single transformer architecture. The decoupling improves flexibility and performance in both visual understanding and generation tasks. The framework claims to mitigate the trade-offs that arise when a single visual encoder has to serve both kinds of task. As a result, it outperforms previous unified models and matches or exceeds the performance of task-specific models.
  • Inference Scaling for Long-Context Retrieval Augmented Generation. This study uses two strategies to explore inference scaling for Retrieval-Augmented Generation (RAG): in-context learning (DRAG) and iterative prompting (IterDRAG). It finds that RAG performance improves steadily as the effective context length increases, provided configurations are optimized. Under optimal conditions, increasing inference computation yields linear improvements in long-context RAG performance. This insight leads to a computation allocation model that offers practical guidance for distributing compute in long-context RAG settings.
  • Agent S: An Open Agentic Framework that Uses Computers Like a Human. A novel open agentic framework has been developed to facilitate autonomous interactions with computers via a graphical user interface (GUI). Named Agent S, this framework addresses challenges such as knowledge acquisition, long-horizon planning, and managing dynamic interfaces. It introduces experience-augmented hierarchical planning that combines search and retrieval methods. Additionally, it utilizes an agent-computer interface to enable reasoning and control over GUI agents. Evaluation on the OSWorld benchmark demonstrates that Agent S surpasses the baseline by 9.37% in success rate, representing an 83.6% relative improvement, and sets a new state-of-the-art performance.
  • Exploring Model Kinship for Merging Large Language Models. The study introduces the concept of model kinship to assess the similarity between LLMs. This measure is used to develop a model merging strategy called Top-k Greedy Merging with Model Kinship, which improves performance. The authors find that this criterion allows for effective and continuous model merging.
  • On The Planning Abilities of OpenAI’s o1 Models: Feasibility, Optimality, and Generalizability. The report highlights that the o1-preview model excels in self-evaluation and constraint-following. However, it also points out that these o1 models exhibit bottlenecks in decision-making and memory management, particularly in the context of spatial reasoning. Specifically, the models tend to generate redundant actions and face challenges in generalizing across spatially complex tasks.
  • Sabotage evaluations for frontier models. Anthropic has conducted several innovative evaluations to identify vulnerabilities and assess misalignment in large, powerful models.
  • Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities. A powerful open-source initiative aimed at replicating GPT-4o’s speech capabilities has emerged. This model was trained by aligning multiple modalities using pre-trained audio and speech encoders, allowing it to achieve advanced speech recognition and generation functionalities.
  • Automatically Interpreting Millions of Features in Large Language Models. Interpreting sparse autoencoder (SAE) features at scale is difficult. To address this, EleutherAI has introduced an automated interpretation pipeline designed to explain what each feature represents in context.
  • Mitigating Object Hallucination via Concentric Causal Attention. Object hallucination in vision-language models has been associated with Rotary Position Encoding (RoPE), which faces challenges in managing long-term dependencies between visual and textual inputs. To overcome this, the authors introduce Concentric Causal Attention (CCA), a novel positional alignment method that enhances the interaction between visual elements and instruction tokens.
  • Simplifying, stabilizing, and scaling continuous-time consistency models. OpenAI has published work focusing on enhancing consistency models, which operate in two steps rather than the 1,000 steps typically used in diffusion models. While these models still depend on distillation from an existing diffusion model, the research seeks to improve their performance and stability as they scale.
  • All you need are 32 tokens to represent video. Salesforce’s new approach introduces a novel video encoder that significantly reduces the number of tokens needed for accurate representation. While similar attempts in the past have seen limited success, the breakthrough appears to come from combining an explicit temporal encoder with a spatial encoder, enabling more efficient video processing.
  • CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing. CoPS is a novel algorithm that improves agents’ sequential reasoning by allowing them to share experiences across various tasks, enhancing their overall learning and adaptability.
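
The pair-construction step described in the Thinking LLMs item above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` class, the `<thought>` tag format, and the hard-coded judge scores are all hypothetical, and in practice the scores would come from a judge model that sees only the response.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    thought: str        # reasoning the model wrote before answering
    response: str       # the user-visible answer
    judge_score: float  # hypothetical score given by a judge to the response only


def full_output(candidate):
    # Hypothetical serialization: the training target keeps thought and response together.
    return f"<thought>{candidate.thought}</thought>\n{candidate.response}"


def build_preference_pair(instruction, candidates):
    """Return a DPO-style chosen/rejected pair, or None if no preference exists."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    best = max(candidates, key=lambda c: c.judge_score)
    worst = min(candidates, key=lambda c: c.judge_score)
    if best.judge_score == worst.judge_score:
        return None  # all candidates tied, so there is nothing to prefer
    # The judge never saw the thoughts, but the pair contains the full outputs,
    # so thoughts that lead to better responses are rewarded indirectly.
    return {
        "prompt": instruction,
        "chosen": full_output(best),
        "rejected": full_output(worst),
    }


if __name__ == "__main__":
    sampled = [
        Candidate("List the steps first.", "Answer A", 0.2),
        Candidate("Check edge cases, then answer.", "Answer B", 0.9),
        Candidate("Answer directly.", "Answer C", 0.5),
    ]
    print(build_preference_pair("Explain DPO briefly.", sampled))
```

The resulting dictionary follows the common prompt/chosen/rejected layout used for preference training; repeating this over many instructions and training rounds would correspond to the iterative process the summary mentions.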

News

Resources

  • CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos. This proposal introduces a new point-tracking model along with a semi-supervised training recipe that allows for the use of real videos without annotations during training. It generates pseudo-labels using readily available teacher models. This approach simplifies the architecture and training scheme, resulting in improved outcomes while utilizing 1000 times less data.
  • Meta’s latest open source releases. Meta has introduced a significant array of valuable research tools, including a speech-to-speech model, enhancements to SAM, and numerous other intriguing developments.
  • One-Step Diffusion via Shortcut Models. Shortcut models represent a new category of consistency models that can produce continuous signals with minimal inference steps.
  • Zero-Shot 3D Visual Grounding. VLM-Grounder is a novel approach to 3D visual grounding that addresses the shortcomings of conventional methods by leveraging vision-language models (VLMs) and 2D images.
  • DeepSeek’s natively Multimodal model. DeepSeek has developed and launched a powerful 1.3 billion parameter model capable of processing interleaved text and images for both generation and comprehension.
  • Meta Lingua. Meta has developed an easy-to-use and research-friendly codebase that can replicate Llama 2 7B within 24 hours.
  • Embedding an Ethical Mind: Aligning Text-to-Image Synthesis via Lightweight Value Optimization. LiVO (Lightweight Value Optimization) is an innovative approach designed to align Text-to-Image models with human values.
  • Easily hackable vision language model. A simple and performant VLM implementation in pure PyTorch.
  • Anthropic Quickstarts. Anthropic Quickstarts provides developers with projects like a customer support agent and a financial data analyst to help them swiftly utilize the Anthropic API. These projects leverage Claude for natural language processing and incorporate interactive data visualization. Each quickstart comes with setup instructions and encourages contributions from the community.
  • BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities. BiGR is an innovative image generation model that leverages compact binary latent codes to enhance both its generation and representation capabilities. It is the first model to integrate both generative and discriminative tasks within a unified framework. Key features of the model include binary tokenization and a distinctive entropy-ordered sampling technique, which contribute to its improved performance.
  • LongPiBench. LongPiBench is a benchmark created to evaluate positional biases in large language models (LLMs) when handling long contexts. It focuses on identifying biases that stem from the spacing between multiple relevant pieces of information, providing a targeted way to assess how well models handle long-range dependencies in text.
  • CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models. CLaMP 2 is a contrastive model designed for aligning music and text. It uses contrastive learning to match musical content with corresponding textual descriptions, which supports retrieving music from text queries in many languages.
  • bitnet.cpp. Microsoft has released an inference repository for its 1.58-bit models, which, when properly trained, are capable of running efficiently on consumer hardware. This development allows for more accessible deployment of advanced AI models without requiring high-end computational resources.
  • Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning. Montessori-Instruct is a novel framework designed to generate synthetic data that aligns with a student language model’s learning process. It adapts the data produced by the teacher model to fit the student’s learning preferences by leveraging local data influence and Direct Preference Optimization (DPO), optimizing the training experience for the student model.
  • Stable Diffusion 3.5. Stability AI has launched a new series of models featuring enhanced performance and faster speeds. These models come with built-in Diffusers support, allowing for immediate training capabilities.
  • 3D-GANTex: 3D Face Reconstruction with StyleGAN3-based Multi-View Images and 3DDFA based Mesh Generation. This paper presents a novel approach for estimating face texture and geometry from a single image by combining StyleGAN with 3D Morphable Models.
  • Moonshine. Moonshine is a family of speech-to-text models optimized for fast and accurate automatic speech recognition (ASR) on resource-constrained devices. It is well-suited to real-time, on-device applications like live transcription and voice command recognition.
  • PocketPal AI. PocketPal AI is a pocket-sized AI assistant powered by small language models (SLMs) that run directly on your phone. Designed for both iOS and Android, PocketPal AI lets you interact with various SLMs without the need for an internet connection.
  • Introducing the prompt() Function: Use the Power of LLMs with SQL! The costs of operating LLMs have dropped considerably, making it feasible to incorporate smaller models like GPT-4o-mini into SQL functions. MotherDuck’s PROMPT() function simplifies tasks such as text generation, summarization, and structured data extraction using OpenAI models. It provides flexibility in balancing cost and performance, while also supporting bulk operations with improved concurrency for more efficient processing.
  • Anthropic Computer Use Demo. A quick example of Claude 3.5 Sonnet’s new computer use capabilities.
  • Introducing SynthID Text. SynthID is a method for statistically watermarking generated text. It employs a pseudorandom function after the top-k and top-p sampling steps to embed a mark within the text. A probabilistic Bayesian approach is then used to detect whether the text has been watermarked, indicating it was produced by a language model.
  • Transformers.js v3: WebGPU Support, New Models & Tasks, and More…. Transformers.js is a JavaScript library for running machine learning models, and it now supports WebGPU, offering up to 100x faster performance than WASM in some cases. The latest version provides access to over 1,200 models, making it well-suited for edge and browser-based applications.
  • Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages. Pangea-7B is an open multilingual multimodal language model (MLLM) developed to address multilingual and multicultural challenges in visual understanding tasks. It is trained on PangeaIns, a comprehensive dataset consisting of 6 million instructions across 39 languages.
  • SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree. SAM2Long solves the “error accumulation” problem found in SAM 2’s memory design by implementing a training-free strategy for video object segmentation.
  • Agent.exe. A convenient wrapper for Anthropic’s computer use system simplifies its usage and execution, making it more user-friendly and accessible.
  • TALoS: Enhancing Semantic Scene Completion via Test-time Adaptation on the Line of Sight. TALoS is a method that enhances scene completion for autonomous vehicles by leveraging observations from different time points as supervision for making more accurate predictions.
  • OmniParser for Pure Vision Based GUI Agent. Screenshot parsing tool for models to use digital interfaces.
  • Introducing quantized Llama models with increased speed and a reduced memory footprint. Meta has optimized its 1B and 3B language models by applying quantization, achieving a 2–4x speed increase and reducing the model size by over 50% with minimal quality loss. This improvement is made possible by its quantization-aware training setup, allowing the models to adapt to lower precision effectively.
  • Joint Point Cloud Upsampling and Cleaning with Octree-based CNNs. An effective and straightforward approach for upsampling and refining point clouds utilizes a modified octree-based 3D U-Net, known as OUNet.
  • ExecuTorch. ExecuTorch supports on-device inference across mobile and edge devices, including wearables, embedded systems, and microcontrollers. It facilitates the efficient deployment of PyTorch models to edge environments and is compatible with various computing platforms, leveraging hardware capabilities like CPUs, NPUs, and DSPs. Comprehensive tutorials provide guidance on using ExecuTorch step-by-step.
  • Federated Transformer (FeT). The Federated Transformer (FeT) is a novel framework aimed at enhancing both performance and privacy in Vertical Federated Learning (VFL) across multiple collaborating parties.
  • ADEM-VL. ADEM-VL is an innovative vision-language model created to address hardware constraints found in current models.
  • Predicting Weight Loss with Machine Learning. The author utilized a straightforward feedforward DNN model to monitor and forecast weight loss on a ketogenic diet. This model effectively captured the non-linear weight loss trends, fit a predictive function to the data, and visualized calorie metrics. For added insights, the Harris-Benedict Equation was applied to compare estimated calorie needs with actual weight loss.
  • Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent. Google Gemini’s AI Studio can accurately extract numerical data from video screen recordings of emails. This process leverages the cost-effective Gemini 1.5 Flash model, resulting in minimal expense. This innovative “video scraping” technique provides a practical alternative to conventional data extraction methods.
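
As background for the bitnet.cpp item above, the "1.58-bit" label comes from ternary weights: a value restricted to {-1, 0, 1} carries log2(3) ≈ 1.58 bits. The sketch below shows one way to map floating-point weights to ternary values with a single shared scale (an absmean-style scheme). It is an assumption-laden illustration, not code from bitnet.cpp; the function name `absmean_ternary` is hypothetical, and real 1.58-bit models are trained with quantization in the loop rather than converted after the fact.

```python
import math


def absmean_ternary(weights, eps=1e-8):
    """Quantize a list of floats to {-1, 0, 1} plus one shared scale.

    Illustrative absmean-style scheme: divide by the mean absolute value,
    round to the nearest integer, then clip to the range [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    # Approximate reconstruction: each ternary value times the shared scale.
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.9, -0.05, -1.4, 0.3]
    q, s = absmean_ternary(weights)
    print(q)                       # [1, 0, -1, 0]
    print(round(math.log2(3), 3))  # 1.585 bits of information per ternary weight
```

Because every weight is -1, 0, or 1, matrix multiplication reduces largely to additions and subtractions, which is part of why such models can run efficiently on consumer hardware.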

Perspectives

  • Duolingo CEO Luis von Ahn wants you addicted to learning. Duolingo’s CEO, Luis von Ahn, talks about utilizing AI and gamification to improve language learning through features such as chat interactions with AI avatars and AI-generated video game-like adventures. The company has recently launched Duolingo Max, a premium subscription plan that provides AI-driven conversation practice, capitalizing on the lower costs and faster development associated with AI-generated content. Although AI has limitations in engagement, Duolingo prioritizes maintaining user motivation by balancing effective learning with gamified, entertaining experiences.
  • State of AI Report 2024. The 2024 State of AI Report notes that foundational models are increasingly being integrated into practical applications, with OpenAI leading the way in significant revenue generation. Key developments include the alignment of performance among leading research labs, a growing emphasis on planning and reasoning in large language model (LLM) research, and extending foundational models into multimodal domains. Despite facing regulatory hurdles, AI companies have seen a surge in valuation, though questions about their long-term sustainability remain.
  • How gen AI can help doctors and nurses ease their administrative workloads. Doctors and nurses spend nearly 28 hours a week on administrative tasks.
  • Elon Musk’s global political goals. Over the weekend, Musk pledged to give away $1m a day to registered voters in US battleground states who sign his PAC’s petition in support of the First and Second Amendments. He awarded the first prize, a novelty check the size of a kitchen island, at a Pennsylvania rally on Saturday and the second on Sunday in Pittsburgh. He says he’ll keep doing it until the election on 5 November. Experts say that the stunt is potentially illegal.
  • The Second $100B AI Company. This article forecasts that by 2034, emerging AI companies fueled by advancements in AI applications, particularly in consumer AI, will join OpenAI in exceeding a $100B market cap. While established tech giants currently dominate the AI infrastructure and model layers, the application layer offers significant potential for innovation and expansion, providing fertile ground for consumer AI to flourish. The prospects for large-scale success in consumer AI, especially in areas such as video creation, online shopping, and gaming, resemble the transformative impact seen in past tech revolutions like cloud computing and mobile technology.
  • Use Prolog to improve LLM’s reasoning. Current methods such as Chain-of-Thought (CoT) reasoning and the integration of programming languages like Prolog can enhance the reasoning abilities of LLMs, helping to mitigate the limitations of autoregressive models. The paper “Reliable Reasoning Beyond Natural Language” introduces a neurosymbolic approach that employs Prolog to translate requests into symbolic logic, enhancing both explainability and problem-solving capabilities. ProSLM, the model developed in this research, has shown substantial improvements in various datasets, highlighting the potential of combining Prolog with LLMs for tackling complex reasoning tasks.
  • AI watermarking must be watertight to be effective. Scientists are closing in on a tool that can reliably identify AI-generated text without affecting the user’s experience. But the technology’s robustness remains a challenge.
  • AI scans RNA ‘dark matter’ and uncovers 70,000 new viruses. Many are bizarre and live in salt lakes, hydrothermal vents, and other extreme environments.
  • Build an international AI ‘telescope’ to curb the power of big tech companies. Artificial intelligence (AI) technologies have reached a crucial juncture. The vast computing clusters required to train the most advanced generative AI systems are available only to a few large corporations.
  • Was the Nobel prize for physics? Yes — not that it matters. The award of the 2024 Nobel Prize in Physics to John Hopfield and Geoffrey Hinton for their groundbreaking research on artificial neural networks has caused consternation in some quarters. Surely this is computer science, not physics?
  • How I peer into the geometry behind computer vision. Minh Ha Quang’s work at a Japanese AI research center aims to understand how machines extract image data from the real world.
  • AI Dreams: Microsoft @ 50, Chapter 1. Microsoft’s research on AI robustness led the company to invest billions in AI infrastructure, driving breakthroughs with partners such as OpenAI. This investment has played a key role in Microsoft’s rapid growth in AI-powered products, highlighted by the success of GitHub Copilot. Despite facing competition and balancing sustainability goals, Microsoft remains committed to AI, with record capital expenditures on its AI and cloud infrastructure.
  • Future of Internet in the age of AI. In this article, Cloudflare CEO Matthew Prince explores AI’s influence on Internet infrastructure, emphasizing the need for AI-capable edge computing and local inference to minimize network latency. He underscores the significance of regionalization in AI services to address regulatory challenges and outlines Cloudflare’s strategy of developing a connectivity-focused network. Cloudflare’s goal is to enhance internet connectivity by making it faster, more secure, and more efficient, closely aligning its efforts with advancements in AI technologies.
  • How Jacob Collier helped shape the new MusicFX DJ. Grammy-winning musician Jacob Collier has partnered with Google DeepMind and Google Labs to develop MusicFX DJ, an AI-driven music tool. The tool’s interface has been revamped to foster creativity, making it easy for users to tap into a “flow state” of artistic inspiration. MusicFX DJ is now available, featuring user-friendly controls suitable for all experience levels.
  • The AI Investment Boom. The AI boom is spurring substantial US investments in data centers, computing infrastructure, and advanced hardware, with annual data center construction reaching an unprecedented $28.6 billion. This growth is driven by rising demand for high-powered computing resources essential for training and deploying sophisticated AI models. Although tech sector revenue is recovering, job growth is primarily centered on semiconductor manufacturing and infrastructure, shifting attention away from traditional programming roles.

Meme of the week

What do you think about it? Did any news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn, where I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

or you may be interested in one of my recent articles:


Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence