WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES
AI & ML news: Week 28 October — 3 November
Meta Introduces Spirit LM, Apple Launches Apple Intelligence on New iMac, Cohere’s Embed 3 Multimodal Search Model, Google’s Invisible Watermark for AI-Generated Text, ChatGPT can search the internet, and much more
The most interesting news, repositories, articles, and resources of the week
Check and star this repository where the news will be collected and indexed:
You will find the news first on GitHub. All the Weekly News stories are also collected here:
Research
- A Theoretical Understanding of Chain-of-Thought. reveals that incorporating both correct and incorrect reasoning paths in demonstrations enhances the accuracy of intermediate steps and Chain-of-Thought (CoT) processes. The new approach, Coherent CoT, substantially boosts performance across multiple benchmarks. Specifically, Gemini Pro shows a 6.60% improvement on the Tracking Shuffled Objects dataset (rising from 58.20% to 64.80%), while DeepSeek 67B achieves a 6.17% increase on the Penguins in a Table dataset (from 73.97% to 80.14%). A minimal sketch of this demonstration style appears after this list.
- LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering. improves RAG’s comprehension of long-context knowledge, incorporating global insights and factual specifics. It features a hybrid retriever, an LLM-enhanced information extractor, a Chain-of-Thought (CoT) guided filter, and an LLM-augmented generator. These core components empower the RAG system to extract global long-context information and accurately capture factual details. LongRAG demonstrates superior performance, surpassing long-context LLMs by 6.94%, advanced RAG by 6.16%, and Vanilla RAG by 17.25%. A stub showing how these components could fit together appears after this list.
- Evaluating feature steering: A case study in mitigating social biases. examines feature steering in LLMs through an experiment that adjusts various features to observe shifts in model outputs, specifically focusing on 29 features related to social biases to determine if feature steering can reduce these biases. Findings reveal that while feature steering can sometimes cause unintended effects, incorporating a neutrality feature effectively reduces social biases across 9 social dimensions without compromising text quality.
- Large Language Models Reflect the Ideology of Their Creators. reveals that LLMs display varied ideological perspectives, often mirroring the worldview of their creators. It observes consistent normative differences in responses when the same LLM operates in Chinese versus English and highlights normative disagreements between Western and non-Western LLMs regarding prominent figures in geopolitical conflicts.
- Scalable watermarking for identifying large language model outputs. introduces SynthID-Text, a text-watermarking approach designed to maintain text quality in LLM outputs, achieve high detection accuracy, and reduce latency. It embeds the watermark during sampling by adjusting the probability scores that drive the model’s word choices, and pairs this with speculative sampling to keep latency low. The authors evaluate the method’s feasibility and scalability by analyzing feedback on nearly 10 million Gemini responses.
- A Comparative Study on Reasoning Patterns of OpenAI’s o1 Model. finds that o1 outperformed other test-time computing methods across most datasets. The authors note that the primary reasoning patterns in o1 are divide and conquer and self-refinement, with the model adapting its reasoning strategy to specific tasks. For commonsense reasoning, o1 frequently employs context identification and focuses on constraints, while for math and coding tasks, it predominantly utilizes method reuse and divide-and-conquer approaches.
- Sparse Crosscoders for Cross-Layer Features and Model Diffing. Crosscoders are a variant of sparse autoencoders that learn a single feature dictionary shared across multiple layers (or even across models), deepening the understanding of language models’ internal mechanisms and enabling model diffing. A toy sketch of the idea appears after this list.
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs. Code-as-Intermediary Translation (CIT) is an innovative technique aimed at improving visual reasoning in multimodal language models (MLLMs) by leveraging code to convert chart visuals into textual descriptions.
- Probabilistic Language-Image Pre-Training. Probabilistic Language-Image Pre-training (ProLIP) is a vision-language model (VLM) designed to learn probabilistically from image-text pairs. Unlike traditional models that rely on strict one-to-one correspondence, ProLIP captures the complex many-to-many relationships inherent in real-world data.
- A faster, better way to train general-purpose robots. MIT researchers have developed Heterogeneous Pretrained Transformers (HPT), a novel model architecture inspired by large language models, designed to train adaptable robots by utilizing data from multiple domains and modalities.
- A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs. In this work, DeepMind demonstrates how a small language model can be used to provide soft supervision labels and identify informative or challenging data points for pretraining, significantly accelerating the pretraining process.
- NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction. The NeuroClips framework introduces advancements in reconstructing continuous videos from fMRI brain scans by decoding both high-level semantic information and fine-grained perceptual details.
- Machine-guided design of cell-type-targeting cis-regulatory elements. A generalizable framework to prospectively engineer cis-regulatory elements from massively parallel reporter assay models can be used to write fit-for-purpose regulatory code.
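As a companion to the Coherent CoT entry above, here is a minimal sketch of the demonstration style the paper studies: a few-shot prompt that pairs a correct reasoning path with a deliberately flawed one that is then corrected. The task text and helper function are made up for illustration; the paper's actual prompts and benchmarks differ.

```python
# Sketch of a Coherent-CoT-style prompt (hypothetical task text).
# Idea from the paper: demonstrations showing both correct and incorrect
# reasoning paths improve the accuracy of intermediate steps.

CORRECT_DEMO = """Q: Alice has 3 boxes with 4 pens each. How many pens?
Reasoning (correct): 3 boxes x 4 pens = 12 pens.
A: 12"""

INCORRECT_DEMO = """Q: A train travels 60 km in 1.5 hours. What is its speed?
Reasoning (incorrect): 60 + 1.5 = 61.5, so the speed is 61.5 km/h.
Why this is wrong: speed is distance divided by time, 60 / 1.5 = 40 km/h.
A: 40 km/h"""

def build_prompt(question: str) -> str:
    """Assemble a few-shot prompt that interleaves correct and incorrect paths."""
    return "\n\n".join([
        "Solve each problem step by step.",
        CORRECT_DEMO,
        INCORRECT_DEMO,
        f"Q: {question}\nReasoning:",
    ])

print(build_prompt("Track the positions of 3 shuffled objects after two swaps."))
```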
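The LongRAG entry above lists four components; the stub below is one plausible way they could fit together. Every function name here is a hypothetical stand-in, not the authors' code.

```python
# Rough sketch of a LongRAG-style flow; all components are hypothetical stubs.

def hybrid_retrieve(question, corpus, k=8):
    # Combine sparse (keyword) and dense (embedding) retrieval, then deduplicate.
    ...

def extract_global_info(question, chunks, llm):
    # LLM-enhanced extractor: condense the long context into global insights.
    ...

def cot_filter(question, chunks, llm):
    # CoT-guided filter: keep only chunks whose reasoning trace supports an answer.
    ...

def generate_answer(question, global_info, filtered_chunks, llm):
    # LLM-augmented generator: combine the global view with factual details.
    ...

def longrag_answer(question, corpus, llm):
    chunks = hybrid_retrieve(question, corpus)
    global_info = extract_global_info(question, chunks, llm)
    facts = cot_filter(question, chunks, llm)
    return generate_answer(question, global_info, facts, llm)
```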
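For the crosscoder entry above, a toy PyTorch sketch of the core idea: one shared sparse dictionary reconstructs activations at several layers at once, which is what makes cross-layer features and model diffing possible. The dimensions, ReLU encoder, and loss are illustrative choices, not Anthropic's exact setup.

```python
# Toy crosscoder: a shared sparse code reconstructs activations from n_layers layers.
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, d_model: int, n_layers: int, n_features: int):
        super().__init__()
        # One encoder reads the concatenated activations of all layers.
        self.encoder = nn.Linear(d_model * n_layers, n_features)
        # One decoder per layer writes a reconstruction for that layer.
        self.decoders = nn.ModuleList(
            [nn.Linear(n_features, d_model, bias=False) for _ in range(n_layers)]
        )

    def forward(self, acts):  # acts: (batch, n_layers, d_model)
        batch = acts.shape[0]
        latent = torch.relu(self.encoder(acts.reshape(batch, -1)))  # sparse codes
        recon = torch.stack([dec(latent) for dec in self.decoders], dim=1)
        return recon, latent

model = Crosscoder(d_model=256, n_layers=4, n_features=4096)
acts = torch.randn(8, 4, 256)
recon, latent = model(acts)
# Train on reconstruction error plus an L1 sparsity penalty on the codes.
loss = ((recon - acts) ** 2).mean() + 1e-3 * latent.abs().mean()
```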
News
- Keir Starmer says media firms should have control of the output used in AI. PM says content creators must be paid and vows to ensure technology ‘does not begin to chip away’ at press freedoms
- Waymo raises $5.6B. Waymo’s driverless taxi service has gained significant popularity. The company has secured additional funding to extend its reach beyond the current cities and millions of miles it already covers.
- Meta Introduces Spirit LM open source model that combines text and speech inputs/outputs. Just in time for Halloween 2024, Meta has unveiled Meta Spirit LM, the company’s first open-source multimodal language model capable of seamlessly integrating text and speech inputs and outputs.
- IBM debuts open source Granite 3.0 LLMs for enterprise AI. IBM is enhancing its enterprise AI suite with Granite 3.0 LLMs, prioritizing open-source options and optimized performance. Available across various platforms, these models have built-in safety features and are customized for diverse enterprise applications. IBM highlights the significance of true open-source licensing with Apache 2.0, enabling flexible adoption and fostering enterprise-driven innovation.
- Microsoft introduces ‘AI employees’ that can handle client queries. US company gives customers the ability to build their own virtual agents as well as releasing 10 off-the-shelf bots
- Microsoft Excel’s bloopers reel: 40 years of spreadsheet errors. As the software used by millions around the world celebrates its birthday, here are some of the low points
- Google Expands Voice Technology Support to 15 More African Languages. Google has expanded voice recognition support to include 15 more African languages across its platforms, such as Voice Search, Gboard talk-to-type, and Translate dictation. This enhancement enables an estimated 300 million additional Africans to engage with digital content in their native languages.
- Cohere releases a state-of-the-art multimodal AI search model. Cohere has announced that its Embed 3 AI model is now multimodal, allowing for rapid and precise search across essential enterprise image data sources such as graphs, charts, product catalogs, and design files. This enhancement makes Embed 3 the most broadly capable multimodal embedding model available today.
- Bringing developer choice to Copilot with Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview. You can now access models like Claude, Gemini, and o1, among others, through GitHub Copilot.
- Apple releases the first batch of Apple Intelligence features and debuts the new iMac. Apple introduced new AI features, branded as Apple Intelligence, on its latest devices, focusing on text processing and photo editing capabilities. The updated iMac now runs on the M4 chip, which includes a Neural Engine that delivers three times the AI performance of previous models. Upcoming AI updates aim to improve Siri’s capabilities and incorporate ChatGPT to handle more advanced queries.
- How Advex creates synthetic data to improve machine vision for manufacturers. Advex AI addresses data shortages in AI training by leveraging generative AI to create synthetic images tailored for computer vision systems.
- Coframe raises $9 million for websites that optimize themselves using AI. AI startup Coframe has raised $9.3 million in seed funding to further develop its platform, which leverages generative AI to optimize websites and deliver personalized marketing experiences.
- Google unveils invisible ‘watermark’ for AI-generated text. Real-world demonstration in chatbot responses could encourage other firms to label material produced by AI.
- Reddit shares soar after company turns first-ever profit. Monthly users rose by nearly half thanks to the AI translation feature, and deals for AI training with Google and OpenAI boosted revenue
- Google parent Alphabet sees double-digit growth as AI bets boost cloud business. Analysts expected 12% year-on-year revenue gains, but the company reports 15%, buoyed by performance in ads and cloud services
- EU events on curbing big tech ‘distorted’ by attendees with industry links. Campaigners say 21% of people at workshops did not disclose on their application relationships with firms being discussed
- Indonesia blocks Apple iPhone 16 sales over lack of investment. Marketing and sale of model prohibited after tech giant fails to meet rule that 40% of phones be made from local parts
- 25% of Smartphone Owners Don’t Want AI as Apple Intelligence Debuts. What’s a bigger priority? Longer battery life, according to a new CNET survey.
- Google preps ‘Jarvis’ AI agent that works in Chrome. Google’s Project Jarvis, powered by Gemini 2.0, aims to automate web-based tasks in Chrome by using AI agents capable of reasoning and planning.
- OpenAI’s Whisper transcription tool has hallucination issues, researchers say. OpenAI’s Whisper, an AI transcription tool, has been found to produce hallucinations — fabricated text not present in the original audio — even in medical settings. Despite OpenAI’s advisories against using Whisper in high-risk domains, over 30,000 medical professionals across 40 health systems have adopted it for transcribing patient consultations.
- Forerunner K2 humanoid robot can carry 33 lb in each dexterous hand. Kepler has introduced the Forerunner K2, a humanoid robot featuring advanced AI, upgraded hardware, and enhanced vision and navigation systems for improved real-time interaction.
- Introducing ChatGPT search. ChatGPT now offers an improved web search capability, providing quick, current answers with links to relevant sources — answers you’d typically seek through a search engine. This feature combines the ease of a natural language interface with access to real-time information, such as sports scores, news, stock prices, and more.
- Advancing embodied AI through progress in touch perception, dexterity, and human-robot interaction. This work features several components, including vision-based tactile sensing, innovative hardware touch sensors, and noteworthy strategic partnerships within robotics.
- Elon Musk’s xAI adds image understanding capabilities to Grok. This means that paid users on his social platform X, who have access to the AI chatbot, can upload an image and ask the AI questions about it.
- OpenAI CFO Says 75% of Its Revenue Comes From Paying Consumers. OpenAI generates the vast majority of its revenue from consumers who pay for its products, Chief Financial Officer Sarah Friar said, even as the artificial intelligence startup competes in a crowded market to sign up more corporate customers.
- Hello Patient. Hello Patient has emerged from stealth mode, securing a $6.3 million seed funding round led by 8VC. The company, founded by Alex Cohen, is based in Austin, Texas.
- Google plans to announce its next Gemini model soon. December is shaping up to be a month of dueling announcements from OpenAI and Google.
- Meta is reportedly developing a search engine for its chatbot. The company wants to decrease Meta AI’s reliance on Google and Microsoft.
- A mysterious new image generation model has appeared. A mysterious new image generation model is beating models from Midjourney, Black Forest Labs, and OpenAI on the crowdsourced Artificial Analysis benchmark. The model, which goes by the name “red_panda,” is around 40 Elo points ahead of the next-best-ranking model, Black Forest Labs’ Flux1.1 Pro, on Artificial Analysis’ text-to-image leaderboard.
Resources
- Agentic Information Retrieval. offers an overview of agentic information retrieval, driven by the abilities of LLM agents; explores various advanced applications of agentic information retrieval and addresses related challenges.
- Aya Expanse. introduces a suite of open-weight foundation models designed for multilingual proficiency, featuring 8B and 32B parameter models and one of the largest multilingual datasets to date, containing 513 million examples. The release also includes Aya-101, which is claimed to be the most extensive multilingual model, supporting 101 languages. Aya Expanse 32B surpasses the performance of Gemma 2 27B, Mixtral 8x22B, and Llama 3.1 70B, even though it is half the size of the latter.
- A Survey on Data Synthesis and Augmentation for Large Language Models. offers an in-depth overview of data generation techniques throughout the LLM lifecycle, covering topics such as data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and practical applications.
- granite-3.0-language-models. introduces a range of lightweight foundation models from 400 million to 8 billion parameters, optimized for tasks such as coding, retrieval-augmented generation (RAG), reasoning, and function calling. Designed for enterprise applications, these models support on-premise and on-device deployment, showing robust performance across academic benchmarks in language understanding, reasoning, coding, function calling, and safety.
- Pixtral-12B-Base-2409. Pixtral 12B base model weights have been released on Hugging Face.
- Arcade, a new AI product creation platform, designed this necklace. Arcade AI has developed a generative platform that allows users to create distinctive, high-quality jewelry items simply from text prompts — and the exciting part is that you can purchase the designs you generate.
- Retrieval-Augmented Diffusion Models for Time Series Forecasting. The Retrieval-Augmented Time Series Diffusion model (RATD) introduces a retrieval and guidance mechanism to enhance stability and performance in time series diffusion models. RATD operates in two steps: first, it retrieves relevant historical data from a database, and then uses this information as a reference to guide the denoising phase. A bare-bones sketch of these two steps appears after this list.
- NotebookLlama: An Open Source version of NotebookLM. Meta has published a quick start guide to help users build a simplified version of Google’s popular NotebookLM system.
- How I Studied LLMs in Two Weeks: A Comprehensive Roadmap. This article presents a 14-day roadmap for mastering LLM fundamentals, covering key topics such as self-attention, hallucinations, and advanced methods like Mixture of Experts. It offers resources for building an LLM from the ground up, alongside curated literature and online materials, all organized within a GitHub repository. Emphasizing a tailored learning experience, the article underscores the importance of foundational skills in math, programming, and deep learning.
- Marly. Marly is an open-source data processor that enables agents to query unstructured data using JSON, streamlining data interaction and retrieval.
- LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias. It was previously believed that novel view synthesis depended heavily on strong 3D inductive biases. This study demonstrates that, with scale and a minimal inductive bias, it’s possible to significantly surpass these previously assumed limitations.
- Continuous Speech Synthesis using per-token Latent Diffusion. Autoregressive models continue to excel in many applications, yet recent advancements with diffusion heads in image generation have led to the concept of continuous autoregressive diffusion. This research broadens the scope of per-token diffusion to accommodate variable-length outputs.
- CDChat: A Large Multimodal Model for Remote Sensing Change Description. This paper presents a change description instruction dataset aimed at fine-tuning large multimodal models (LMMs) to enhance change detection in remote sensing.
- IC-Light V2 (Flux-based IC-Light models). IC-Light currently offers the most effective method for relighting images with a pre-trained text-to-image backbone. This discussion marks the initial steps toward expanding that capability to the robust Flux models.
- The Scene Language: Representing Scenes with Programs, Words, and Embeddings. Creating 3D scenes from scratch presents significant challenges, including data limitations. This research introduces a programming-like language for describing 3D scenes and demonstrates that Claude Sonnet can produce highly realistic scenes even without specific training for this task.
- 3D Semantic Segmentation. FtD++ is a cross-modal learning approach designed to enhance unsupervised domain adaptation in 3D semantic segmentation tasks.
- Open source replication of crosscoder on Gemma 2B. Anthropic recently published two studies showcasing its novel interpretability method. This post provides an open replication of the crosscoder on the Gemma 2B model.
- Awesome-Graph-OOD-Learning. This repository lists papers on graph out-of-distribution learning, covering three primary scenarios: graph OOD generalization, training-time graph OOD adaptation, and test-time graph OOD adaptation.
- OpenWebVoyager: Building Multimodal Web Agents. OpenWebVoyager offers tools, datasets, and models designed to build multimodal web agents that can navigate and learn from real-world web interactions.
- Automated Colorization for Animation. Researchers have introduced an innovative inclusion-matching technique that overcomes challenges in automated colorization, particularly for animations where occlusions and wrinkles complicate traditional segment matching.
- Lofi Music Dataset. A dataset containing music clips paired with detailed text descriptions, generated by a music creation model.
- Learning to Handle Complex Constraints for Vehicle Routing Problems. Researchers have developed a Proactive Infeasibility Prevention (PIP) framework designed to enhance neural network performance on Vehicle Routing Problems (VRPs) that involve challenging constraints.
- Unleashing the Power of AI on Mobile: LLM Inference for Llama 3.2 Quantized Models with ExecuTorch and KleidiAI. PyTorch has made significant strides with ExecuTorch, a tool that enables AI model deployment at the edge, greatly enhancing the performance and efficiency of various end systems.
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution. CompassJudger-1 is the first open-source, comprehensive judge model created to enhance the evaluation process for large language models (LLMs).
- MINT-1T. MINT-1T, a vast open-source multimodal dataset, has been released with one trillion text tokens and 3.4 billion images, incorporating diverse content from HTML, PDFs, and ArXiv papers. This dataset, roughly ten times larger than previous collections, is intended to accelerate advancements in large-scale multimodal machine learning research.
- LARP: Tokenizing Videos 🎬 with a Learned Autoregressive Generative Prior 🚀. LARP is a novel video tokenizer designed to enhance video generation in autoregressive (AR) models by prioritizing global visual features over individual patch-based details.
- OpenAI’s new hallucination benchmark. OpenAI has released the SimpleQA benchmark, which measures models’ abilities around simple factual questions.
- ThunderKittens. ThunderKittens is a framework designed for creating highly efficient GPU kernels. It leverages the principle that GPUs are optimized for working with compact 16x16 data tiles, which keeps kernels both performant and simple to write. With this approach, kernels that run up to 40% faster can be written in only a few hundred lines of code.
- Skinned Motion Retargeting with Dense Geometric Interaction Perception. MeshRet has developed an innovative method for enhancing motion retargeting for 3D characters, prioritizing the preservation of body geometry interactions from the outset.
- Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance. Researchers have improved Masked Generative Models (MGMs) by introducing a self-guidance sampling technique, which enhances image generation quality without compromising diversity.
- Speeding Up Transformers with Token Merging. This project presents PiToMe, an algorithm that compresses Vision Transformers by gradually merging tokens after each layer, thereby decreasing the number of tokens processed. A simplified merging step appears after this list.
- PF3plat : Pose-Free Feed-Forward 3D Gaussian Splatting. PF3plat addresses the challenge of 3D reconstruction and novel view synthesis from RGB images without requiring additional data.
- Fine-tuning LLMs to 1.58bit: extreme quantization made easy. BitNet, created by Microsoft Research, is a transformer architecture that lowers the computational and memory demands of large language models by employing ternary weights (-1, 0, 1), equating to 1.58 bits per parameter. The architecture was originally designed for training models from scratch, but this post shows that existing models can also be fine-tuned to the low-precision format while retaining high performance on downstream tasks. The technique greatly reduces energy consumption and enhances inference speed through specialized kernels for efficient matrix multiplication. A sketch of the ternary quantization appears after this list.
- SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Recognition. SELECT is the inaugural extensive benchmark designed to evaluate various data curation methods in image classification. ImageNet++ is a newly developed dataset that augments ImageNet-1K by incorporating five additional training data variations, each curated through distinct techniques.
- ODRL: A Benchmark for Off-Dynamics Reinforcement Learning. ODRL is the first standardized benchmark designed to assess reinforcement learning methods in environments with differing dynamics.
- Text-to-Image Model to Generate Memes. Researchers have created an innovative adapter method for text-to-image models, enabling them to tackle complex tasks such as meme video generation while preserving the base model’s strong generalization abilities.
- Anomaly Classification in Industry. AnomalyNCD is a multi-class anomaly classification framework intended to enhance traditional anomaly detection techniques in industrial environments.
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models. Byte-level language models represent a move toward a token-free future, but the challenge of sequence length remains significant. Dynamically merging tokens can help increase the number of tokens within the context.
- BART vectoriZed. A new GPU-enabled implementation of Bayesian Additive Regression Trees (BART) significantly accelerates processing speed, making it up to 200 times faster than conventional CPU-based versions.
- Huge new Diffusers release. The Hugging Face Diffusers package now includes new pipelines like Flux, Stable Audio, Kolors, CogVideoX, Latte, and others, alongside new methods such as FreeNoise and SparseCtrl, plus various refactors.
- 4 experiments with voice AI models to help you explore culture. Google’s voice AI models allow users to engage with culture in innovative ways. Projects like Talking Tours provide AI-guided virtual tours, Mice in the Museum offers art narration, and Lip Sync animates lips to discuss cultural topics. These entertaining tools offer new perspectives on art and design.
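Picking up the RATD entry above, the sketch below shows its two steps in the simplest possible form: retrieve the historical window closest to the query, then hand it to the denoiser as extra conditioning at every reverse step. `denoiser` is a hypothetical stand-in for the trained diffusion model, and the loop omits the noise schedule.

```python
# Two-step retrieval-then-guided-denoising sketch (not the RATD authors' code).
import numpy as np

def retrieve_reference(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Return the historical window closest to the query (L2 distance)."""
    dists = np.linalg.norm(database - query, axis=1)
    return database[dists.argmin()]

def forecast(query, database, denoiser, steps=50):
    reference = retrieve_reference(query, database)
    x = np.random.randn(*query.shape)            # start from pure noise
    for t in reversed(range(steps)):
        # The retrieved reference guides every reverse step alongside the query.
        x = denoiser(x, t, condition=query, reference=reference)
    return x

history = np.random.randn(1000, 24)              # database of past 24-step windows
query = np.random.randn(24)                      # recent observations
ref = retrieve_reference(query, history)
```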
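For the token-merging entry above, here is a deliberately simplified merging step: after a layer, the most similar token pairs are averaged so later layers process fewer tokens. PiToMe's actual matching rule and per-layer schedule are more sophisticated; this only shows the mechanism.

```python
# Naive token merging: average the r most similar token pairs.
import torch

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """x: (seq, dim). Merge r pairs of most-similar tokens by averaging."""
    sim = torch.nn.functional.cosine_similarity(
        x.unsqueeze(1), x.unsqueeze(0), dim=-1
    ).fill_diagonal_(-1.0)
    merged = x.clone()
    keep = torch.ones(x.shape[0], dtype=torch.bool)
    for _ in range(r):
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        merged[i] = (merged[i] + merged[j]) / 2   # absorb token j into token i
        keep[j] = False
        sim[j, :] = -1.0                          # token j can no longer be matched
        sim[:, j] = -1.0
    return merged[keep]

tokens = torch.randn(16, 64)
print(merge_tokens(tokens, r=4).shape)            # 12 tokens remain
```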
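And for the 1.58-bit entry above, a short sketch of the ternary "absmean" weight quantization that the BitNet work describes. This is an illustrative reimplementation under that description, not the Microsoft or Hugging Face code.

```python
# Ternary absmean quantization sketch: weights in {-1, 0, 1} plus one scale.
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Map a float weight matrix to ternary values and a per-tensor scale."""
    scale = w.abs().mean().clamp(min=eps)       # absmean scale
    w_q = (w / scale).round().clamp(-1, 1)      # ternary weights
    return w_q, scale

w = torch.randn(4, 4)
w_q, scale = quantize_ternary(w)
w_dequant = w_q * scale                         # approximation used at inference
print(w_q)
```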
Perspectives
- ByteDance intern fired for planting malicious code in AI models. After rumors swirled that TikTok owner ByteDance had lost tens of millions after an intern sabotaged its AI models, ByteDance issued a statement this weekend hoping to silence all the social media chatter in China.
- Thinking Like an AI. Large language models (LLMs) operate as advanced autocomplete systems, generating the next token based on a combination of their training data and current input. Small variations in input can influence predictions, resulting in different responses to the same question. Gaining insight into token prediction, training data context, and memory constraints can enhance effective AI usage. A toy sampling example appears after this list.
- An Interview with Salesforce CEO Marc Benioff about AI Abundance. Salesforce CEO Marc Benioff recently spoke about the company’s new AI initiative, Agentforce, showcasing its potential to transform enterprise applications and customer interactions. He contrasted Salesforce’s approach with Microsoft’s Copilot, describing Salesforce’s solution as more cohesive and impactful, thanks to its strong platform and data infrastructure. During the interview, Benioff stressed the significance of AI-driven “agentic” layers designed to boost customer service and improve operational efficiency across various industries.
- How GPU Access Helps Startups Be Agile. Andreessen Horowitz’s Oxygen program tackles GPU shortages by offering startups in its portfolio more accessible and flexible GPU resources, allowing them to bypass price surges and supply limitations. This initiative enables AI startups to concentrate on product development without the pressure of long-term capital expenditure, emphasizing the need for equitable access to critical resources in the competitive AI field.
- The Mask Comes Off: At What Price? OpenAI is approaching its shift to a public benefit corporation, a move that could impact its investor dynamics and collaboration with Microsoft. This transition brings up questions around control and valuation, particularly concerning the nonprofit’s stake, which could be substantial given OpenAI’s role in advancing AGI. The company’s future profitability and strategic course are closely tied to the safe development of AGI, a pursuit with enormous potential value.
- What’s so special about the human brain? Torrents of data from cell atlases, brain organoids, and other methods are finally delivering answers to an age-old question.
- ‘Educational’ apps are worth billions. We need to make sure they work. Partnerships between developers and researchers could help to improve the quality of educational apps and other technologies.
- The huge protein database that spawned AlphaFold and biology’s AI revolution. Pioneering crystallographer Helen Berman helped to set up the massive collection of protein structures that underpins the Nobel-prize-winning tool’s success.
- Extreme fire seasons are looming — science can help us adapt. Not all wildfires can be averted, but data, models, and collaborations can help to chart a course to a fire-resilient future.
- AI-designed DNA sequences regulate cell-type-specific gene expression. Researchers have used artificial intelligence models to create regulatory DNA sequences that drive gene expression in specific cell types. Such synthetic sequences could be used to target gene therapies to particular cell populations.
- Pushing the frontiers of audio generation. DeepMind has shared additional details about the audio generation models behind NotebookLM.
- Evaluating feature steering: A case study in mitigating social biases. This study investigates the use of feature steering in AI models to adjust outputs in an interpretable way. It identifies a “steering sweet spot,” where modifications do not compromise performance. Results demonstrate that steering can adjust social biases within specific areas but may also produce unintended effects outside those targets. Continued research is necessary to enhance feature steering, aiming for safer and more dependable AI outcomes. A toy illustration of activation steering appears after this list.
- How we saved hundreds of engineering hours by writing tests with LLMs. Assembled leverages LLMs to speed up and enhance software testing, allowing tests to be generated in minutes rather than hours. This approach boosts engineering productivity, saving time and enabling a stronger focus on feature development. LLMs create thorough and precise tests that uphold code quality and sustain development speed.
- How to train LLM as a judge to drive business value. “LLM as a judge” is an approach for leveraging an existing language model to rank and score natural language. This post provides guidelines for effectively using this method to process or assess data. A minimal judging loop appears after this list.
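To make the “advanced autocomplete” framing in Thinking Like an AI concrete, here is a tiny sampling example: the model assigns a score to every token, and sampling those scores with some temperature is exactly why the same prompt can produce different answers. The four-token vocabulary is obviously a toy.

```python
# Next-token sampling with temperature over a toy vocabulary.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8) -> int:
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.5, 0.3, -1.0])              # scores for 4 candidate tokens
print([sample_next_token(logits) for _ in range(5)])  # varies from run to run
```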
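For the feature-steering study above, a toy illustration of the underlying mechanics: add a scaled feature direction (for example, one learned by a sparse autoencoder) to a layer's activations via a forward hook. The model path, layer index, and steering vector are placeholders; Anthropic steers features from its own dictionary-learning setup, which this does not reproduce.

```python
# Activation steering sketch: nudge a layer's hidden states along one direction.
import torch

def add_steering_hook(layer, feature_direction: torch.Tensor, strength: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * feature_direction      # nudge every position
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage on a GPT-2-style model:
# handle = add_steering_hook(model.transformer.h[12], neutrality_vec, strength=4.0)
# ... run generation, then handle.remove()
```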
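Finally, a minimal LLM-as-a-judge loop in the spirit of the last post above. The rubric, score scale, and `call_llm` client are illustrative assumptions, not the post's actual prompts.

```python
# LLM-as-a-judge sketch: ask a model to score a reply and return structured JSON.
import json

JUDGE_PROMPT = """You are grading a customer-support reply.
Question: {question}
Reply: {reply}
Score the reply from 1 (unusable) to 5 (excellent) for accuracy and tone.
Answer with JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def judge(question: str, reply: str, call_llm) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, reply=reply))
    return json.loads(raw)   # e.g. {"score": 4, "reason": "Accurate but terse."}

# Hypothetical usage:
# scores = [judge(q, r, call_llm) for q, r in eval_pairs]
```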
Meme of the week
What do you think? Did any of this news catch your attention? Let me know in the comments.
If you have found this interesting:
You can look for my other articles, and you can also connect with or reach me on LinkedIn; I am open to collaborations and projects. Check this repository, which contains weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.
or you may be interested in one of my recent articles: