WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week

Spotify’s AI Innovations in Music, Podcasts, and Recommendations. AI Model Identifies Brain Tumors in 10 Seconds, US Justice Department Pushes Google to Sell Chrome, Breakthrough Robot Performs Surgeries After Watching Videos, and much more

Salvatore Raieli
21 min read · Just now
Photo by Ian Maina on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field (49 stories)

Research

  • Artificial Intelligence, Scientific Discovery, and Product Innovation. indicates that leading scientists use their expertise to focus on the most promising AI-generated suggestions, while others often expend considerable resources on false positives; shows that adopting AI technology for materials discovery boosts productivity, resulting in 44% more materials discovered, a 39% increase in patent filings, and 17% greater product innovation; notes that these improvements come with drawbacks, as 82% of scientists experienced lower job satisfaction, citing reduced creativity and underutilization of their skills.
  • Scaling Laws for Precision. presents “precision-aware” scaling laws that forecast how both training and inference precision impact LLM performance; key insights include: 1) post-training quantization becomes increasingly detrimental as models are trained on larger datasets, to the point where more pretraining may harm performance, 2) training with lower precision necessitates a larger model size to sustain performance levels, and 3) when optimizing model size, data, and precision together, the ideal training precision is around 7–8 bits, independent of compute availability; further notes that with fixed model size, the optimal precision for compute increases roughly logarithmically with data size; the authors confirm their predictions on models up to 1.7B parameters trained on up to 26B tokens, demonstrating that both very high (16-bit) and very low (under 4-bit) training precisions may be inefficient.
  • Sequence modeling and design from molecular to genome-scale with Evo. a 7B parameter AI model built to comprehend and generate DNA sequences across various biological scales; trained on 2.7 million prokaryotic and phage genomes, it can handle sequences up to 131 kilobases long while preserving single-nucleotide precision, allowing it to capture both molecular interactions and genome-wide patterns; Evo excels at predicting and generating functional DNA, RNA, and protein sequences, achieving the first experimentally validated AI-generated CRISPR-Cas complexes and transposable systems.
  • The Surprising Effectiveness of Test-Time Training for Abstract Reasoning. examines test-time training (TTT), where model parameters are temporarily updated during inference, to enhance an LLM’s abstract reasoning on the ARC benchmark; highlights three essential components: initial fine-tuning on related tasks, using auxiliary task formats and augmentations, and per-instance training; TTT yields substantial performance gains, with accuracy improvements of up to 6x over base fine-tuned models; applying TTT to an 8B LLM results in 53% accuracy on ARC’s public validation set, a nearly 25% increase over the previous state-of-the-art for neural approaches; combining their method with program generation techniques achieves a new public validation accuracy of 61.9%, on par with average human performance; the results indicate that explicit symbolic search is not the sole route to better abstract reasoning in LLMs, and that test-time training on few-shot examples can be highly effective (a minimal code sketch of the idea follows this list).
  • Toward Optimal Search and Retrieval for RAG. investigates the impact of retrieval on performance in RAG pipelines for QA tasks; performs experiments using BGE-base and ColBERT retrievers with LLaMA and Mistral, showing that incorporating more gold (relevant) documents enhances QA accuracy; observes that using approximate nearest neighbor search with lower recall has minimal performance impact while potentially boosting speed and memory efficiency; notes that introducing noisy or irrelevant documents consistently harms performance, refuting prior research claims; concludes that optimizing the retrieval of gold documents is essential for RAG effectiveness and that lower search accuracy can be a practical strategy (a toy retrieval comparison follows this list).
  • Rapid Response: Mitigating LLM Jailbreaks with a Few Examples. presents a novel approach for defending LLMs against jailbreak attacks, emphasizing the rapid adaptation of defenses upon detecting new attacks rather than striving for perfect initial adversarial robustness; using a new benchmark, the top-performing method — fine-tuning an input classifier — reduced attack success rates by over 240x for known attack types and 15x for new variations after observing just one example of each attack strategy; shows that swiftly responding to emerging jailbreaks can be an effective alternative to traditional static defenses.
  • Solving the Travelling Salesman Problem. This study highlights the often underestimated value of the “heatmap + Monte Carlo Tree Search (MCTS)” method, demonstrating that well-tuned, straightforward heatmaps can surpass more sophisticated models.
  • Graph-based AI model maps the future of innovation. MIT researchers created an AI model that employs generative knowledge extraction and graph reasoning to detect intricate patterns across domains such as biology and music. The model efficiently generates knowledge maps from scientific literature, uncovering connections and proposing novel materials inspired by art. This method boosts interdisciplinary research by uncovering hidden insights and fostering innovative concepts for material design.
  • Teaching Video Models to Understand Time Like a Story. This paper presents NumPro, an innovative approach designed to assist Video Large Language Models in managing Video Temporal Grounding tasks.
  • Generative World Explorer. The Generative World Explorer (Genex) is a system that simulates exploration of 3D spaces by generating imagined observations and leveraging those simulations to improve planning. It employs an ST-VAE and a diffusion pass for its imagination process, leading to better planning outcomes.
  • Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering. proposes verifier engineering, a post-training paradigm for foundation models built around three stages: searching for candidate responses, verifying them with automated verifiers, and feeding the verification signal back to improve the model.
  • OneNet: A Channel-Wise 1D Convolutional U-Net. OneNet is a channel-wise 1D convolutional U-Net optimized for efficient image segmentation, making it well-suited for edge devices.
  • AI’s math problem: FrontierMath benchmark shows how far technology still has to go. Artificial intelligence systems may be good at generating text, recognizing images, and even solving basic math problems — but when it comes to advanced mathematical reasoning, they are hitting a wall. A groundbreaking new benchmark, FrontierMath, exposes just how far today’s AI is from mastering the complexities of higher mathematics.
  • Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus. Researchers have proposed Additional Logic Training to enhance reasoning in LLMs, focusing on teaching them to manage complex deductions involving varied rules and distractions.
  • Solving Cold Starts in Adaptive Testing. The “cold start” issue in adaptive testing arises when initial questions fail to align with examinees’ abilities. Researchers have addressed this with the Diffusion Cognitive States Transfer Framework (DCSR), which employs diffusion models to utilize prior learning data across domains.
  • samurai. Tracking a consistent object over an extended period is a challenging task. This work enhances SAM 2 by integrating motion-aware memory banks, ensuring consistency over time and through occlusions. It stands out as one of the most effective visual tracking systems developed so far.
  • Compress and Reconstruct Images. PCNet is a new compact network for image-compressed sensing. It reduces sampling costs while delivering high-quality reconstructions.
  • LMM-driven Semantic Image-Text Coding for Ultra Low-bitrate Learned Image Compression. Large multi-modal models can generate captions and compress images simultaneously within a single system.
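
As a concrete picture of the test-time training recipe summarized above, here is a minimal, hypothetical PyTorch sketch: it fine-tunes a temporary copy of a model on augmented versions of one task’s few-shot demonstrations, predicts on the test input, and then discards the adapted weights. The names model, demos, augment, and loss_fn are placeholders, not the paper’s actual code.

```python
import copy
import torch

def test_time_train_and_predict(model, demos, test_input,
                                augment, loss_fn, steps=20, lr=1e-4):
    """Minimal test-time training (TTT) sketch.

    demos: list of (input, target) few-shot examples for ONE task instance.
    augment: function producing an augmented (input, target) pair from the
             demos (e.g. permutations / re-colorings for ARC-style tasks).
    The adapted weights are thrown away after prediction, so the base
    model is untouched for the next task.
    """
    adapted = copy.deepcopy(model)          # per-instance copy of the model
    adapted.train()
    opt = torch.optim.AdamW(adapted.parameters(), lr=lr)

    for _ in range(steps):
        x, y = augment(demos)               # sample an augmented training pair
        opt.zero_grad()
        loss = loss_fn(adapted(x), y)
        loss.backward()
        opt.step()

    adapted.eval()
    with torch.no_grad():
        prediction = adapted(test_input)
    return prediction                        # base `model` is left unchanged
```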
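
The retrieval finding in the RAG item above (approximate nearest-neighbor search with lower recall barely hurts accuracy) can be illustrated with FAISS. The toy comparison below measures the recall of an IVF approximate index against exact search on random embeddings; the dimensions, index type, and nprobe value are illustrative assumptions, not the paper’s configuration.

```python
import numpy as np
import faiss

d, n, k = 384, 20_000, 5                       # embedding dim, corpus size, top-k
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
queries = rng.standard_normal((100, d)).astype("float32")

# Exact search: full scan, perfect recall.
exact = faiss.IndexFlatL2(d)
exact.add(corpus)
_, gold = exact.search(queries, k)

# Approximate search: inverted-file index, recall tuned via nprobe.
quantizer = faiss.IndexFlatL2(d)
approx = faiss.IndexIVFFlat(quantizer, d, 256)
approx.train(corpus)
approx.add(corpus)
approx.nprobe = 8                              # more probes -> higher recall, slower
_, found = approx.search(queries, k)

recall = np.mean([len(set(g) & set(f)) / k for g, f in zip(gold, found)])
print(f"recall@{k} of the approximate index: {recall:.2f}")
```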

News

Resources

  • OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models. introduces OpenCoder, a completely open-source LLM tailored for code generation and comprehension; the authors highlight key elements for creating top-performing code LLMs: (1) rigorous data cleaning using code-specific heuristic rules for deduplication, (2) effective recall of related text corpus for code context, and (3) high-quality synthetic data utilized in both annealing and supervised fine-tuning phases; OpenCoder outperforms previous open models at the 6B+ parameter level and provides not only the model weights but also the full training pipeline, datasets, and protocols to support reproducible research.
  • A Taxonomy of AgentOps for Enabling Observability of Foundation Model-based Agents. examines AgentOps platforms and tools, emphasizing the necessity of robust observability and traceability features to maintain reliability in foundation model-based autonomous agent systems throughout their development and production lifecycle.
  • Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models. presents Mixture-of-Transformers (MoT), a novel sparse multi-modal transformer architecture that achieves performance comparable to traditional models while using nearly half the computational resources for text and image tasks; MoT matches the performance of a dense baseline while utilizing only 55.8% of the FLOPs.
  • HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems. introduces a novel approach that uses HTML instead of plain text for constructing RAG systems; the core insight is that preserving HTML structure retains richer semantic and structural information compared to plain text conversion, which often loses critical formatting like headings, tables, and semantic tags; to handle the challenge of long HTML documents exceeding LLM context windows, the authors design a two-step pruning method: first, cleaning unnecessary HTML elements to cut length by 94%, and then applying a block-tree-based pruning approach that integrates embedding-based and generative pruning to retain essential content; experiments on six QA datasets show that HtmlRAG surpasses existing plain-text methods, confirming the benefits of maintaining HTML structure in RAG systems (a simplified sketch of the first cleaning step follows this list).
  • LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models. NVIDIA has developed LLaMA-Mesh, a method that fine-tunes the LLaMA language model to generate 3D meshes from text prompts. By training LLaMA on a curated dataset of 3D dialogues, LLaMA-Mesh enables the model to represent and generate 3D mesh data in plain text format, integrating 3D mesh generation with language understanding.
  • Your Fixed Watermark is Fragile: Towards Semantic-Aware Watermark for EaaS Copyright Protection. Researchers have introduced the Semantic Perturbation Attack (SPA) to exploit vulnerabilities in current watermarking schemes for Embedding-as-a-Service (EaaS) systems. Traditional watermarking methods often inject fixed signals into embeddings, regardless of the input’s semantics, making them susceptible to adaptive attacks. SPA leverages semantic perturbations to identify and bypass these static watermark signals, effectively compromising watermark verification.
  • Don’t Look Twice: Faster Video Transformers with Run-Length Tokenization. By dropping video tokens that remain unchanged across frames and recording their run lengths, inference can be significantly accelerated without sacrificing performance or requiring extra training (a simplified sketch follows this list).
  • Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement. A technique for generating images with improved control over user-chosen regions.
  • Accurate Image Matching. MOP+MiHo+NCC is a non-deep, modular method for improving image matches using a combination of three techniques. Multiple Overlapping Planes (MOP) clusters inlier matches and uses RANSAC to remove outliers. Middle Homography (MiHo) minimizes distortion during planar reprojection. Normalized Cross Correlation (NCC) refines keypoint positions after transformation (a minimal NCC sketch follows this list).
  • The Beginner’s Guide to Visual Prompt Injections. Visual prompt injections present security threats to LLMs like GPT-4V by embedding harmful instructions within images, potentially causing unintended model behavior. These vulnerabilities can manipulate outputs, for instance, by causing the model to overlook certain individuals in images or misrepresent described contexts. With the increasing adoption of generative AI, companies must implement strong security measures to address these risks.
  • PyGen: Turning Your Ideas into Python Package. PyGen simplifies the process of turning your ideas into software, making coding more accessible and enjoyable. Leveraging advanced language models, PyGen acts like a tech-savvy assistant, transforming abstract concepts into complete Python tools, including testing and documentation.
  • UltraVox Audio Language Models. A suite of open-weight models that can take text and audio as input modalities.
  • Pixtral large. Pixtral Large is a 124B open-weight multimodal model built upon Mistral Large 2. As the second model in this multimodal series, it showcases advanced image comprehension, capable of interpreting documents, charts, and natural images, while retaining the top-tier text understanding of Mistral Large 2.
  • LLaVA-o1: Let Vision Language Models Reason Step-by-Step. Although this isn’t an exact replication of the training process used for o1, it remains a robust VLM trained on reasoning traces.
  • CLIP for Semantic Segmentation. Although CLIP has excelled in open-vocabulary tasks, it faces challenges in semantic segmentation due to noisy features and limited resolution. Trident tackles the resolution problem with a training-free framework, integrating CLIP and DINO features from sub-images and employing SAM’s encoder for global feature aggregation.
  • Confidence-aware Denoised Fine-tuning of Off-the-shelf Models for Certified Robustness. This work focuses on improving the certified robustness of smoothed classifiers by fine-tuning off-the-shelf models.
  • ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning. This paper from Google demonstrates a method for altering the camera viewpoint of an existing video.
  • Evaluating-Constitutions. Code to assist in evaluating constitutions based on human feedback.
  • StableV2V: Stabilizing Shape Consistency in Video-to-Video Editing. StableV2V is a novel video editing framework that maintains shape consistency across frames, even when user prompts require significant transformations. This method ensures smooth and precise modifications throughout the video, preserving structural integrity.
  • CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset. CCExpert is an AI model developed to describe changes in images using natural language. It can identify what has changed, where the change occurred, and how it happened.
  • SAM Decoding: Speculative Decoding via Suffix Automaton. SAM-Decoding offers a faster method for text generation in LLMs by utilizing a suffix automaton to create drafts efficiently and accurately.
  • That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design. DeepMind has issued a robust defense of its AlphaChip project, which has faced criticism from some academic circles despite widespread industry adoption. In a recent paper titled “That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design,” DeepMind addresses these critiques, emphasizing AlphaChip’s significant contributions to chip design. The paper highlights AlphaChip’s role in creating superhuman chip layouts for Google’s Tensor Processing Units (TPUs) and its influence on the hardware used globally.
  • PoM: Efficient Image and Video Generation with the Polynomial Mixer. Polynomial Mixer offers a faster and more memory-efficient alternative to Multi-Head Attention (MHA) in diffusion models used for image and video generation.
  • Cross-View Geo-Localization. Researchers have created a framework to address the challenges of cross-view geo-localization, including variations in viewpoints and large-scale global contexts.
  • A statistical approach to model evaluations. When two models are evaluated on a benchmark, declaring one superior to the other is often done without strong statistical confidence. This research from Anthropic introduces robust statistical methods to reliably determine when one model genuinely outperforms the other (a worked example follows this list).
  • Software is a team sport. GitHub Copilot, utilized by over 2.8 million developers, enhances the development experience with AI-powered features such as code completion, debugging, and secure code reviews. Developers can select AI models from providers like OpenAI and Google within Visual Studio Code. Integration with Azure and tools like GitHub Actions streamlines cloud deployments and continuous integration/continuous deployment (CI/CD) processes.
  • Prompt Injecting Your Way To Shell: OpenAI’s Containerized ChatGPT Environment. This article examines the interactive features of OpenAI’s Debian-based sandbox environment for ChatGPT, revealing surprising details about its structure. Users can run Python scripts, manage files, and possibly expose core instructions through prompt engineering. These capabilities have sparked debates around transparency and privacy. While designed as intentional features, OpenAI does not consider them security vulnerabilities unless they result in breaches of the sandbox environment.
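
For the HtmlRAG item above, the first pruning step (stripping HTML that adds length but no retrievable content) can be approximated in a few lines with BeautifulSoup. This is a hedged simplification of that cleaning step only; the block-tree pruning is not shown.

```python
from bs4 import BeautifulSoup, Comment

def clean_html(raw_html: str) -> str:
    """Drop elements that add length but no retrievable content, while keeping
    the structural tags (headings, tables, lists) that HtmlRAG argues are worth
    preserving."""
    soup = BeautifulSoup(raw_html, "html.parser")

    # Remove non-content elements entirely.
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()

    # Remove HTML comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Drop attributes (classes, inline styles, tracking ids) but keep the tags.
    for tag in soup.find_all(True):
        tag.attrs = {}

    return str(soup)
```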
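
Run-Length Tokenization (the “Don’t Look Twice” item above) can be sketched as: compare the patch tokens of consecutive frames, keep only the ones that changed, and record how many frames each kept token stands in for. The version below is a simplified, hypothetical take that uses a plain mean-squared-difference threshold tau rather than the paper’s exact criterion.

```python
import torch
import torch.nn.functional as F

def run_length_tokenize(frames: torch.Tensor, patch: int = 16, tau: float = 1e-3):
    """frames: (T, C, H, W) video clip.

    Returns the kept patch tokens and a matching tensor of run lengths
    (how many consecutive frames each kept token represents). Simplified rule:
    a patch is 'unchanged' if its mean squared difference from the same patch
    in the previous frame is below tau.
    """
    T = frames.shape[0]
    # (T, num_patches, C*patch*patch)
    tokens = F.unfold(frames, patch, stride=patch).transpose(1, 2)
    num_patches = tokens.shape[1]

    kept, lengths = [], []
    active = [None] * num_patches          # index into `kept` of each patch's live token
    for t in range(T):
        for p in range(num_patches):
            unchanged = (
                t > 0
                and (tokens[t, p] - tokens[t - 1, p]).pow(2).mean() < tau
            )
            if unchanged:
                lengths[active[p]] += 1    # extend the run of the existing token
            else:
                kept.append(tokens[t, p])  # emit a new token for this patch
                lengths.append(1)
                active[p] = len(kept) - 1
    return torch.stack(kept), torch.tensor(lengths)
```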
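
The NCC stage of the image-matching item above relies on plain normalized cross-correlation. The helpers below show the textbook computation and a small search window for nudging a keypoint to the best-scoring offset; this is a generic illustration, and refine_keypoint is a hypothetical helper, not the authors’ code.

```python
import numpy as np

def ncc(patch_a: np.ndarray, patch_b: np.ndarray) -> float:
    """Normalized cross-correlation between two equally sized grayscale patches.
    Returns a value in [-1, 1]; higher means a better photometric match."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def refine_keypoint(img_a, img_b, kp_a, kp_b, half=7, search=3):
    """Shift kp_b within a small window to maximize NCC against the patch
    around kp_a. Assumes integer keypoints lying at least half + search
    pixels away from the image borders."""
    ya, xa = kp_a
    ref = img_a[ya - half:ya + half + 1, xa - half:xa + half + 1]
    best, best_kp = -1.0, kp_b
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yb, xb = kp_b[0] + dy, kp_b[1] + dx
            cand = img_b[yb - half:yb + half + 1, xb - half:xb + half + 1]
            if cand.shape == ref.shape:
                score = ncc(ref, cand)
                if score > best:
                    best, best_kp = score, (yb, xb)
    return best_kp, best
```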
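
For the Anthropic evaluation item above, a minimal version of the idea is a paired analysis of per-question score differences. The snippet computes the accuracy gap between two models on the same questions together with a 95% confidence interval; it is a simplified stand-in for the post’s fuller statistical treatment.

```python
import numpy as np
from scipy import stats

def compare_models(scores_a: np.ndarray, scores_b: np.ndarray, alpha: float = 0.05):
    """scores_a, scores_b: per-question scores (e.g. 0/1 correctness) of two models
    on the SAME benchmark questions. Returns the mean gap and its confidence interval."""
    diffs = scores_a - scores_b                   # paired differences, one per question
    mean_gap = diffs.mean()
    se = diffs.std(ddof=1) / np.sqrt(len(diffs))  # standard error of the mean gap
    z = stats.norm.ppf(1 - alpha / 2)
    return mean_gap, (mean_gap - z * se, mean_gap + z * se)

# Toy usage: 1000 questions, model A correct ~78% of the time, model B ~75%.
rng = np.random.default_rng(0)
a = rng.binomial(1, 0.78, 1000).astype(float)
b = rng.binomial(1, 0.75, 1000).astype(float)
gap, (lo, hi) = compare_models(a, b)
print(f"accuracy gap = {gap:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```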

Perspectives

Meme of the week

What do you think about it? Did any news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects, and you can also subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

or you may be interested in one of my recent articles:


Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence
