WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 25 November — 1 December

Amazon’s $4 Billion Investment in Anthropic, OpenAI’s Sora Video Generator Leak, OLMo 2, AI Fixes Quantum Errors, ElevenLabs Introduces Podcast Creation Feature, Alibaba Releases QwQ-32B-Preview Model, and much more

Salvatore Raieli
Dec 3, 2024 · 20 min read

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. All the Weekly News stories are also collected here:

Weekly AI and ML news - each week the best of the field


Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.

Research

  • Learning high-accuracy error decoding for quantum processors. A new AI-driven decoder has established a state-of-the-art benchmark for detecting errors in quantum computers. Leveraging a transformer architecture, AlphaQubit achieved a 6% reduction in errors compared to tensor-network methods and a 30% reduction compared to correlated matching on the Sycamore data. It also demonstrated promising performance in simulations with larger systems of up to 241 qubits. While this marks substantial progress in quantum error correction, the system requires speed enhancements before it can support real-time error correction in practical quantum computing applications.
  • The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use. This work examines Claude 3.5’s computer use capabilities across various domains and software, offering a ready-to-use agent framework for deploying API-based GUI automation models. Claude 3.5 showcases an exceptional ability to perform end-to-end tasks, translating language inputs into desktop actions seamlessly.
  • Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations. The paper proposes five statistical recommendations for improving the evaluation of performance differences in LLMs. These include using the Central Limit Theorem to estimate theoretical averages over all possible questions rather than relying on observed averages, clustering standard errors when questions are related instead of treating them as independent, reducing variance within questions through resampling or next-token probabilities, analyzing paired differences between models by leveraging shared questions across evaluations, and conducting power analysis to determine sufficient sample sizes for identifying meaningful differences. The authors suggest that these approaches will help researchers better identify whether performance differences reflect genuine capability gaps or are merely due to chance, resulting in more accurate and reliable model evaluations. A minimal sketch of the clustered-standard-error and paired-difference calculations appears after this list.
  • Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions. Marco-o1 is a reasoning model designed for open-ended solutions, leveraging Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and advanced reasoning strategies. It achieves accuracy gains of +6.17% on the MGSM (English) dataset and +5.60% on the MGSM (Chinese) dataset.
  • Cut Your Losses in Large-Vocabulary Language Models. The paper introduces Cut Cross-Entropy (CCE), a method designed to drastically reduce memory usage in LLM training by optimizing the computation of the cross-entropy loss. Traditional cross-entropy layers can consume up to 90% of memory in some models by storing logits for the entire vocabulary. CCE instead computes the logit only for the correct token and evaluates the log-sum-exp over all logits on the fly, without ever materializing the full logit matrix in global memory. This reduces the memory footprint of the loss computation for Gemma 2 from 24GB to roughly 1MB. By leveraging the sparsity of the softmax, it also skips elements that have negligible impact on gradients. The authors demonstrate that CCE achieves this substantial memory reduction without affecting training speed or convergence, allowing larger batch sizes and potentially more efficient scaling of LLM training. A chunked log-sum-exp sketch illustrating the core idea appears after this list.
  • AIGS: Generating Science from AI-Powered Automated Falsification. The study presents a multi-agent system for automated scientific discovery, focusing on falsification through automated ablation studies. Tested on three machine learning tasks — data engineering, self-instruct alignment, and language modeling — the system successfully generated meaningful scientific insights. However, its performance remains inferior to that of experienced human researchers.
  • Does Prompt Formatting Have Any Impact on LLM Performance? The study investigates how different prompt formats (plain text, Markdown, JSON, and YAML) influence GPT model performance across various tasks. It finds that GPT-3.5-turbo’s performance can vary by up to 40% depending on the format, whereas larger models like GPT-4 are more resilient to such changes. There is no universally optimal format across models or tasks; for example, GPT-3.5-turbo performed better with JSON, while GPT-4 favored Markdown. Models within the same family exhibited similar format preferences, but these preferences did not translate well to different model families. The findings highlight the significant impact of prompt formatting on model performance, emphasizing the importance of considering format choice during prompt engineering, model evaluation, and application development.
  • Why ‘open’ AI systems are actually closed, and why this matters. This paper examines what ‘open’ means in artificial intelligence, arguing that claims about ‘open’ AI often lack precision and that openness alone does little to redistribute the concentrated power of the largest AI companies.
  • Qwen’s first reasoning-inspired model QwQ. Qwen has introduced a 32B parameter reasoning model that rivals OpenAI’s o1 series in performance. The model demonstrates scalability when generating extended reasoning traces and is proficient in mathematics and coding. It is now available for use.
  • Pathways on the Image Manifold: Image Editing via Video Generation. In the early days of image synthesis, exploring the latent space was an effective method for creating diverse images. This concept has now extended to video, enabling sequential edits to a single image while preserving semantic consistency.
  • Low-Bit Quantization Favors Undertrained LLMs. Models trained for shorter durations on fewer tokens show less performance degradation when quantized after training. This aligns with findings from other research, suggesting that extended training allows models to utilize higher precision to compress increasingly complex information.
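
To make the error-bar recommendations concrete, here is a minimal sketch of two of them: cluster-robust standard errors and paired-difference comparison on a shared question set. It is written in plain NumPy with hypothetical data and function names of my own; it illustrates the statistics, and is not code from the paper.

```python
import numpy as np

def clustered_se(scores, cluster_ids):
    """Cluster-robust standard error of the mean: questions that share a
    cluster (e.g. a common reading passage) are not treated as independent."""
    scores = np.asarray(scores, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    resid = scores - scores.mean()
    cluster_sums = np.array(
        [resid[cluster_ids == c].sum() for c in np.unique(cluster_ids)]
    )
    return np.sqrt((cluster_sums ** 2).sum()) / len(scores)

def paired_difference(scores_a, scores_b):
    """Mean difference and its standard error when two models answered the
    same questions; pairing removes the shared per-question variance."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return diffs.mean(), diffs.std(ddof=1) / np.sqrt(len(diffs))

# Hypothetical 0/1 correctness scores for two models on a shared eval set.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=500)
b = np.minimum(a + (rng.random(500) < 0.05).astype(int), 1)

mean_diff, se = paired_difference(b, a)
print(f"B - A = {mean_diff:.3f} +/- {1.96 * se:.3f} (95% CI)")
print(f"clustered SE of A: {clustered_se(a, np.repeat(np.arange(100), 5)):.4f}")
```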
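
For Cut Cross-Entropy, the sketch below illustrates the core memory trick under simplifying assumptions: compute only the correct-token logit directly, and accumulate the log-sum-exp over vocabulary chunks so the full [N, V] logit matrix never exists at once. The real method implements this as a fused GPU kernel, including the backward pass; under naive autograd a loop like this still caches chunk activations, so treat it as an illustration rather than the paper's implementation.

```python
import torch

def chunked_cross_entropy(hidden, weight, targets, chunk_size=8192):
    """Cross-entropy without materializing the full [N, V] logit matrix.

    hidden:  [N, D] final hidden states
    weight:  [V, D] unembedding / classifier matrix
    targets: [N]    correct token ids
    """
    # Logit of the correct token only: a row-wise dot product, O(N*D) memory.
    correct_logit = (hidden * weight[targets]).sum(-1)              # [N]

    # Log-sum-exp accumulated over vocabulary chunks, so at most
    # [N, chunk_size] logits exist at any one time.
    lse = torch.full_like(correct_logit, float("-inf"))
    for start in range(0, weight.size(0), chunk_size):
        logits = hidden @ weight[start:start + chunk_size].T        # [N, C]
        lse = torch.logaddexp(lse, torch.logsumexp(logits, dim=-1))
    return (lse - correct_logit).mean()                             # mean NLL

# Hypothetical sizes: 1k tokens, 1k hidden dim, 32k vocabulary.
h = torch.randn(1024, 1024)
W = torch.randn(32_000, 1024)
y = torch.randint(0, 32_000, (1024,))
print(chunked_cross_entropy(h, W, y))
```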

News

  • Juna.ai wants to use AI agents to make factories more energy-efficient. AI agents are all the rage, a trend driven by the generative AI and large language model (LLM) boom of the past few years. Getting people to agree on what exactly AI agents are is a challenge, but most contend they are software programs that can be assigned tasks and given decisions to make, with varying degrees of autonomy.
  • UK government failing to list use of AI on the mandatory register. The technology secretary admits Whitehall departments are not being transparent over the way they use AI and algorithms.
  • Reddit overtakes X in popularity of social media platforms in UK. The discussion platform takes fifth place in the rankings and is the fastest-growing large social media platform in the UK.

Resources

  • An Empirical Study on LLM-based Agents for Automated Bug Fixing. The study evaluates seven top LLM-based bug-fixing systems on the SWE-bench Lite benchmark, identifying MarsCode Agent by ByteDance as the best performer with a 39.33% success rate. It highlights that line-level fault localization accuracy is more crucial than file-level accuracy for error localization, and bug reproduction capabilities play a significant role in fixing success. Notably, 24 out of 168 resolved issues required reproduction techniques, though these sometimes misled LLMs when issue descriptions were already clear. The study concludes that improving LLM reasoning abilities and refining agent workflows are essential for advancing automated bug fixing.
  • FinRobot: AI Agent for Equity Research and Valuation with Large Language Models. The framework introduces an AI agent system for equity research that utilizes multi-agent Chain-of-Thought (CoT) prompting to integrate data analysis with human-like reasoning, producing professional investment reports comparable to those from major brokerages. It employs three specialized agents: the Data-CoT Agent, which aggregates diverse data sources for comprehensive financial integration; the Concept-CoT Agent, which mimics an analyst’s reasoning to derive actionable insights; and the Thesis-CoT Agent, which synthesizes these insights into a cohesive investment thesis and report.
  • Bi-Mamba: Towards Accurate 1-Bit State Space Models. The scalable 1-bit Mamba architecture is designed to optimize LLM efficiency across multiple model sizes (780M, 1.3B, and 2.7B). Bi-Mamba delivers performance comparable to full-precision formats like FP16 and BF16, while drastically reducing memory usage. It also achieves higher accuracy than post-training binarization Mamba baselines.
  • OpenScholar: Scientific literature synthesis with retrieval-augmented language models. Ai2 has introduced OpenScholar, a retrieval-augmented language model designed to search for relevant academic papers and provide answers based on those sources, streamlining the process for scientists to locate and synthesize information.
  • Detecting Human Artifacts from Text-to-Image Models. This study addresses the issue of distorted human figures in text-to-image models by presenting the Human Artifact Dataset (HAD), a comprehensive dataset containing more than 37,000 annotated images.
  • UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages. UnifiedCrawl is a method that efficiently gathers extensive text data for low-resource languages from the Common Crawl corpus, utilizing minimal computational resources. This approach filters and extracts relevant data, resulting in monolingual datasets significantly larger than previously available sources.
  • A New Image-to-Video Model. Researchers have created image-to-video diffusion models capable of generating realistic motion transformations from static images, overcoming the constraints of traditional approaches such as affine transformations.
  • AIMv2: New Vision Models. The AIMv2 vision model family employs a multimodal autoregressive training approach, delivering remarkable performance across various tasks.
  • A New Attention Mechanism for Training LLMs. AnchorAttention is an improved attention mechanism designed for long-context training of LLMs.
  • Combining Convolutions and Self-Attentions for Efficient Vision Models. GLMix is a novel approach that combines convolutions and multi-head self-attentions (MHSAs) at varying granularity levels for vision tasks. Convolutions capture fine-grained local details, while MHSAs focus on coarse-grained semantic slots to provide global context.
  • Echo Mimic v2. An open-weights system that animates partial human bodies from a reference image and an audio input. It uses pose-specific VAEs to fuse information from the different input channels with the reference image to drive the animation.
  • LTX-Video. LTX-Video is the first DiT-based video generation model that can produce high-quality videos in real time, generating 24 FPS video at 768x512 resolution faster than it takes to watch it. The model is trained on a large-scale dataset of diverse videos and generates realistic, varied content at high resolution.
  • Documind. Documind utilizes AI to extract structured data from PDFs by converting them into images and leveraging OpenAI’s API.
  • Coalescence: making LLM inference 5x faster. Coalescence is a framework that accelerates LLM inference by up to 5x when producing structured outputs like JSON. It achieves this by compiling structured formats into finite-state machines and merging redundant paths that yield the same output, cutting the number of LLM calls needed. Although this approach greatly enhances speed, output quality must be preserved by ensuring the optimization does not discard more likely sequences. (A generic sketch of the FSM idea appears after this list.)
  • WildLMa: Long Horizon Loco-Manipulation in the Wild. WildLMa is a framework designed to enable quadruped robots to perform advanced manipulation tasks in real-world settings. It integrates three core components: a whole-body controller for teleoperation via VR, a skill library learned through imitation learning (WildLMa-Skill), and a language model-based planner (WildLMa-Planner) that organizes these skills for long-term tasks. The researchers showcase its application in tasks such as cleaning trash from hallways and rearranging bookshelf items. The framework proves effective across various environments and object setups.
  • MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective. MMGenBench is a novel evaluation framework for large multimodal models, emphasizing their capacity to generate and interpret images. In this process, models produce descriptions from input images, which are subsequently used to generate new images for comparison.
  • Moondream Python Client Library. Moondream’s Python client library provides tools for image analysis and querying, featuring CPU-optimized inference. However, it is not yet suitable for GPU or Mac M1/M2/M3 users. The library can be installed using pip, and model weights are available for download in various formats, including int8, fp16, and int4.
  • Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer. Sana is a highly efficient image generation model capable of producing high-quality 1024x1024 images in under a second on a laptop GPU. Its innovations include a 32x image compression autoencoder (DC-AE), linear attention replacing traditional attention in DiT, a decoder-only LLM for text encoding, and improved training and sampling techniques. The 0.6B parameter model rivals or surpasses much larger models like Flux-12B, despite being 20x smaller and 100x faster. Requiring only 9GB of VRAM for inference, Sana-0.6B is accessible on consumer hardware. The repository provides code for training, inference, and evaluation, offering both 0.6B and 1.6B model variants. (A generic linear-attention sketch appears after this list.)
  • Flow Models. A great introduction to flow-based modeling, which is a theoretical improvement over diffusion.
  • Building an AI-Powered Game. This is a course by Andrew Ng, Latitude, and Together AI on how to make an AI-powered game.
  • Sharper Infrared Images. This project improves image super-resolution for infrared images, addressing issues where traditional methods distort spectral fidelity.
  • Mochi 1 LoRA Fine-tuner. Mochi 1, a top open-source video model, supports LoRA fine-tuning and operates on a single GPU. The repository demonstrates various applications, such as creating custom effects and ensuring character consistency.
  • OneDiffusion. OneDiffusion is a versatile large-scale diffusion model capable of handling various tasks, including text-to-image generation, image editing, and reverse processes such as depth estimation and segmentation.
  • customized-flash-attention. A FlashAttention fork that supports ragged Q/V matrix sizes.
  • Novel View Synthesis. MVGenMaster is a multi-view diffusion model that enhances Novel View Synthesis tasks by incorporating 3D priors.
  • FlowMol: Mixed Continuous and Categorical Flow Matching for 3D De Novo Molecule Generation. This work benchmarks discrete flow matching methods for generating novel 3D molecular structures, critical for chemical discovery.
  • From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge. This project investigates the growing “LLM-as-a-judge” approach, where large language models are utilized for scoring, ranking, and selection tasks in diverse AI and NLP applications.
  • aisuite. An easy way to work with a variety of API-based models in a single packaged environment.
  • Star Attention: Efficient LLM Inference over Long Sequences. Star Attention introduces a block-sparse method to accelerate Transformer-based large language models (LLMs) during long-sequence inference.
  • SketchAgent. SketchAgent utilizes a multimodal LLM to enable language-guided, step-by-step sketch generation using an intuitive sketching language. It can create diverse sketches, interact with humans for collaborative sketching, and edit content through chat.
  • DROID-Splat. A deep learning-based dense visual SLAM framework capable of real-time global pose optimization and 3D reconstruction.
  • P2DFlow. P2DFlow is a protein ensemble generative model with SE(3) flow matching based on ESMFold; the ensembles it generates could aid in understanding protein function across various scenarios.
  • ThunderMittens For Your ThunderKittens. Hazy Research has played a significant role in optimizing hardware utilization for AI workloads. They have now extended their impressive ThunderKittens kernel-writing framework to support Apple Silicon.
  • DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving. A truncated diffusion model for end-to-end autonomous driving that runs at 45 FPS on an NVIDIA RTX 4090.
  • PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion-based Image Super-Resolution. PassionSR introduces an approach that makes diffusion-based image super-resolution (SR) models more hardware-friendly.
  • Training Open Instruction-Following Language Models. This repo serves as an open effort on instruction-tuning popular pre-trained language models on publicly available datasets.
  • Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment. Grounding-IQA is an innovative method for image quality assessment (IQA) that combines location-specific grounding with multimodal descriptions.
  • Steel Browser API for AI Agents. The open-source browser API built for AI agents. Steel provides a REST API to control headless browsers with session management, proxy support, and anti-detection features. Perfect for web automation, scraping, and building AI agents that can interact with the web.
  • PixMo dataset. Allen AI has released several datasets that were used to train its visual language models.
  • StableAnimator: High-Quality Identity-Preserving Human Image Animation. StableAnimator introduces a breakthrough in human image animation by ensuring identity consistency in generated videos.
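
As promised above, here is a generic sketch of the finite-state-machine idea behind Coalescence-style structured generation. The transition table and the generate helper are hypothetical illustrations, not the framework's actual API; the point is that at states where only one token is legal, the generator can emit it without calling the LLM at all.

```python
# Minimal sketch of FSM-guided structured generation. The transition table
# and `generate` helper are hypothetical, not the Coalescence API.

# FSM for the JSON prefix {"name":"  -- tokenized per character for clarity.
# Each state maps legal tokens to the next state.
fsm = {
    0: {'{': 1},
    1: {'"': 2},
    2: {'n': 3},   # the key "name" is fixed, so every step here
    3: {'a': 4},   # has exactly one legal continuation
    4: {'m': 5},
    5: {'e': 6},
    6: {'"': 7},
    7: {':': 8},
    8: {'"': 9},   # after this, the value begins and the LLM must choose
}

def generate(llm_choose, state=0):
    """Walk the FSM, calling the model only when more than one token is legal."""
    out = []
    while state in fsm:
        legal = fsm[state]
        if len(legal) == 1:
            # Deterministic transition: append the token, skip the LLM call.
            token, state = next(iter(legal.items()))
        else:
            token = llm_choose(legal)  # model picks among the legal tokens
            state = legal[token]
        out.append(token)
    return "".join(out)

# Every transition above is deterministic, so zero LLM calls are needed.
print(generate(llm_choose=None))  # -> {"name":"
```

As the original write-up warns, collapsing paths changes which sequences the model actually scores, so a real implementation has to take care not to skew the output distribution.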
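
Since several resources above, Sana in particular, lean on linear attention for their speed, a generic reference sketch may be useful. It follows the standard kernel-feature-map formulation (phi = ELU + 1, as in Katharopoulos et al.); it is not Sana's actual module, and all shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention: softmax(Q K^T) V is replaced by phi(Q) (phi(K)^T V)
    with a positive feature map phi, so no [N, N] score matrix is built."""
    q = F.elu(q) + 1                                  # phi(Q), stays positive
    k = F.elu(k) + 1                                  # phi(K)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)        # [B, H, D, E], one O(N) pass
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

# 1024 tokens per head: cost grows linearly in sequence length, and the
# [N, N] attention-score matrix is never materialized.
q = torch.randn(1, 8, 1024, 64)
out = linear_attention(q, torch.randn_like(q), torch.randn_like(q))
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```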

Perspectives

  • Jeff Jarvis: ‘Elon Musk’s investment in Twitter seemed insane, but it gave him this power’. The US media pundit on the dangers of overregulation online, why he’s more frightened of the tech bros than AI itself, and how to reclaim the web by getting rid of the geeks.
  • Passwords are giving way to better security methods — until those are hacked too, that is. It’s a war that will never end. But for small-business owners, it’s all about managing risk while reaping rewards
  • Gwern Branwen — How an Anonymous Researcher Predicted AI’s Trajectory. In this post, Gwern Branwen, an early advocate of LLM scaling, explores AI advancements and their influence on the path to AGI. He highlights the significance of scaling and computational power over traditional algorithmic innovations. Branwen reflects on the interplay between human intelligence and AI, as well as the societal implications of upcoming technologies, such as the effect of weight-loss drugs on behavior. Additionally, he offers thoughts on his writing process and the transformative effects of AI on creative endeavors.
  • The Bitter Religion: AI’s Holy War Over Scaling Laws. The AI community is currently divided over the emphasis on scaling computation as the primary driver of AI performance, a concept often tied to “The Bitter Lesson.” Proponents, including leaders at OpenAI, believe that artificial general intelligence (AGI) is achievable in the near future through continued scaling of computational resources. Others argue that alternative scientific advances are needed, as scaling laws may not hold in the long term. This debate significantly influences investment and development strategies across AI and related fields.
  • Why LLMs Within Software Development May Be a Dead End. LLMs in software development face challenges due to their lack of decomposability and explainability.
  • How the far right is weaponizing AI-generated content in Europe. Experts say fake images raising fears around issues such as immigration have proliferated since EU elections
  • ‘What many of us feel’: why ‘enshittification’ is Macquarie Dictionary’s word of the year. The committee’s honorable mentions went to ‘right to disconnect’ and ‘rawdogging’
  • Valuing Humans in the Age of Superintelligence: HumaneRank. AI’s ability to exceed human intellectual output could result in economic displacement. The proposed HumaneRank system addresses this by allowing individuals to allocate endorsements that represent societal value, influencing resource distribution. This approach preserves market dynamics and personal freedom while offering a new way to value human contributions in an AI-driven world.
  • Something weird is happening with LLMs and chess. This article examines how various LLMs perform in playing chess. Most models falter after a few moves, except for GPT-3.5-turbo-instruct, which excels. This indicates that instruction tuning might impair chess capabilities or that GPT-3.5-turbo-instruct was trained on more chess-related data. Additionally, tokenizer handling issues could be affecting model performance.
  • Amazon, Google and Meta are ‘pillaging culture, data and creativity’ to train AI, Australian inquiry finds. Among the report’s 13 recommendations is the call for the introduction of standalone AI legislation and protections for creative workers
  • When we become cogs. AI enhances material scientists’ efficiency, driving a 44% rise in material discoveries but reducing work satisfaction by 44% due to fewer opportunities for idea generation. Similarly, GitHub Copilot boosts productivity for less experienced developers, shifting their focus from project management to coding. While AI helps bridge skill gaps, it risks alienation by automating creative tasks, mirroring the effects of automation in other industries.
  • AI Alone Isn’t Ready for Chip Design. Hybrid methods blending classical search techniques with machine learning are proving effective in addressing the challenges of chip design, especially in floorplanning. While AI alone faces difficulties with multi-constraint scenarios, incorporating AI to guide search-based algorithms, such as simulated annealing, improves both efficiency and performance. This synergy accelerates the design process and facilitates the development of more intricate chip solutions.
  • In the big data era, prioritize statistical significance in study design. Analysis of neuroimaging studies shows that close attention to experimental design can increase the statistical robustness of research results.
  • AI could pose pandemic-scale biosecurity risks. Here’s how to make it safer. AI-enabled research might cause immense harm if it is used to design pathogens with worrying new properties. To prevent this, we need better collaboration between governments, AI developers, and experts in biosafety and biosecurity.
  • Don’t let watermarks stigmatize AI-generated research content. Given the increasing integration of LLMs into research processes, identifying their contributions transparently is ever more urgent. But watermarking risks fostering a reductive and binary view of content as either ‘pure’ or ‘tainted’ depending on whether it is human- or LLM-generated.
  • It’s Surprisingly Easy to Jailbreak LLM-Driven Robots. RoboPAIR is an algorithm capable of bypassing safety guardrails in robots powered by LLMs, effectively jailbreaking these systems. Tests demonstrated a 100% success rate in compromising platforms like the Go2 self-driving simulator and robot dogs. This highlights critical security vulnerabilities, underscoring the urgent need for stronger defenses against LLM-based robot hacking.
  • A new AI scaling law shell game? Recent changes in AI scaling laws have exposed limits in predictability and effectiveness, with newer models falling short of previous expectations. Microsoft CEO Satya Nadella emphasizes “inference time compute” as a key area to address, though issues of cost and reliability remain. Advancing beyond scaling is essential, and LLMs should be integrated into a more comprehensive AI strategy.

Meme of the week

What do you think? Did any of this news capture your attention? Let me know in the comments.

If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository, which contains weekly updated ML & AI news. I am open to collaborations and projects, and you can subscribe for free to get notified when I publish a new story.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

or you may be interested in one of my recent articles:

Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence
