WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES
AI & ML news: Week 7–13 October
AI takes 2 Nobel Prizes, OpenAI’s DevDay Introduces Realtime API for AI Developers, Google Adds Ads to AI Overviews, and much more
The most interesting news, repositories, articles, and resources of the week
Check and star this repository where the news will be collected and indexed:
You will find the news first on GitHub. All the Weekly News stories are also collected here:
Research
- A multimodal generative AI copilot for human pathology. PathChat is a vision-language AI assistant designed for pathology, combining a foundational vision encoder and a large language model, achieving state-of-the-art performance on diagnostic tasks and outperforming other multimodal AI systems, with potential applications in education, research, and clinical decision-making.
- Meta Movie Gen. Meta has developed a cutting-edge movie model with 30 billion parameters, which required 6,144 H100 GPUs for training. The model was trained using 1 billion images and 100 million carefully selected videos. Notably, it is based on a temporal autoencoder and uses flow matching with a Llama-style backbone. Meta also published a highly detailed 92-page research paper, making it one of the most comprehensive reports on the subject.
- When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1. Large language models face limitations because they rely on next token prediction. Although OpenAI’s o1 model was trained with a new objective focused on reasoning traces, it still exhibits some of the same constraints associated with next token prediction.
- Contextual Document Embeddings. This paper presents a method similar to a neural TF-IDF, as it gathers information from the entire corpus rather than relying on individual document embeddings. It effectively captures contextual information from surrounding documents and has achieved state-of-the-art results on the MTEB benchmark.
- PairDistill: Pairwise Relevance Distillation for Dense Retrieval. This project introduces a novel technique called Pairwise Relevance Distillation (PairDistill), aimed at enhancing the accuracy of dense retrieval methods.
- Modeling relationships to solve complex problems efficiently. Associate Professor Julian Shun develops high-performance algorithms and frameworks for large-scale graph processing.
- Factual Accuracy in AI. Integrative Decoding is a technique designed to improve the factual accuracy of large language models, particularly for open-ended tasks. This method helps ensure more reliable and accurate outputs by refining the model’s ability to integrate information during generation.
- Dynamic Diffusion Transformer. The Dynamic Diffusion Transformer (DyDiT) improves the efficiency of diffusion models in image generation by building on the Diffusion Transformer (DiT). It achieves this by dynamically adjusting computational resources across different timesteps and spatial regions, minimizing redundancy and optimizing performance.
- Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach. The Frame-Aware Video Diffusion Model (FVDM) enhances video generation by overcoming the limitations of existing models. Instead of using a single timestep for the entire video clip, FVDM introduces a vectorized timestep variable, enabling each frame to follow its own noise schedule. This approach improves the quality and coherence of generated videos.
- What Matters for Model Merging at Scale? Model merging is a technique that allows the combination of two models to achieve the performance benefits of both. However, it does not always scale effectively with larger model sizes. This paper investigates the requirements and challenges for making model merging work efficiently with very large models, addressing issues related to scalability, performance trade-offs, and optimal merging strategies.
- nGPT: Normalized Transformer with Representation Learning on the Hypersphere. A significant amount of research effort is focused on normalizing the internal representations of language models. This study demonstrates that by placing every internal vector on a hypersphere, convergence time is significantly reduced for models of reasonable size, leading to more efficient training.
- Genomic Foundation Model Benchmarking. GFMBench is a newly developed framework aimed at tackling challenges in the development of genomic foundation models (GFMs) by offering standardized benchmarking tools. It supports the evaluation of GFMs with millions of genomic sequences and hundreds of tasks, automating the benchmarking process for open-source GFMs to streamline their development and comparison.
- LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. This study provides further evidence that language models internally encode signals when they produce non-factual information. Understanding these internal cues can help guide models more effectively and reduce the occurrence of hallucinations, offering a potential strategy for improving their reliability.
- Differential Transformer. Transformers often over-allocate attention to irrelevant context, leading to inefficiencies. This research presents the Diff Transformer, which enhances attention to relevant information while filtering out noise. It introduces a differential attention mechanism that computes attention scores by subtracting two separate softmax attention maps. This subtraction effectively cancels out noise and encourages sparse, more focused attention patterns, improving the model’s performance on tasks requiring precise context understanding.
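The differential attention mechanism in the Diff Transformer item above can be illustrated with a toy NumPy sketch. This is a simplified single-head version under assumptions of my own: the paper learns the scaling factor λ per head and adds normalization, while here λ is a fixed constant and all projection matrices are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Differential attention: the difference of two softmax attention maps.

    Subtracting the second map cancels attention noise common to both,
    encouraging sparser, more focused attention patterns.
    """
    d = Wq1.shape[1]
    a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))
    a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))
    return (a1 - lam * a2) @ (x @ Wv)

rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.normal(size=(n, d))
Ws = [rng.normal(size=(d, d)) for _ in range(5)]
out = diff_attention(x, *Ws)
print(out.shape)  # (4, 8)
```

Because each softmax row sums to 1, every row of the differential map sums to 1 − λ, which is what lets shared noise cancel out term by term.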
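The core operation behind the nGPT item above, constraining internal vectors to a hypersphere, is just L2 normalization of each representation. A minimal sketch (the actual paper applies this throughout the network with learned scaling; this only shows the projection step):

```python
import numpy as np

def to_hypersphere(v, eps=1e-8):
    # Project each row vector onto the unit hypersphere (L2-normalize).
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / (norm + eps)

h = np.array([[3.0, 4.0], [0.5, 0.0]])
u = to_hypersphere(h)
print(np.linalg.norm(u, axis=-1))  # each norm is ~1.0
```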
News
- Brave New World: Leo AI and Ollama Bring RTX-Accelerated Local LLMs to Brave Browser Users. Nvidia’s RTX acceleration combined with Ollama allows local models to run in the browser.
- Liquid Foundation Models. Liquid AI has introduced its first generation of Liquid Foundation Models (LFMs), offering state-of-the-art performance while minimizing memory consumption. The LFMs, which are optimized for different hardware platforms, include 1B, 3B, and 40B parameter models. These models are already accessible on platforms like LIQUID PLAYGROUND and will soon be available on Cerebras. They are particularly adept at processing sequential data and provide innovations in efficiency and scalability across industries like financial services and biotechnology.
- Introducing Copilot Labs and Copilot Vision. Microsoft is launching Copilot Labs to test advanced AI tools, including Think Deeper and Copilot Vision. These tools aim to expand the capabilities of their AI systems, offering enhanced functionality and deeper insights.
- OpenAI’s DevDay brings Realtime API and other treats for AI app developers. It’s been a tumultuous week for OpenAI, full of executive departures and major fundraising developments, but the startup is back at it, trying to convince developers to build tools with its AI models at its 2024 DevDay. The company announced several new tools Tuesday, including a public beta of its “Realtime API” for building apps with low-latency, AI-generated voice responses. It’s not quite ChatGPT’s Advanced Voice Mode, but it’s close.
- Microsoft brings AI-powered overviews to Bing. Microsoft has introduced Bing generative search, an AI-driven feature that gathers and summarizes information from the web, offering users more concise and aggregated search results.
- KoBold Metals, which uses AI to help find critical minerals for the energy transition, raises $491M. Earlier this year, KoBold Metals found what might be one of the largest high-grade copper deposits of all time, with the potential to produce hundreds of thousands of metric tons per year, the company’s CEO said.
- OpenAI gets $4 billion revolving credit line, giving it more than $10 billion in liquidity. OpenAI has secured over $10 billion in liquidity, achieving a valuation of $157 billion following its latest funding round. The company raised $6.6 billion from key investors, including Microsoft and Nvidia, but is contending with substantial operational costs, particularly the need for additional GPUs to support large language model (LLM) training. OpenAI is currently exploring restructuring strategies to enhance financial growth and sustainability within the AI industry.
- Black Forest Labs, the startup behind Grok’s image generator, releases an API. Black Forest Labs, the Andreessen Horowitz-backed startup behind the image generation component of xAI’s Grok assistant, has launched an API in beta — and released a new model.
- DataPelago raises $47M to optimize hardware for analytical workloads. LLMs depend on vast amounts of unstructured data for training, but this data requires extensive cleaning and processing before it becomes useful. Traditional data processing systems, which are based on CPUs and current software architectures, were not designed to handle the scale and complexity of such data, resulting in slow and costly data preparation that hinders AI development. To address these challenges, DataPelago has introduced a Universal Data Processing Engine, designed to overcome performance, cost, and scalability limitations, making AI development faster and more affordable.
- Google brings ads to AI Overviews as it expands AI’s role in search. Google will begin to show ads in AI Overviews, the AI-generated summaries it supplies for certain Google Search queries, and will add links to relevant web pages for some of those summaries as well. It’s also rolling out AI-organized search results pages in the U.S. this week.
- Nobel Physics Prize Awarded for Pioneering A.I. Research by 2 Scientists. Two scientists who contributed to the development of neural networks have been awarded the Nobel Prize in Physics, recognizing their groundbreaking work in advancing artificial intelligence and neural network technologies.
- Introducing the Message Batches API. Anthropic has introduced a new batch processing API that allows developers to submit batches of up to 10,000 queries at once. Each batch is processed within 24 hours and is 50% cheaper than standard API calls, making it a more efficient and cost-effective solution for handling non-time-sensitive tasks.
- Update on Reflection-70B. A detailed post-mortem analysis of the highly anticipated Reflection-70B model revealed issues with its benchmark code, which inflated its performance claims. The team has since corrected these bugs, and while the model’s performance remains impressive, it does not quite reach the originally advertised levels.
- Four-legged robot learns to climb ladders. The proliferation of robots like Boston Dynamics’ Spot has showcased the versatility of quadrupeds. These systems excel at walking up stairs, traversing small obstacles, and navigating uneven terrain. Ladders, however, still present a big challenge, especially given how ever-present they are in factories and other industrial environments where the systems are deployed.
- Braintrust raises $36M Series A. Braintrust, which helps Airtable, Brex, Notion, and Stripe build AI products, has raised $36M in a Series A led by a16z.
- Clout Kitchen raises $4.45M for AI gaming pal that mimics content creators. Clout Kitchen announced today that it has raised $4.45 million in its seed funding round, which it plans to put towards its new creator-powered products and experiences. The first of these is Backseat AI, an AI-powered buddy for League of Legends that the company created with Tyler “Tyler1” Steinkamp — an AI buddy that can take on the aspect of popular gaming content creators. Clout Kitchen plans to use its funding to expand its team and build out its shared internal tech stack.
- AlphaFold wins Nobel Prize in Chemistry. Demis Hassabis, John Jumper, and David Baker were awarded the Nobel Prize in Chemistry for their groundbreaking work in protein folding, particularly through innovations like AlphaFold. Their contributions have significantly advanced the understanding of protein structures and their implications for science and medicine.
- OpenAI reducing dependency on Microsoft data centers. OpenAI is decreasing its reliance on Microsoft’s data centers by acquiring its own compute infrastructure, allowing greater independence in its operations. Simultaneously, Microsoft is reducing its dependence on OpenAI as it develops and competes with its own AI products, signaling a shift in the dynamics of their partnership.
- TikTok parent company ByteDance has a tool that’s scraping the web 25 times faster than OpenAI. TikTok parent company ByteDance is amassing huge volumes of web data way faster than the other major web crawlers. ByteDance may be planning to release its own LLM, and is aggressively using its web crawler, “Bytespider,” to scrape up data to train its models, Fortune reported.
- Sonair takes a cue from dolphins to build autonomous 3D vision without lidar. Ultrasound is perhaps best known as the technology that enables noninvasive body scans, underwater communication, and parking assistance in cars. A young startup called Sonair out of Norway wants to employ it for something else: 3D computer vision used in autonomous hardware applications.
- Tesla’s head of vehicle programs jumps to Waymo ahead of robotaxi reveal. Tesla has lost a top executive to Waymo in the lead-up to the EV maker’s robotaxi unveiling on Thursday.
- Autism ABA Therapy with Llama. Meta shares a use case of its Llama model for medical and therapeutic benefit.
- Uber’s EV ridehailing business is maturing. The company also announced it was adding ChatGPT to its driver app to handle EV questions.
- Amazon’s new AI guides can help shoppers find what they need. The new AI Shopping Guides feature aims to help users find what they need with more informed product suggestions.
- TikTok joins the AI-driven advertising pack to compete with Meta for ad dollars. TikTok’s Smart+ is an AI-powered ad-buying tool designed to automate and optimize ad campaigns, giving marketers the option to selectively utilize its features for enhanced performance. The tool seeks to rival Meta’s Advantage+ by offering streamlined ad management and improved return on investment (ROI). Early results indicate significant gains in ad spend efficiency and conversion rates, positioning TikTok as a strong contender in the digital advertising market.
- OpenAI partners with Cosmopolitan and Elle publisher Hearst. ChatGPT will provide citations and direct links to the company’s content.
- Meta debuts new generative AI tools for creating video-based ads. Meta Platforms Inc. today said it’s rolling out a full-screen video tab on Facebook in recognition of the fact that its users spend more time watching videos than anything else on its platforms.
Resources
- Introducing the Open FinLLM Leaderboard. The Open FinLLM Leaderboard provides a dedicated evaluation platform designed specifically for financial language models. It emphasizes key financial tasks like predicting stock movements, analyzing sentiment, and extracting information from financial reports.
- Infinite-Fractal-Stream: Small Scale Proxy for Scaling-Centric ML. Model testing in the image domain is often constrained by low-quality, small datasets like CIFAR10. This GitHub repository provides a tool that generates infinite, complex fractals in the form of images or videos, offering a new approach for testing models.
- Auto Jobs Applier. A highly viral repository leverages language models to automate the job application process, adding an extra layer of personalization to tailor applications for each position.
- Real-World Benchmarks Make Membership Inference Attacks Fail on Diffusion Models. This study uncovers major weaknesses in existing membership inference attacks (MIAs) used to detect unauthorized data usage in diffusion models. It introduces CopyMark, a more realistic benchmark for assessing MIAs on pre-trained models, providing unbiased datasets and fair evaluation techniques to improve the accuracy and reliability of these attacks.
- ImageFolder: Autoregressive Image Generation with Folded Tokens. ImageFolder is a semantic tokenizer developed to balance the trade-off between image reconstruction accuracy and generation quality in visual generative models, improving the overall performance of these models in both tasks.
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models. Grounded-VideoLLM is a novel Video-Large Language Model (Video-LLM) created to enhance the fine-grained understanding of specific moments in videos. By incorporating a temporal stream and discrete temporal tokens, the model more effectively captures the relationships between frames and timestamps, improving its ability to interpret and analyze detailed video content.
- Autoregressive Action Sequence Learning for Robotic Manipulation. The Chunking Causal Transformer (CCT) is a new autoregressive architecture developed specifically for robotic manipulation tasks. It is designed to improve the model’s ability to process sequential data efficiently, optimizing performance in real-time robotic control and manipulation scenarios.
- FacePoke. FacePoke is a tool designed for rapid editing of faces in both videos and images, allowing users to make quick adjustments and modifications with ease.
- pipeline_parallel.py. A large model training lead at Hugging Face has shared an excellent 200-line example of parallelism built from scratch, demonstrating efficient techniques for distributing computational tasks, which is particularly useful for large-scale model training.
- CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs. As language models become increasingly proficient at writing code, many existing benchmarks are approaching saturation. This paper proposes a more challenging benchmark designed to assess how well models perform on reasoning and code generation tasks, pushing beyond basic code-writing capabilities to evaluate deeper problem-solving skills.
- Intensify. Intensify is a Python package that allows you to colorize text based on intensity values. It provides an easy-to-use interface for applying color gradients to text or background colors in the terminal.
- Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality. JEDi is a new metric built on the Joint Embedding Predictive Architecture (JEPA), designed to enhance evaluation accuracy with fewer samples. It better aligns with human assessments, making it a more robust alternative to the FVD (Fréchet Video Distance) metric for evaluating generative models.
- PRFusion: Toward Effective and Robust Multi-Modal Place Recognition with Image and Point Cloud Fusion. PRFusion and PRFusion++ are multimodal models developed to enhance place recognition in robotics and computer vision. By combining information from multiple sensory inputs, these models improve the accuracy and robustness of place recognition tasks, making them more effective in real-world applications.
- Fine-Tuning CLIP’s Last Visual Projector: A Few-Shot Cornucopia. This paper presents ProLIP, a novel method for adapting vision-language models such as CLIP without adding additional parameters. ProLIP fine-tunes only the final projection matrix of the vision encoder, enabling it to deliver strong performance in few-shot classification tasks while maintaining the model’s efficiency.
- ScienceAgentBench. The benchmark code for the science agent test is designed to evaluate how effectively models can contribute to novel scientific discoveries. It provides a framework for assessing a model’s ability to generate innovative ideas, solve complex scientific problems, and make meaningful advances in various scientific fields.
- Controlled Visual Generation. Controllable AutoRegressive Modeling (CAR) is a novel framework that introduces precise control mechanisms to pre-trained visual autoregressive models. This method enables more refined and targeted image generation by progressively improving control representations, allowing for fine-tuned outputs with reduced computational resources.
- PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners. PredFormer is a newly developed transformer-based method for spatiotemporal predictive learning, offering superior performance in both accuracy and efficiency compared to existing approaches. It excels in tasks that involve predicting changes over time and space, making it a powerful tool for various applications in fields like video analysis, weather forecasting, and robotics.
- GenSim2: Scaling Robotic Data Generation with Multi-modal and Reasoning LLMs. This paper presents an innovative approach to scaling robotic data collection by utilizing an enhanced, high-quality physics simulation dataset. The improved simulation environment enables more efficient data generation for training robots, offering a scalable and cost-effective method to collect large amounts of accurate and diverse data for robotic learning and development.
- Learning Efficient and Effective Trajectories for Differential Equation-based Image Restoration. This project introduces a novel differential equation-based approach for image restoration. By leveraging mathematical models grounded in differential equations, the method enhances the ability to recover and restore degraded or noisy images, providing improved accuracy and performance in image restoration tasks.
- Pixtral 12B. The Mistral team has provided detailed insights into the training process and architecture of their vision-language model, which has demonstrated solid performance. The model incorporates advanced techniques for effectively integrating visual and linguistic data, allowing it to perform well on a variety of tasks that require understanding both images and text. The shared information includes specifics on data preprocessing, model architecture, and the optimization strategies employed during training.
- MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. MLE-bench is a benchmark created to evaluate AI agents’ capabilities in machine learning engineering. It includes a curated selection of 75 Kaggle competitions to test various skills, such as model training, dataset preparation, and optimization. The benchmark aims to assess how well AI agents can handle practical machine learning tasks, providing a comprehensive evaluation of their engineering proficiency.
- Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate. The Modality Integration Rate (MIR) is a new metric designed to evaluate the effectiveness of multi-modal pre-training in Large Vision Language Models. It measures how well different modalities, such as visual and textual data, are integrated during the pre-training process, offering insights into the model’s ability to leverage information from both sources to improve performance on multi-modal tasks.
- Aria: First Open Multimodal Native MoE Model. A highly impressive new vision-language model has been released with open weights, code, and a comprehensive research report. It achieves performance on par with closed models for long video understanding, a challenge that has proven difficult for other open models like Pixtral and Molmo. This advancement represents a significant breakthrough in the field of open-source vision-language models.
- IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation. IterComp is a new framework developed to enhance compositional text-to-image generation by integrating the strengths of multiple advanced diffusion models, including RPG, Stable Diffusion 3, and FLUX. By leveraging these models, IterComp improves the quality and coherence of generated images, especially when handling complex textual prompts that require multiple elements to be composed accurately.
- MatMamba. MatMamba is a novel architecture for sequence processing, building upon the Mamba2 framework by incorporating a Matryoshka-like design. This approach allows a single model to be trained at multiple granularities, enabling the extraction of various smaller, nested submodels. This hierarchical structure enhances flexibility and efficiency, allowing the model to adapt to different levels of complexity and resource constraints.
- O1 replication progress report. Researchers from GAIR and NYU have been investigating the critical algorithmic advancements behind OpenAI’s o1 model’s exceptional performance. In their report, they introduce the concept of “Journey Learning” data, a novel approach that, when used in training, boosts math performance by 8% in absolute terms. This innovation highlights how specific data types can significantly enhance a model’s reasoning and problem-solving abilities.
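The Matryoshka-like design in the MatMamba item above means smaller submodels are nested inside the full model’s weights, so a lighter variant can be extracted by slicing leading dimensions. The sketch below is a toy illustration under my own assumptions (a real extraction would leave input/output dimensions intact and follow the architecture’s layer structure):

```python
import numpy as np

def extract_submodel(weights, frac):
    """Toy Matryoshka extraction: keep the leading `frac` of each hidden dim.

    Because training covers multiple granularities, the leading slice of
    each weight matrix forms a smaller, still-usable nested submodel.
    """
    sub = {}
    for name, W in weights.items():
        rows = max(1, int(W.shape[0] * frac))
        cols = max(1, int(W.shape[1] * frac))
        sub[name] = W[:rows, :cols]
    return sub

# Hypothetical layer names for illustration only.
full = {"in_proj": np.ones((16, 64)), "out_proj": np.ones((64, 16))}
half = extract_submodel(full, 0.5)
print(half["in_proj"].shape)  # (8, 32)
```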
Perspectives
- Nuclear power for AI: what it will take to reopen Three Mile Island safely. As Microsoft strikes a deal to restart a reactor at the notorious power station, Nature talks to nuclear specialists about the unprecedented process.
- ‘In awe’: scientists impressed by latest ChatGPT model o1. The chatbot excels at science, beating PhD scholars on a hard science test. But it might ‘hallucinate’ more than its predecessors.
- Can AI have common sense? Finding out will be key to achieving machine intelligence. The advent of LLMs has reopened a debate about the limits of machine intelligence — and requires new benchmarks of what reasoning consists of.
- How your brain detects patterns in the everyday: without conscious thought. Neurons in certain brain areas integrate ‘what’ and ‘when’ information to discern hidden order in events in real time.
- AI to the rescue: how to enhance disaster early warnings with tech tools. Artificial intelligence can help to reduce the impacts of natural hazards, but robust international standards are needed to ensure best practice.
- Before Mira Murati’s surprise exit from OpenAI, staff grumbled its o1 model had been released prematurely. OpenAI’s accelerated development and safety testing of its latest models, such as GPT-4o and o1, have led to internal friction, resulting in the departure of several senior staff members. The rapid pace of development has raised concerns about the thoroughness of the safety protocols, contributing to tensions within the organization.
- I Quit Teaching Because of ChatGPT. This professor resigned from teaching due to the widespread use of large language models (LLMs) like ChatGPT among students, which they felt undermined academic integrity and the traditional learning process.
- Three Subtle Examples of Data Leakage. This article examines the risks of data leakage in machine learning, showcasing real-world cases where improper data handling resulted in misleading model performance. In one instance, a company incorrectly filtered data by an upper price limit before modeling, while another organization encountered problems by not following a strict chronological split. The key lessons emphasize the critical need for detecting data leakage and understanding its detrimental effects on model accuracy and reliability.
- The real data wall is billions of years of evolution. AI development is encountering a potential obstacle known as the “data wall,” as language models near the limit of available textual data for training. This article challenges the idea of using human analogies to overcome these data constraints, pointing out that human intelligence results from vast amounts of data and long evolutionary processes, which differ fundamentally from AI. While human learning strategies may not directly translate to AI, this doesn’t preclude progress through other modalities, such as multimodal data, or advancements in algorithms that could push AI capabilities further.
- AI will use a lot of energy. That’s good for the climate. AI data centers are significantly increasing the demand for clean, 24/7 energy, prompting tech giants to invest heavily in renewable and nuclear power solutions. This growing demand is expected to accelerate the cost reduction of clean energy technologies, driven by their learning rates. Over time, the energy needs of AI could lead to policy shifts and advancements in clean energy infrastructure, fostering faster adoption and development of sustainable energy sources.
- I want to break some laws too. This article explores the use of an automated data cleaning pipeline inspired by the Minipile method, which prunes datasets to deliver significant performance gains with only a fraction of the original data size. By leveraging techniques such as few-shot prompting and clustering, the approach streamlines dataset refinement for AI training, challenging traditional scaling laws by prioritizing data quality over quantity. The results indicate that using foundational datasets with more refined data can optimize AI model training, reducing resource consumption while boosting performance.
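The chronological-split failure described in the data-leakage item above is easy to avoid with a time-based split, where training data strictly precedes test data. A minimal sketch (function and variable names are my own, for illustration):

```python
import numpy as np

def chronological_split(timestamps, X, y, test_frac=0.2):
    """Split by time so the train set strictly precedes the test set.

    A random split here would leak future information into training,
    inflating measured performance on time-dependent data.
    """
    order = np.argsort(timestamps)
    cut = int(len(order) * (1 - test_frac))
    train_idx, test_idx = order[:cut], order[cut:]
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])

t = np.array([5, 1, 4, 2, 3])          # observation times
X = np.arange(10).reshape(5, 2)        # features
y = np.array([0, 1, 0, 1, 0])          # labels
(Xtr, ytr), (Xte, yte) = chronological_split(t, X, y)
print(len(Xtr), len(Xte))  # 4 1
```

Every training timestamp is at most the earliest test timestamp, which is exactly the invariant a random split breaks.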
Meme of the week
What do you think? Did any of this week’s news capture your attention? Let me know in the comments
If you have found this interesting:
You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news; I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.
or you may be interested in one of my recent articles: