WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES
ML news: Week 4–10 March
Musk sues OpenAI (which fights back), Claude 3 and Le Chat released, and much more
The most interesting news, repositories, articles, and resources of the week
Check and star this repository where the news will be collected and indexed:
You will find the news first on GitHub. Single posts are also collected here:
Research
- HyperAttention: Long-context Attention in Near-Linear Time. HyperAttention approximates attention in near-linear time for long sequences. It has been widely speculated, though never confirmed, that techniques of this kind are behind Gemini’s remarkable 1 million+ token context window.
- Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning. This study attempts a theoretical explanation for the success of muP (maximal update parametrization) hyperparameter transfer. According to the authors, the largest eigenvalue of the training loss Hessian is unaffected by the network’s width or depth.
- WebArena: A Realistic Web Environment for Building Autonomous Agents. The community is excited by the prospect of agents handling a range of digital tasks, but even the most advanced general-purpose models struggle on jobs that humans complete more than 70% of the time. It is becoming clear that these tasks may require purpose-trained models.
- Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models. Smooth Diffusion is a new method that addresses latent-space smoothness in text-to-image diffusion models: small changes in the input produce steady, gradual changes in the generated image.
- Rethinking Inductive Biases for Surface Normal Estimation. A technique called DSNIE significantly enhances monocular surface normal estimation, which finds use in various computer graphics fields.
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition. CricaVPR improves visual place recognition by modeling the correlations between multiple images, even when they are captured under different conditions.
- Empowering Large Language Model Agents through Action Learning. Investigates open-action learning for language agents through an iterative strategy that creates and improves actions expressed as Python functions. On each iteration, the proposed framework (LearnAct) revises and expands the available actions based on execution feedback, broadening the action space and making actions more effective. Tested on robotic planning and AlfWorld environments, LearnAct improves agent performance on AlfWorld by 32% compared to ReAct+Reflexion.
- PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval. Shows how to combine several techniques with LLMs, including retrieval augmentation, fine-tuning, and tool use. Although the framework is applied to urban and spatial planning, many of the insights and practical tips carry over to other domains.
- Evo: Long-context modeling from molecular to genome scale. Introducing Evo, a long-context biological foundation model based on the StripedHyena architecture that generalizes across the fundamental languages of biology: DNA, RNA, and proteins. Evo is capable of both prediction tasks and generative design, from molecular to whole genome scale (over 650k tokens in length). Evo is trained at a nucleotide (byte) resolution, on a large corpus of prokaryotic genomic sequences covering 2.7 million whole genomes.
- Resonance RoPE: Improving Context Length Generalization of Large Language Models. Resonance RoPE is a new method that helps LLMs understand and generate text in sequences longer than those they were trained on. It improves on the standard Rotary Position Embedding (RoPE) technique, boosting performance on long texts while using less compute.
- The All-Seeing Project V2: Towards General Relation Comprehension of the Open World. The All-Seeing Project V2 introduces the ASMv2 model, which blends text generation, object localization, and understanding the connections between objects in images.
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark. A new dataset named GPQA poses a formidable challenge: 448 difficult multiple-choice questions covering physics, chemistry, and biology. Even domain experts struggle, scoring only about 65% accuracy, while non-experts reach just 34%. Advanced AI systems such as GPT-4 manage only about 39%. The dataset is intended to support methods for overseeing AI outputs on hard scientific problems.
- SURE: SUrvey REcipes for building reliable and robust deep networks. SURE integrates multiple techniques to make deep neural network uncertainty estimates more reliable, particularly for image classification.
- Stable Diffusion 3: Research Paper. Based on human preference evaluations, Stable Diffusion 3 outperforms state-of-the-art text-to-image systems such as DALL·E 3, Midjourney v6, and Ideogram v1 in typography and prompt adherence. The new Multimodal Diffusion Transformer (MMDiT) architecture uses separate sets of weights for image and language representations, which improves text understanding and spelling compared to earlier Stable Diffusion versions.
- Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents. Language models are now so good at answering questions that most existing benchmarks are saturated. ‘Researchy’ questions are a new breed of open-ended questions that require several steps to answer. This dataset is sourced from search engine queries and includes cases where GPT-4 struggled to respond.
- UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control. UniCtrl is a new method for improving motion quality and semantic coherence in videos produced by text-to-video models. It uses motion injection and cross-frame self-attention to make videos more coherent and realistic without any additional training.
- VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT. VTG-GPT is a GPT-based technique that can precisely identify specific video segments from natural language queries without any fine-tuning or training.
- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training. MobileCLIP matches the performance of OpenAI’s original CLIP model while running seven times faster, making it suitable for a variety of on-device language and vision tasks.
- Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures. Vision-RWKV provides an effective solution for high-resolution image processing by modifying the RWKV architecture from NLP for use in vision challenges.
- Design2Code: How Far Are We From Automating Front-End Engineering? Turning screenshots of a design into code is hard. This study proposes an 18B-parameter model as a baseline, and its evaluations suggest we are close to handling simple designs; GPT-4V-generated code is sometimes preferred over code written by humans.
- MathScale: Scaling Instruction Tuning for Mathematical Reasoning. Researchers used synthetic data to generate two million math word problems. A 7B model trained on this data performed well compared to the most advanced large language models.
- Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos. The KEPP system offers a fresh approach to planning and executing complex tasks. Using a probabilistic knowledge network, the model can order actions logically to accomplish a goal.
- KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. KnowAgent improves the planning abilities of large language models by incorporating explicit action knowledge, guiding LLMs along more sensible planning trajectories and improving performance on challenging tasks.
- tinyBenchmarks: evaluating LLMs with fewer examples. This paper investigates strategies to reduce the number of evaluations needed to assess LLM performance on several key benchmarks, showing that as few as 100 examples from popular benchmarks can give a reliable estimate of a model’s performance (a small illustrative sketch follows this list).
- 3D Diffusion Policy. DP3 is a new imitation-learning method that teaches robots complex skills efficiently by combining diffusion policies with 3D visual representations.
- Co-LLM: Learning to Decode Collaboratively with Multiple Language Models. An approach that lets multiple large language models collaborate by taking turns generating text token by token. This lets each model contribute its distinct strengths and expertise across tasks such as instruction following, domain-specific question answering, and reasoning (a toy sketch of token-level collaboration also appears after this list).
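As referenced in the tinyBenchmarks entry above, here is a minimal, illustrative sketch of the underlying idea: estimate a model’s benchmark accuracy from a random subsample of roughly 100 items and attach a simple binomial confidence interval. The `examples` list and `grade` callback are hypothetical placeholders, and the paper itself uses smarter example selection than plain random sampling.

```python
import math
import random

def estimate_accuracy(examples, grade, k=100, seed=0):
    """Estimate benchmark accuracy from a small random subset.

    examples: list of benchmark items (hypothetical placeholder)
    grade:    callable(item) -> bool, True if the model answers correctly
    """
    random.seed(seed)
    sample = random.sample(examples, min(k, len(examples)))
    acc = sum(grade(x) for x in sample) / len(sample)
    stderr = math.sqrt(acc * (1 - acc) / len(sample))  # binomial standard error
    return acc, (acc - 1.96 * stderr, acc + 1.96 * stderr)  # ~95% confidence interval
```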
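And, as referenced in the Co-LLM entry, a toy sketch of token-level collaboration between two causal language models. This is not the paper’s method, which learns when one model should defer to another; the naive version below simply alternates greedy decoding steps between two small models (gpt2 and distilgpt2, chosen only because they share a vocabulary).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
models = [AutoModelForCausalLM.from_pretrained(name).eval()
          for name in ("gpt2", "distilgpt2")]  # both use the GPT-2 vocabulary

ids = tok("Collaborative decoding lets two models", return_tensors="pt").input_ids
with torch.no_grad():
    for step in range(30):
        model = models[step % 2]                        # naive alternation, not learned deferral
        logits = model(input_ids=ids).logits[:, -1, :]  # next-token distribution
        next_id = logits.argmax(dim=-1, keepdim=True)   # greedy pick
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0], skip_special_tokens=True))
```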
News
- AI-generated images of Trump with Black voters being spread by supporters. No evidence to tie fake images, including one created by Florida radio host, to Trump campaign, BBC Panorama investigation finds
- Elon Musk sues OpenAI over AI threat. Musk claims OpenAI is no longer truly open, pointing to the closed-source release of what he characterizes as artificial general intelligence technology developed under its partnership with Microsoft.
- OpenAI wants to make a walking, talking humanoid robot smarter. Figure’s founder Brett Adcock says a new partnership with OpenAI could help its robots hold conversations and learn from their mistakes over time.
- MagicLab’s humanoid can toast marshmallows, fold clothes, and dance. Miniature high-torque servo actuators combined with sensitive multi-dimensional pressure sensors allowed the team to build an exceptionally dexterous hand for its humanoid robot, MagicBot.
- Amazon to spend $1 billion on startups that combine AI with robots. Amazon’s $1 billion industrial innovation fund is to step up investments in companies that combine artificial intelligence and robotics, as the e-commerce giant seeks to drive efficiencies across its logistics network.
- Claude 3 released. Anthropic has trained three new Claude 3 models, the best of which beats the benchmark scores GPT-4 has publicly reported. It is multimodal and excels at vision tasks, and, notably, its coding ability has improved substantially in this release.
- ChatGPT can read its answers out loud. OpenAI’s new Read Aloud feature for ChatGPT could come in handy when users are on the go by reading its responses in one of five voice options out loud to users. It is now available on both the web version of ChatGPT and the iOS and Android ChatGPT apps.
- Adobe reveals a GenAI tool for music. Adobe unveiled Project Music GenAI Control, a platform that can generate audio from text descriptions (e.g. “happy dance,” “sad jazz”) or a reference melody and let users customize the results within the same workflow.
- OpenAI fires back at Elon Musk in legal fight over breach of contract claims. ChatGPT maker releases emails in support of claim businessman backed plan to create for-profit unit
- OpenAI and Elon Musk. In response to Elon Musk’s complaint, OpenAI provided screenshots of emails between Elon Musk, Greg Brockman, Sam Altman, and Ilya Sutskever, as well as their version of events. According to the receipts, Musk thought there was little hope for OpenAI to succeed and agreed that some models should be closed-source.
- Perplexity AI Reportedly Raising Additional Money At Significantly Higher Valuation Cap Than $520M. Perplexity AI, a rising star in the field of artificial intelligence, is reportedly in discussions to secure additional funding at a valuation significantly higher than its previous round.
- Le Chat. Using its Mistral models, Mistral AI has introduced ‘le Chat Mistral,’ a new multilingual conversational assistant with an enterprise edition for companies.
- Neuralink brain chip: advance sparks safety and secrecy concerns. Elon Musk announced this week that his company’s brain implant has allowed a person to move a computer mouse with their mind.
- Ex-Google engineer arrested for alleged theft of AI secrets for Chinese firms. Linwei Ding, facing four counts of theft of trade secrets, is accused of transferring confidential information to his personal account
- Mistral x Snowflake. Snowflake, the Data Cloud company, and Mistral AI, one of Europe’s leading providers of AI solutions, today announced a global partnership to bring Mistral AI’s most powerful language models directly to Snowflake customers in the Data Cloud.
- Moondream 2 small vision language model. Moondream is a tiny vision-language model built on SigLIP and Phi-2. This second release, licensed for commercial use, substantially improves benchmark performance. It is well suited to describing images and running on low-end hardware.
- Driverless startup Waymo to test self-driving vehicles with no human driver in Austin. Autonomous vehicle company Waymo will begin testing driverless cars, with no human behind the wheel, in Austin, starting Wednesday.
- Google brings Stack Overflow’s knowledge base to Gemini for Google Cloud. Developer Q&A site Stack Overflow is launching a new program today that will give AI companies access to its knowledge base through a new API, aptly named OverflowAPI.
- Brave’s Leo AI assistant is now available to Android users. Brave is launching its AI-powered assistant, Leo, to all Android users. The assistant allows users to ask questions, translate pages, summarize pages, create content, and more. The Android launch comes a few months after Brave first launched Leo on desktop. Brave says Leo will be available on iOS devices in the coming weeks.
- Inflection-2.5. Inflection has introduced a new model to power Pi, its personal assistant. The model posts impressive reasoning benchmark scores and reaches roughly 94% of GPT-4’s average performance while, Inflection claims, requiring only 40% of the training compute. The post also offers an intriguing statistic: an average conversation with Pi lasts 33 minutes.
- Cohere and Accenture Collaborate to Accelerate Enterprise AI Adoption. Cohere and Accenture are working together to bring Cohere’s embedding technology to more than 9,000 enterprise clients.
- Microsoft’s Mistral deal beefs up Azure without spurning OpenAI. Microsoft investing in Mistral puts the focus on its Azure model offerings.
Resources
- 2.4x faster Gemma + 58% less VRAM. You can now finetune Gemma 7B 2.43x faster than Hugging Face + Flash Attention 2 while using 57.5% less VRAM. Compared to vanilla Hugging Face, Unsloth is 2.53x faster and uses 70% less VRAM.
- DUSt3R. Take a few photos of a scene and this project reconstructs them into a 3D representation, exported in GLB format for use in 3D applications.
- Datasets for Large Language Models: A Comprehensive Survey. An extensive (more than 180 pages) review and analysis of LLM datasets.
- Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding — A Survey. An overview of LLMs for tabular data tasks, covering key techniques, metrics, datasets, models, and optimization strategies; it also discusses open challenges and suggests directions for future work.
- Using Claude 3 Opus for video summarization. In a long-context test, Andrej Karpathy posed a challenge: turn one of his recent long videos into a blog article. Claude 3, with the help of some data pre-processing, completed the task, and the resulting post is excellent and engaging.
- Dual-domain strip attention for image restoration. The dual-domain strip attention mechanism is a new technique that substantially improves image restoration tasks.
- Open-Sora-Plan. This project aims to reproduce Sora (OpenAI’s text-to-video model) with limited resources, and its authors hope the whole open-source community will contribute.
- ML system design: 300 case studies to learn from. We put together a database of 300 case studies from 80+ companies that share practical ML use cases and learnings from designing ML systems.
- orca-math-word-problems-200k. This dataset contains ~200K grade school math word problems. All answers were generated using GPT-4 Turbo on Azure. See Orca-Math: Unlocking the Potential of SLMs in Grade School Math for details on how the dataset was constructed.
- mlx-swift-examples. Apple created the MLX framework for training AI models on Macs. This repository demonstrates how to use Swift for model training on mobile devices; it can even train an MNIST classifier directly on an iPhone.
- Text Clustering. A free and open-source text clustering tool that makes it simple and fast to embed, cluster, and semantically label clusters. On 100k samples, the full pipeline runs in 10 minutes (a minimal embed-cluster-label sketch appears after this list).
- EasyLM. Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, finetuning, evaluating, and serving LLMs in JAX/Flax. EasyLM can scale LLM training to hundreds of TPU/GPU accelerators by leveraging JAX’s pjit functionality.
- You can now train a 70b language model at home. Today, we’re releasing Answer.AI’s first project: a fully open-source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090). This system, which combines FSDP and QLoRA, is the result of a collaboration between Answer.AI, Tim Dettmers (U Washington), and Hugging Face’s Titus von Koeller and Sourab Mangrulkar.
- Training Models at Scale. The goal of this tutorial is to provide a comprehensive overview of techniques and strategies used for scaling deep learning models, along with a hands-on guide to implementing these strategies from scratch in JAX with Flax using shard_map (a tiny shard_map example appears after this list).
- Genstruct 7B. Genstruct 7B is an instruction-generation model, designed to create valid instructions given a raw text corpus. This enables the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus.
- Fructose. Fructose is a Python package to create a dependable, strongly typed interface around an LLM call.
- Efficient Multi-Head Attention Implementations. Different implementations of the multi-head attention module used in contemporary LLMs vary in speed by more than ten times. This notebook lists a handful and compares how well they perform (a small comparison sketch appears after this list).
- US regulators investigate whether OpenAI investors were misled, say reports. Internal communications from CEO Sam Altman reportedly under scrutiny in SEC inquiry
- Microsoft introduces Copilot AI chatbot for finance workers in Excel and Outlook. Microsoft is launching a Copilot for Finance, which it said will be able to perform a handful of common role-specific actions in Excel and Outlook.
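As promised in the Text Clustering entry above, here is a minimal embed-cluster-label sketch in the spirit of that project. It is not the repository’s actual code; the embedding model name, cluster count, and toy corpus are arbitrary choices, and the final “labeling” step is reduced to printing cluster members rather than asking an LLM to name each cluster.

```python
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "How do I fine-tune a language model?",
    "Best GPUs for deep learning in 2024",
    "Recipe for a quick weeknight pasta",
    "Slow-cooker chili for a crowd",
]

# 1) Embed: map each text to a dense vector.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(texts, normalize_embeddings=True)

# 2) Cluster: group similar vectors (2 clusters for this toy corpus).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

# 3) Label: here we just print cluster membership; a real pipeline would
#    ask an LLM to name each cluster from its most representative texts.
for cluster_id, count in sorted(Counter(labels).items()):
    members = [t for t, l in zip(texts, labels) if l == cluster_id]
    print(f"cluster {cluster_id} ({count} items): {members}")
```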
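Also as referenced above, a tiny JAX shard_map sketch in the spirit of the scaling tutorial: it shards a batch across whatever local devices are present and computes a replicated loss value. This is purely an illustration of the API under the assumption that the leading batch dimension divides evenly across devices, not the tutorial’s own code.

```python
from functools import partial

import numpy as np
import jax
import jax.numpy as jnp
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

# One-axis device mesh over all local devices (works even with a single device).
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

@partial(shard_map, mesh=mesh, in_specs=(P("data"), P("data")), out_specs=P())
def mean_squared_error(preds, targets):
    # Each device sees only its shard of the batch; pmean averages across shards,
    # so the returned scalar is replicated and matches out_specs=P().
    local = jnp.mean((preds - targets) ** 2)
    return jax.lax.pmean(local, axis_name="data")

preds = jnp.ones((8, 4))    # batch of 8 must be divisible by the device count
targets = jnp.zeros((8, 4))
print(mean_squared_error(preds, targets))  # 1.0
```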
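Finally, a small sketch of the kind of comparison the attention notebook makes: a hand-written scaled dot-product attention versus PyTorch 2.x’s fused `torch.nn.functional.scaled_dot_product_attention`, which usually dispatches to a much faster kernel. The shapes and the crude wall-clock timing are arbitrary and only meant to show the shape of the experiment.

```python
import math
import time

import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Explicit softmax(QK^T / sqrt(d)) V, materializing the full attention matrix.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

# (batch, heads, sequence length, head dim)
q, k, v = (torch.randn(4, 8, 512, 64) for _ in range(3))

for name, fn in [("naive", naive_attention),
                 ("fused (F.scaled_dot_product_attention)", F.scaled_dot_product_attention)]:
    start = time.perf_counter()
    out = fn(q, k, v)
    print(f"{name}: {time.perf_counter() - start:.3f}s, output shape {tuple(out.shape)}")
```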
Perspectives
- On the Societal Impact of Open Foundation Models. A position paper on open foundation models that weighs their benefits, risks, and impacts, proposes a framework for risk analysis, and explains why, in certain situations, the marginal risk of these models is low. It closes with a more sober assessment of open foundation models’ effects on society.
- Towards Long Context RAG. The one-million-token context window that Google’s Gemini 1.5 Pro has brought to the AI community has sparked a debate about the future viability of retrieval-augmented generation (RAG).
- Aggregator’s AI Risk. The impact of the Internet, especially through Aggregators like Google and Meta, is comparable to that of the printing press on the spread of knowledge and the rise of nation-states. The rise of generative AI, however, tests the Aggregator model by offering individualized answers that embed particular worldviews. This could undermine the universal appeal that underpins Aggregator economics and may force a shift toward personalized AI to preserve that dominance.
- Is Synthetic Data the Key to AGI? The caliber of training data has a major impact on how effective large language models are. By 2027, projections indicate that there will be a shortage of high-quality data. A possible answer to this problem is synthetic data generation, which could change internet business models and emphasize the significance of fair data access and antitrust laws.
- AI Research Internship Search as a CS PhD Student. Tips and thoughts from my relatively successful summer research internship hunt during my third year of a Computer Science PhD.
- How AI Could Disrupt Hollywood. New platforms and tools may allow a person to create a feature-length film from their living room. But can they really compete with the studios?
- Training great LLMs entirely from ground zero in the wilderness as a startup. In a blog post, Reka co-founder Yi Tay, known for his candid commentary on GPU clusters, detailed the experience of building very capable language models outside Google. The main obstacles were hardware instability and cluster issues, compounded by immature software.
- Claude 3 Is The Most Human AI Yet. Anthropic’s Claude 3, a large language model similar to GPT-4, is notable not so much for its cost-effectiveness or benchmark test results as for its distinctly human-like, creative, and naturalistic interaction quality. This represents a major breakthrough in AI’s capacity to collaborate imaginatively with writers.
- Licensing AI Means Licensing the Whole Economy. AI is a broad family of statistical techniques, and it would be impractical to police its use across every organization, so regulating AI as if it were a tangible commodity is misguided. Given AI’s imminent economic ubiquity, targeted regulation of specific misuses, akin to current approaches to programming or email abuse, is more likely to succeed.
- Is ChatGPT making scientists hyper-productive? The highs and lows of using AI. Large language models are transforming scientific writing and publishing. However, the productivity boost that these tools bring could have a downside.
- Artificial intelligence and illusions of understanding in scientific research. Why are AI tools so attractive and what are the risks of implementing them across the research pipeline? Here we develop a taxonomy of scientists’ visions for AI, observing that their appeal comes from promises to improve productivity and objectivity by overcoming human shortcomings.
- AI will likely increase energy use and accelerate climate misinformation — report. Claims that artificial intelligence will help solve the climate crisis are misguided, warns a coalition of environmental groups
- We Need Self-Driving Cars. Anyone rooting against self-driving cars is cheering for tens of thousands of deaths, year after year. We shouldn’t be burning self-driving cars in the streets. We should be celebrating…
- Subprime Intelligence. Significant problems in OpenAI’s Sora demonstrate the limits of generative AI’s comprehension. The technology presents both practical obstacles and transformative possibilities, as seen in its high compute requirements and its potential impact on the creative industry.
- Sora, Groq, and Virtual Reality. A few years ago, Facebook’s push into the metaverse looked misguided, and the idea of the metaverse seemed like fiction out of an Ernest Cline novel. Things feel different now. Groq’s deterministic chips speed up machine-learning inference, while Sora generates intricate video scenes. Together, these developments bring us a step closer to real-time video simulation and full-fledged virtual reality.
- AI Is Like Water. Technology alone no longer gives GenAI companies a competitive advantage. Because the basic product is virtually the same everywhere, GenAI resembles bottled water: the real differentiators have to come from distribution, user experience, perceived customer value, branding, and marketing.
Meme of the week
What do you think? Did any of this news catch your attention? Let me know in the comments
If you have found this interesting:
You can look for my other articles or connect with me on LinkedIn, where I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.
or you may be interested in one of my recent articles: