WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES
ML news: Week 19–25 February
Google steps into open-source LLMs, Stable Diffusion 3, and much more this week
The most interesting news, repository, articles, and resources of the week
Check and star this repository where the news will be collected and indexed:
You will find the news first on GitHub. Single posts are also collected here:
Research
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning. Deciding which examples to use when instruction-tuning language models is frequently difficult. This paper proposes an unexpectedly strong baseline: pick the 1,000 examples with the longest responses.
- Extreme Video Compression with Pre-trained Diffusion Models. As diffusion models get more adept at synthesizing images and videos, their extensive “knowledge” of the world can be put to other uses. This study achieved an astounding rate of 0.02 bits per pixel. The secret was to track perceptual similarity along the way and send a fresh frame from the original video whenever reconstruction quality dropped.
- OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset. To train open-source Large Language Models in math that equal the performance of closed-source models, researchers have developed a new dataset called OpenMathInstruct-1. With 1.8 million problem-solution pairings, this innovation paves the way for more competitive and approachable AI systems for math teaching.
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Quantizing the KV cache lets Transformer models consume far less memory at inference time. Quantization is the process of reducing floating-point precision with the least possible quality loss.
- Pushing the Limits of Zero-shot End-to-End Speech Translation. ZeroSwot is a novel approach to speech translation (ST) that addresses data scarcity and the modality gap between text and speech. Using special techniques to train a speech encoder on speech recognition data alone, it can operate with a multilingual translation model.
- Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE). A novel technique called SpLiCE decomposes CLIP’s dense visual representations into sparse, human-interpretable concepts.
- TDViT: Temporal Dilated Video Transformer for Dense Video Tasks. A novel Temporal Dilated Video Transformer (TDViT) has been created to enhance the analysis of tasks involving dense videos, like object detection in videos frame by frame.
- Generative Representational Instruction Tuning. A model that creates embeddings and text has been trained and released by the Contextual team. It performs noticeably better than a single specialized model. With embedding as the output modality, the model offers an intriguing interpretation of the multi-modal trend.
- LoRA+: Efficient Low-Rank Adaptation of Large Models. To improve on the current Low-Rank Adaptation (LoRA) technique for fine-tuning big models, this work introduces LoRA+. By applying multiple learning rates for important process components, LoRA+ achieves improved performance and faster fine-tuning without raising processing loads.
- GaussianObject: Just Taking Four Images to Get A High-Quality 3D Object with Gaussian Splatting. We propose GaussianObject, a framework to represent and render the 3D object with Gaussian splatting, that achieves high rendering quality with only 4 input images.
- MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single to Sparse-view 3D Object Reconstruction. This paper presents a neural architecture MVDiffusion++ for 3D object reconstruction that synthesizes dense and high-resolution views of an object given one or a few images without camera poses.
- ChatterBox: Multi-round Multimodal Referring and Grounding. A vision-language model called ChatterBox performs exceptionally well in multimodal dialogues, particularly in the recently defined job of multimodal multi-round referring and grounding.
- Large language models streamline automated machine learning for clinical studies. A knowledge gap persists between machine learning developers and clinicians. Here, the authors show that the Advanced Data Analysis extension of ChatGPT could bridge this gap and simplify complex data analyses, making them more accessible to clinicians.
- Extracting accurate materials data from research papers with conversational language models and prompt engineering. Efficient data extraction from research papers accelerates science and engineering. Here, the authors develop an automated approach that uses conversational large language models to achieve high precision and recall in extracting materials data.
- GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis. GradSafe is a novel technique that identifies unsafe prompts for large language models without requiring extra training. By examining the gradients of certain safety-critical parameters, it detects unsafe prompts more accurately than existing approaches.
- Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition. A novel technique called Class-Aware Mask-guided (CAM) feature refinement improves text recognition in challenging environments.
- Object Recognition as Next Token Prediction. An innovative approach to object recognition that uses a language decoder. Text tokens are predicted from image embeddings using a customized non-causal attention mask, which makes it possible to sample many labels in parallel efficiently.
- TIER: Text and Image Encoder-based Regression for AIGC Image Quality Assessment. To evaluate the quality of the generated images, TIER makes use of both written prompts and the images that result from them.
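The selection rule behind the “Long Is More” baseline above is simple enough to sketch in a few lines. This is an illustrative reading of the idea, not the authors’ code; the dictionary keys are assumptions:

```python
# Hypothetical sketch: keep the n instruction examples with the longest
# responses, as in the "longest is more" baseline. Field names ("prompt",
# "response") are illustrative, not from the paper's release.

def select_longest(examples, n=1000):
    """Return the n examples whose response text is longest."""
    return sorted(examples, key=lambda ex: len(ex["response"]), reverse=True)[:n]

pool = [
    {"prompt": "a", "response": "short"},
    {"prompt": "b", "response": "a much longer, more detailed answer"},
    {"prompt": "c", "response": "medium length reply"},
]
subset = select_longest(pool, n=2)  # the two longest-response examples
```

The appeal of the baseline is exactly this: no reward model, no scoring pipeline, just a sort.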
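The KV-cache quantization idea behind KVQuant can be illustrated with a toy per-tensor int8 scheme; the paper itself uses more refined per-channel and non-uniform methods, so this is only a sketch of the general mechanism:

```python
# Toy symmetric int8 quantization: map floats to [-127, 127] with a single
# scale, then reconstruct. Real KV-cache schemes are per-channel and
# non-uniform; this only shows why memory shrinks (1 byte vs 4 per value).

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

keys = [0.12, -0.5, 0.33, 0.9, -0.91]   # stand-in for one KV-cache row
q, scale = quantize_int8(keys)
recovered = dequantize_int8(q, scale)    # close to the original floats
```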
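The core of LoRA+ is just assigning a larger learning rate to the LoRA “B” matrices than to the “A” matrices. A minimal sketch, assuming parameters are tagged with `lora_A`/`lora_B` in their names and using an illustrative ratio:

```python
# Sketch of LoRA+-style optimizer groups: B matrices get base_lr * ratio,
# A matrices get base_lr. Names and the ratio value are assumptions for
# illustration, not the paper's exact configuration.

def lora_plus_groups(named_params, base_lr=1e-4, ratio=16):
    """Split (name, param) pairs into two groups with different LRs."""
    group_a = {"params": [], "lr": base_lr}
    group_b = {"params": [], "lr": base_lr * ratio}
    for name, p in named_params:
        (group_b if "lora_B" in name else group_a)["params"].append(p)
    return [group_a, group_b]

params = [("layer0.lora_A", "A0"), ("layer0.lora_B", "B0"), ("layer1.lora_A", "A1")]
groups = lora_plus_groups(params)  # pass to an optimizer that accepts param groups
```

The point is that no extra compute is needed: the same forward/backward pass runs, only the update step differs per group.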
News
- Anthropic takes steps to prevent election misinformation. Called Prompt Shield, the technology, which relies on a combination of AI detection models and rules, shows a pop-up if a U.S.-based user of Claude, Anthropic’s chatbot, asks for voting information. The pop-up offers to redirect the user to TurboVote, a resource from the nonpartisan organization Democracy Works, where they can find up-to-date, accurate voting information.
- OpenAI’s next AI product could be after your job (again). OpenAI is said to be developing AI agents that automate even more complex tasks, though their launch timeline remains unknown. One AI agent is said to take over the customer’s device to perform tasks like transferring data from a document to a spreadsheet, filling out expense reports, and entering them into accounting software. The other AI agent is said to perform more research-oriented, web-based tasks, such as creating itineraries and booking flight tickets.
- Our next-generation model: Gemini 1.5. In fact, we’re ready to introduce the next generation: Gemini 1.5. It shows dramatic improvements across several dimensions and 1.5 Pro achieves comparable quality to 1.0 Ultra while using less computing.
- OpenAI on track to hit $2bn revenue milestone as growth rockets. Thanks in large part to ChatGPT’s enormous success, OpenAI has reached an annual revenue run rate of over $2 billion, making it one of the fastest-growing tech companies.
- Sam Altman wants Washington’s backing for his $7 trillion AI chip venture. The OpenAI CEO is working to secure US government approval for the project as it risks raising national security and antitrust concerns, Bloomberg reported.
- ‘Gemini Business’ and ‘Gemini Enterprise’ plans for Google Workspace are coming. The upcoming changelog — as spotted by Testing Catalog and Dylan Roussel on X/Twitter today — reveals the existence of “Gemini Business” and “Gemini Enterprise” plans. This will give “Google Workspace customers access to one of Google’s most capable AI models, 1.0 Ultra in Gemini, and enterprise-grade data protections.”
- OpenAI Reaches $80 Billion Valuation In Venture Firm Deal, Report Says. OpenAI inked a deal with venture capital firm Thrive Capital that boosted its valuation to $80 billion or more, the New York Times reported, a nearly threefold increase in value from just nine months ago.
- Magic raises $117M to continue building code generation models. We’ve raised $117M to build an AI software engineer.
- SoftBank Founder Masayoshi Son Aims to Raise $100 Billion for New Chip Venture, “Izanagi”. Masayoshi Son, the visionary founder of SoftBank Group Corp., has set his sights on revolutionizing the semiconductor industry with the launch of Izanagi, a groundbreaking chip venture backed by a staggering $100 billion investment.
- Scribe $25M Series B. To further its AI-driven platform, Scribe has secured a Series B fundraising round headed by Redpoint Ventures. This round aims to speed up the generation of visual step-by-step tutorials and enable knowledge exchange between enterprises.
- Amazon AGI Team Says Their AI Is Showing “Emergent Abilities”. “Big Adaptive Streamable TTS with Emergent Abilities” (BASE TTS), a language model created by Amazon AGI researchers, exhibits “state-of-the-art naturalness” in conversational text and demonstrates language skills it wasn’t explicitly trained on.
- Gemma: Introducing new state-of-the-art open models. We’re releasing model weights in two sizes: Gemma 2B and Gemma 7B. Each size is released with pre-trained and instruction-tuned variants. Ready-to-use Colab and Kaggle notebooks, alongside integration with popular tools such as Hugging Face, MaxText, NVIDIA NeMo, and TensorRT-LLM, make it easy to get started with Gemma.
- Reddit has a new AI training deal to sell user content. Over a decade of valuable user content is now for sale as Reddit preps to go public.
- Apple Developing AI Tool to Help Developers Write Code for Apps. Apple is working on an updated version of Xcode that will include an AI tool for generating code, reports Bloomberg. The AI tool will be similar to GitHub Copilot from Microsoft, which can generate code based on natural language requests and convert code from one programming language to another.
- Stable Diffusion 3. Announcing Stable Diffusion 3 in early preview, our most capable text-to-image model with greatly improved performance in multi-subject prompts, image quality, and spelling abilities.
- How Bret Taylor’s new company is rethinking customer experience in the age of AI. The two founders fundamentally see AI agents as a new technology category, providing an entirely new way for customers to interact with brands to improve their overall experience.
- Introducing Phind-70B — closing the code quality gap with GPT-4 Turbo while running 4x faster. We’re excited to announce Phind-70B, our largest and most performant model to date. Running at up to 80 tokens per second, Phind-70B gives high-quality answers for technical topics without leaving users enough time to brew a cup of coffee while they wait. Phind-70B scores 82.3% on HumanEval, beating the latest GPT-4 Turbo (gpt-4-0125-preview) score of 81.1% in our evaluation.
- Marqo Raises $12.5 Million to Help Businesses Build Generative AI Applications. Marqo has raised $12.5 million in a Series A funding round to advance the adoption of its search platform that helps businesses build generative artificial intelligence (AI) applications that are more relevant and up-to-date.
Resources
- minbpe. Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is “byte-level” because it runs on UTF-8 encoded strings.
- GPTScript. GPTScript is a new scripting language to automate your interaction with a Large Language Model (LLM), namely OpenAI. The ultimate goal is to create a fully natural language-based programming experience. The syntax of GPTScript is largely natural language, making it very easy to learn and use.
- QWEN. We opensource our Qwen series, now including Qwen, the base language models, namely Qwen-1.8B, Qwen-7B, Qwen-14B, and Qwen-72B, as well as Qwen-Chat, the chat models, namely Qwen-1.8B-Chat, Qwen-7B-Chat, Qwen-14B-Chat, and Qwen-72B-Chat.
- Sora Reference Papers. A collection of all papers referenced in OpenAI’s “Video generation models as world simulators”.
- repeng. Control vectors are a low-cost means of steering the output of generative models. Compared to LoRA, they are cheaper to train yet can still be fairly powerful. This library makes them simple to build and apply.
- OpenRLHF. A Ray-based implementation of RLHF for Mistral and other Llama-style models. Several PPO stabilization techniques are included to enhance performance.
- 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations. To enhance robot manipulation, the 3D Diffuser Actor blends 3D scene representations with diffusion strategies. Robots are better able to comprehend and engage with their surroundings thanks to this AI-driven method.
- How to jointly tune learning rate and weight decay for AdamW. AdamW is often considered a method that decouples weight decay and learning rate. In this blog post, we show that this is not true for the specific way AdamW is implemented in Pytorch. We also show how to adapt the tuning strategy to fix this: when doubling the learning rate, the weight decay should be halved.
- OpenLLMetry-JS. OpenLLMetry-JS is a set of extensions built on top of OpenTelemetry that gives you complete observability over your LLM application. Because it uses OpenTelemetry under the hood, it can be connected to your existing observability solutions — Datadog, Honeycomb, and others.
- List of GPU clusters for rent. A list of entire GPU clusters that can be rented on an hourly basis.
- Mamba: The Hard Way. A detailed description of how Mamba works.
- New benchmark for large language models. It’s a collection of nearly 100 tests I’ve extracted from my actual conversation history with various LLMs.
- BoCoEL. Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) is 10 times faster with just a few lines of modular code.
- FiT: Flexible Vision Transformer for Diffusion Model. This repo contains PyTorch model definitions, pre-trained weights, and sampling code for our flexible vision transformer (FiT). FiT is a diffusion transformer-based model that can generate images at unrestricted resolutions and aspect ratios.
- RobustVLM. This study presents a novel technique to defend multi-modal models like OpenFlamingo and LLaVA against visual adversarial attacks. By fine-tuning the CLIP visual encoder in an unsupervised way, the authors protect these models against manipulative image attacks, increasing their reliability and security in practical applications without requiring full model retraining.
- HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings. The Stanford language modeling group released the popular Holistic Evaluation of Language Models (HELM) benchmark. They have now created HELM-Instruct, a version for instruction following. It is absolute, open-ended, and multifaceted.
- LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4. We’re excited to release LoRA Land, a collection of 25 fine-tuned Mistral-7b models that consistently outperform base models by 70% and GPT-4 by 4–15%, depending on the task. This collection of specialized fine-tuned models–all trained with the same base model–offers a blueprint for teams seeking to efficiently and cost-effectively deploy highly performant AI systems.
- Multimodal LLM’s Ability to Understand Visual Data. A new tool called ChartX is designed to assess how well multi-modal large language models (MLLMs) can understand and make sense of visual charts.
- A Critical Evaluation of AI Feedback for Aligning Language Models. This work questions the efficacy of combining reinforcement learning from AI feedback with supervised fine-tuning. The more involved two-step pipeline can be outperformed by simply fine-tuning on completions from a more sophisticated model, such as GPT-4.
- MMCSG Dataset. The MMCSG (Multi-Modal Conversations in Smart Glasses) dataset comprises two-sided conversations recorded using Aria glasses, featuring multi-modal data such as multi-channel audio, video, accelerometer, and gyroscope measurements. This dataset is suitable for research in areas like automatic speech recognition, activity detection, and speaker diarization.
- Multi-LoRA inference server. One base model can have many LoRAs hot-swapped onto it using the Lorax inference server. This allows a large variety of model tunes to be served with a significant reduction in memory use.
- GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations. GTBench is a language-driven environment, evaluating the strategic reasoning limitations of LLMs through game-theoretic tasks. GTBench is built on top of OpenSpiel, supporting 10 widely-recognized games
- CrewAI. A library called CrewAI is available for creating and managing AI agents that make use of Replit and LangChain. It offers an easy-to-integrate modular setup comprising tasks, agents, crews, and tools for a variety of applications. LangSmith improves performance insights into non-deterministic LLM calls while streamlining the debugging process.
- gemma.cpp. gemma.cpp is a lightweight, standalone C++ inference engine for the Gemma foundation models from Google.
- MMedLM. The official codes for “Towards Building Multilingual Language Model for Medicine”.
- LLM Evaluation Metrics for Labeled Data. How to measure the performance of LLM applications with ground truth data.
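The byte-level BPE loop that minbpe implements can be reduced to two small functions: count adjacent token pairs, then merge the most frequent pair into a new token id. A minimal sketch in that spirit (not minbpe’s actual API):

```python
# Minimal byte-level BPE training step: new token ids start at 256,
# just past the raw byte range, and each merge shortens the sequence.

def get_pair_counts(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # byte-level: run on UTF-8 bytes
counts = get_pair_counts(ids)
top = max(counts, key=counts.get)          # most frequent adjacent pair
merged = merge(ids, top, 256)              # replace it with a new token id
```

A full tokenizer just repeats this until the desired vocabulary size is reached, recording each (pair, new_id) merge for later encoding.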
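The control-vector idea behind repeng can be sketched as follows: derive a direction as the difference of mean hidden states between contrastive prompt sets, then add it, scaled, to the model’s hidden state at inference. Plain lists stand in for real hidden states here; this is the concept, not the library’s API:

```python
# Illustrative control vector: direction = mean(positive) - mean(negative),
# applied additively with a strength coefficient. Real usage extracts these
# states from a transformer layer; here they are toy 2-d vectors.

def control_vector(positive_states, negative_states):
    dim = len(positive_states[0])
    pos_mean = [sum(s[d] for s in positive_states) / len(positive_states) for d in range(dim)]
    neg_mean = [sum(s[d] for s in negative_states) / len(negative_states) for d in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def apply_control(hidden, vector, strength=1.0):
    return [h + strength * v for h, v in zip(hidden, vector)]

vec = control_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
steered = apply_control([0.5, 0.5], vec, strength=0.5)
```

This is why control vectors are cheap relative to LoRA: there is no gradient training at all, only averaging activations and one extra addition per forward pass.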
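The AdamW coupling described in the tuning post above is easy to check numerically: in the PyTorch-style update the decay term subtracted from a weight is lr * weight_decay * w, so only the product lr * weight_decay matters, and doubling the learning rate while halving the weight decay leaves it unchanged:

```python
# Worked check of the lr/weight-decay coupling in PyTorch-style AdamW:
# the decay applied each step is lr * weight_decay * w, so keeping the
# product constant keeps the regularization constant.

def decay_step(w, lr, weight_decay):
    """Only the weight-decay part of a PyTorch-style AdamW update."""
    return w - lr * weight_decay * w

w = 1.0
a = decay_step(w, lr=1e-3, weight_decay=0.1)   # product: 1e-4
b = decay_step(w, lr=2e-3, weight_decay=0.05)  # product: 1e-4 again
```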
Perspectives
- The data revolution in venture capital. Investors, data scientists, and tool builders leading the data-driven future of venture capital.
- The Three C’s: Creativity, Collaboration, and Communication. The way we communicate, collaborate, and complete creative projects has changed significantly since the invention of computing. With AI, we are beginning to witness the start of another major shift, and we underestimate how significant it will be. Businesses that integrate AI into their products from the start will have a significant edge over those that bolt it onto existing products later.
- Inside OpenAI: Logan Kilpatrick (head of developer relations). Have you ever wondered how OpenAI develops and innovates so quickly? In this podcast, Logan Kilpatrick, head of developer relations at OpenAI, discusses the company’s decision-making structure for product launches, its high agency and urgency, and its distinct culture.
- Mind-reading devices are revealing the brain’s secrets. Implants and other technologies that decode neural activity can restore people’s abilities to move and speak — and help researchers understand how the brain works.
- Generative AI’s environmental costs are soaring — and mostly secret. First-of-its-kind US bill would address the environmental costs of the technology, but there’s a long way to go.
- Strategies for an Accelerating Future. Recent advances such as Google’s Gemini, with a context window of over a million tokens, and Groq’s hardware, which enables almost instantaneous responses from GPT-3.5-class models, represent a significant step forward in practical AI applications and highlight the pressing need for leaders to understand and adapt to the rapidly changing AI landscape.
- How to lose at Generative AI! Despite the excitement, generative AI is likely to let most startups down, since it benefits established players with data advantages, entrenched workflows, and the capacity to integrate AI without significant system changes. A difficult road lies ahead for startups hoping to make a significant impact in the generative AI space, even with venture capital flooding in. By concentrating on prompt engineering and UX improvements at the workflow layer, these startups are essentially preparing the market for incumbents who can readily adopt the same AI innovations into their dominant platforms.
- Stockholm declaration on AI ethics: why others should sign. The use of artificial intelligence (AI) in science has the potential to do both harm and good. As a step towards preventing harm, we have prepared the Stockholm Declaration on AI for Science.
- This is why the idea that AI will just augment jobs, never replace them, is a lie! AI will automate labor in certain areas. The debate thus far has been divided: will increased efficiency let the same number of human workers accomplish more, or will fewer workers be needed? This article compares the effects of technology on manufacturing, agriculture, and the contemporary knowledge worker.
- LLM evaluation at scale with the NeurIPS Large Language Model Efficiency Challenge. After a year of breakneck innovation and hype in the AI space, we have now moved sufficiently beyond the peak of the hype cycle to start asking a critical question: are LLMs good enough yet to solve all of the business and societal challenges we are setting them up for?
Meme of the week
What do you think about it? Some news that captured your attention? Let me know in the comments
If you have found this interesting:
You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.