WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES
ML news: Week 22–28 January
AI phones are coming, Google Chrome gains AI features and much more
The most interesting news, repositories, articles, and resources of the week
Check and star this repository where the news will be collected and indexed:
You will find the news first on GitHub. Single posts are also collected here:
Research
- OMG-Seg: Is One Model Good Enough For All Segmentation? OMG-Seg can handle over ten different segmentation tasks in one framework, including image-level and video-level segmentation, interactive segmentation, and open-vocabulary segmentation. According to the authors, this is the first model to unify these four directions.
- Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation. Researchers created BriVIS, an approach that enhances open-vocabulary Video Instance Segmentation (VIS). By modeling object motion across video frames as Brownian Bridges, BriVIS preserves instance context and achieves a more precise alignment between text and video.
- Encoder-minimal and Decoder-minimal Framework for Remote Sensing Image Dehazing. A novel framework called RSHazeNet was created to eliminate haze from remote-sensing photos. The tool makes use of cutting-edge modules to enhance image comprehension and detail preservation, improving clarity and analytical use.
- Supervised Fine-tuning in turn Improves Visual Foundation Models. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing, such as instruction tuning, this work explores the potential of fine-grained SFT for enhancing vision foundation models after their pretraining. It proposes a two-stage method, ViSFT (Vision SFT), to unleash the fine-grained knowledge of vision foundation models.
- Group Anything with Radiance Fields. Hierarchical grouping in 3D is achieved by training a scale-conditioned affinity field from multi-level masks.
- DiverseEvol. DiverseEvol is an efficient instruction-tuning method that lets the model itself iteratively sample training subsets to improve its own performance, without any external supervision from humans or more advanced LLMs.
- Unleashing the Power of Large-Scale Unlabeled Data. Depth Anything is trained jointly on 1.5M labeled images and 62M+ unlabeled images, yielding the most capable Monocular Depth Estimation (MDE) model to date.
- Prompt Highlighter: Interactive Control for Multi-Modal LLMs. Researchers present “Prompt Highlighter,” a technique that lets users highlight specific portions of prompts to steer text generation in multi-modal language models.
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer. MM-Interleaved is a novel generative model that excels at processing and producing interleaved image-text data.
- Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation. A different preference optimization method applied to machine translation; for this task, it is more data-efficient than DPO. Crucially, the objective discourages the model from producing translations that are adequate but not perfect, allowing it to perform competitively on WMT.
- WARM: On the Benefits of Weight Averaged Reward Models. In RLHF, reward models are used to approximate human preferences, but the model being aligned frequently “hacks the reward” and performs poorly. WARM mitigates this by averaging the weights of several reward models that remain linearly mode-connected; the resulting aligned model is preferred 79% of the time over one aligned with a single reward model. Although weight averaging may act mainly as regularization, model merging has proven an effective step in the general language model training pipeline (a minimal weight-averaging sketch follows this list).
- Benchmarking Large Multimodal Models against Common Corruptions. This technical study introduces MMCBench, a new benchmark created to evaluate large multimodal models’ (LMMs) consistency and dependability on a variety of tasks, including text-to-image and speech-to-text. It covers more than 100 well-known models with the goal of helping readers better comprehend how various AI systems function in practical situations.
- Predicting multiple conformations via sequence clustering and AlphaFold2. AlphaFold2 has revolutionized structural biology by accurately predicting single structures of proteins. However, a protein’s biological function often depends on multiple conformational substates, and disease-causing point mutations often cause population changes within these substates. This work uses sequence clustering with AlphaFold2 to predict such alternative conformations.
- HEDNet: A Hierarchical Encoder-Decoder Network for 3D Object Detection in Point Clouds. HEDNet is a novel encoder-decoder network that aims to improve autonomous vehicles’ ability to detect 3D objects by tackling the problem of sparse point distribution in 3D scenes.
- Prompt Pool based Class-Incremental Continual Learning for Dialog State Tracking. This project proposes a novel prompt-pool approach to dialog state tracking that does not need task IDs during testing, allowing it to adapt to changing user requirements.
- DittoGym: Learning to Control Soft Shape-Shifting Robots. A major problem with soft robotics is the large control space. This study introduces a simulator with a variety of tasks for controlling soft, shape-shifting robots (the “dittos”), along with several strong baselines, visualization tools, and utilities.
- SGTR+: End-to-end Scene Graph Generation with Transformer. Researchers have created a novel technique that makes scene graph generation faster and more efficient. Their transformer-based approach aims to enhance the model’s understanding of the many parts of an image and their interconnections, resulting in better performance on complex tasks.
- DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data. Image similarity systems score how alike two photographs are. This study builds on earlier approaches, mainly by training on synthetic data and human preference judgments.
- SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation. A model called SegMamba is intended for 3D medical image segmentation. In comparison to the Transformer architecture, it provides a more effective option.
- SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation. Researchers have created the Shared Feature Calibration (SFC) method to improve weakly supervised semantic segmentation.
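As a companion to the WARM entry above, here is a minimal sketch of the core weight-averaging step in PyTorch. It is illustrative rather than the paper’s actual code: in WARM the averaged models are reward models fine-tuned from a shared initialization so that their weights stay linearly mode-connected, and the function and variable names below are hypothetical.

```python
import copy
import torch
import torch.nn as nn

def average_reward_models(models: list[nn.Module]) -> nn.Module:
    """Uniformly average the parameters of several reward models.

    Assumes all models share the same architecture and were fine-tuned from a
    common initialization, so a linear combination of their weights remains a
    sensible model (linear mode connectivity).
    """
    reference = models[0].state_dict()
    avg_state = {
        key: torch.stack([m.state_dict()[key].float() for m in models])
        .mean(dim=0)
        .to(reference[key].dtype)
        for key in reference
    }
    merged = copy.deepcopy(models[0])
    merged.load_state_dict(avg_state)
    return merged

# Hypothetical usage: reward models fine-tuned with different seeds / data orders
# warm_rm = average_reward_models([rm_seed0, rm_seed1, rm_seed2])
# reward = warm_rm(prompt_response_features)
```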
News
- OpenAI’s Sam Altman Is Raising Money to Set Up AI Chip Factories. A new report reveals that OpenAI CEO Sam Altman is gearing up to raise money to set up his own network of AI chip factories.
- Google DeepMind scientists in talks to leave and form AI startup. A pair of scientists at Google’s artificial intelligence subsidiary DeepMind is in talks with investors to form an AI startup in Paris, Bloomberg News reported on Friday, citing people familiar with the conversations.
- The AI phones are coming. We’re tired of tapping through apps on our phones all day. Can Samsung show us an AI tool to save us?
- How Microsoft found a potential new battery material using AI. Advances in AI and high-performance computing are changing the way scientists look for new battery materials.
- Google will pitch Bard Advanced as providing ‘complex, better responses’. At the start of December, Google said Gemini Ultra would launch in early 2024 and be available in “Bard Advanced.” When it launches, Google will position Bard Advanced as providing “complex, better responses.”
- Stability AI unveils smaller, more efficient 1.6B language model as part of ongoing innovation. Stability AI, the vendor that is perhaps best known for its Stable Diffusion text-to-image generative AI technology, today released one of its smallest models yet with the debut of Stable LM 2 1.6B.
- Tesla finally releases FSD v12, its last hope for self-driving. Tesla has finally started releasing its FSD Beta v12 update to customers, which is sort of its last hope to deliver on its self-driving promises.
- Code LoRA From Scratch. LoRA, which stands for Low-Rank Adaptation, is a popular technique to finetune LLMs more efficiently. Instead of adjusting all the parameters of a deep neural network, LoRA focuses on updating only a small set of low-rank matrices. This Studio explains how LoRA works by coding it from scratch, which is an excellent exercise for looking under the hood of an algorithm (a minimal LoRA layer sketch follows this list).
- Microsoft’s Nadella Wants Stability at OpenAI, Not Control. In the midst of regulatory reviews in the EU and the UK, Microsoft CEO Satya Nadella is happy with the current state of Microsoft’s cooperation with OpenAI, emphasizing stability over control. He highlights both Microsoft’s substantial investment in OpenAI and its own independent AI research.
- ElevenLabs Releases New Voice AI Products and Raises $80M Series B. The funding and new releases strengthen its position in voice AI research and product development.
- Google Chrome gains AI features, including a writing helper, theme creator, and tab organizer. Google’s Chrome web browser is getting an infusion of AI technology in the latest release. The company announced today it’s soon adding a trio of new AI-powered features to Chrome for Mac and Windows, including a way to smartly organize your tabs, customize your theme, and get help when writing things on the web — like forum posts, online reviews, and more.
- Anthropic researchers find that AI models can be trained to deceive. Most humans learn the skill of deceiving other humans. Can AI models learn the same? The answer seems to be yes, and, terrifyingly, they’re exceptionally good at it.
- Google shows off Lumiere, a space-time diffusion model for realistic AI videos. Lumiere is a space-time diffusion model proposed by researchers from Google, the Weizmann Institute of Science, and Tel Aviv University to help with realistic video generation.
- Adept Fuyu-Heavy: A new multimodal model. Adept Fuyu-Heavy is a new multimodal model designed specifically for digital agents. In particular, Fuyu-Heavy scores higher on the MMMU benchmark than even Gemini Pro.
- Report: Apple Making ‘Significant’ Push to Bring AI to iPhones. Apple is reportedly making a major push to bring artificial intelligence (AI) to the iPhone.
- Hugging Face and Google partner for open AI collaboration. Hugging Face announced a strategic partnership with Google Cloud to democratize good machine learning. The two will collaborate across open science, open source, cloud, and hardware to enable companies to build their own AI with the latest open models from Hugging Face and the latest cloud and hardware features from Google Cloud.
- OpenAI’s New embedding models and API updates. OpenAI is launching a new generation of embedding models, new GPT-4 Turbo and moderation models, new API usage management tools, and, soon, lower pricing on GPT-3.5 Turbo.
- Announcing Qdrant’s $28M Series A Funding Round. The company behind the vector database, which powers parts of ChatGPT and X’s “More like this” feature, has secured funds to enhance its enterprise solutions and extend its Rust-based vector store.
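As a companion to the “Code LoRA From Scratch” item above, here is a minimal sketch of the core idea: freeze the pretrained weight matrix and learn only a low-rank update added on top of it. The class name, rank, and scaling below are illustrative choices, not the Studio’s actual code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights

        # Low-rank factors: delta_W = B @ A, with A (rank x in) and B (out x rank)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank correction
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage: wrap an existing layer and train only the LoRA parameters.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # 2 * 8 * 768 = 12288 instead of ~590k
```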
Resources
- nanotron. The objective of this library is to provide easy-to-use distributed training primitives for efficiently training a variety of models with 3D parallelism.
- DataTrove. DataTrove is a library to process, filter, and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.
- CaptionIMG. A simple program written in Python to manually caption your images (or any other file types) so you can use them for AI training. The author uses it for Dreambooth training (Stable Diffusion).
- AI Toolkit. AI Toolkit is a header-only C++ library that provides tools for building the brain of your game’s NPCs.
- Face Mixer Diffusion. This piece demonstrates how to clone faces in photos using diffusion. Although there are other methods for creating deep fakes, diffusion is intriguing since it allows for the necessary inpainting of other image elements.
- Self-Rewarding Language Model. Implementation of the training framework proposed in the Self-Rewarding Language Model paper from Meta AI.
- snorkelai/Snorkel-Mistral-PairRM-DPO. A powerful new Mistral tune that creates a DPO-compatible dataset by cleverly using weak supervision and synthetic data. Numerous iterations of the described procedure can be applied to a broad range of enterprise use cases.
- nanoColBERT. A minimal implementation of ColBERT, a powerful late-interaction model that can perform both retrieval and reranking.
- RPG-DiffusionMaster. RPG is a powerful training-free paradigm that can use proprietary MLLMs (e.g., GPT-4, Gemini-Pro) or open-source local MLLMs (e.g., miniGPT-4) as the prompt recaptioner and region planner, together with complementary regional diffusion, to achieve SOTA text-to-image generation and editing. The framework is very flexible and can generalize to arbitrary MLLM architectures and diffusion backbones.
- Matrix Multiplication: Optimizing the code from 6 hours to 1 sec. A brief read about matrix multiplication optimizations particular to certain hardware and a generic procedure to accelerate AI programs.
- SyncTalk: Mastering Realism in Talking Head Videos. A significant advancement in realistic talking head videos is SyncTalk. It solves earlier problems with lip motions, expressions, and facial identity synchronization.
- Hallucination Leaderboard. Public LLM leaderboard computed using Vectara’s Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document. We plan to update this regularly as our model and the LLMs get updated over time.
- Embedding English Wikipedia in under 15 minutes. Modal provides a serverless solution for organizations grappling with scaling workloads. Modal’s technology enables rapid scaling across many GPUs, which we can use to run large-scale workloads, such as generating embeddings for a massive text dataset, at lightning speed.
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability. Neel Nanda is among the founders of Mechanistic Interpretability (MI), and this serves as his entry guide into the field. It contains two hundred concrete, open-ended questions. MI, the study of reverse-engineering the internal computations of language models, involves directly examining neurons and circuits. Even though the field is still young, it is accessible because it doesn’t demand a lot of computing power.
- The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation. SDD contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in the evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation, and music-language retrieval.
- DiffMoog: A Modular Differentiable Commercial-like Synthesizer. This repo contains the implementation of DiffMoog, a differentiable, subtractive, modular synthesizer, incorporating standard architecture and sound modules commonly found in commercial synthesizers.
- TensorDict. TensorDict is a dictionary-like class that inherits properties from tensors, such as indexing, shape operations, casting to device, or point-to-point communication in distributed settings. The main purpose of TensorDict is to make code bases more readable and modular by abstracting away tailored operations (see the short usage sketch after this list).
- Evaluation Metrics for LLM Applications In Production. How to measure the performance of LLM applications without ground truth data.
- Asynchronous Local-SGD Training for Language Modeling. This repository contains a Colab notebook that presents a minimal toy example replicating the observed optimization challenge in asynchronous Local-SGD. The task is to perform classification on a mixture of mixtures of Gaussian data.
- SpeechGPT: Speech Large Language Models. A novel speech synthesis model called SpeechGPT-Gen effectively manages the intricacies of language and voice traits.
- LLM Steer. A Python module to steer LLM responses towards a certain topic or subject and to enhance capabilities (e.g., making it provide correct responses to tricky logical puzzles more often). A practical tool for activation engineering that adds steering vectors to different layers of a Large Language Model (LLM); it is meant to be used alongside the Transformers library (a generic steering-vector sketch follows this list).
- RoMa: A lightweight library to deal with 3D rotations in PyTorch. RoMa (which stands for Rotation Manipulation) provides differentiable mappings between 3D rotation representations, mappings from Euclidean to rotation space, and various utilities related to rotations. It is implemented in PyTorch and aims to be an easy-to-use and reasonably efficient toolbox for Machine Learning and gradient-based optimization.
- AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agent. AgentBoard is a benchmark designed for multi-turn LLM agents, complemented by an analytical evaluation board for detailed model assessment beyond final success rates, and it reports the performance of different LLMs across various environments.
- makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch. This blog walks through implementing a sparse mixture of experts language model from scratch. It is inspired by and largely based on Andrej Karpathy’s project ‘makemore’ and borrows a number of reusable components from that implementation (a compact top-k routing sketch follows this list).
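For the TensorDict entry above, a short usage sketch of the dictionary-that-behaves-like-a-tensor idea; the keys and shapes are invented for illustration.

```python
import torch
from tensordict import TensorDict

# A batch of 4 samples, each with an image and a label, stored in one structure
data = TensorDict(
    {"image": torch.randn(4, 3, 32, 32), "label": torch.randint(0, 10, (4,))},
    batch_size=[4],
)

sample = data[0]      # indexing applies to every entry: image (3, 32, 32), label ()
first_two = data[:2]  # slicing keeps the TensorDict structure with batch_size [2]
# data = data.to("cuda")  # a single call moves every tensor to the GPU (if available)

print(sample["image"].shape, first_two.batch_size)
```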
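The LLM Steer entry describes activation engineering: adding a “steering vector” to a layer’s hidden states at inference time. The sketch below illustrates that idea with a plain PyTorch forward hook on a Hugging Face GPT-2 model; it is not the llm_steer module’s actual API, and the layer index, coefficient, and concept prompt are arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model, just for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Build a steering vector from a concept prompt: the mean hidden state at one layer.
layer_idx, coeff = 6, 4.0
with torch.no_grad():
    concept = tokenizer("happy joyful cheerful", return_tensors="pt")
    hidden = model(**concept, output_hidden_states=True).hidden_states[layer_idx]
    steer_vector = hidden.mean(dim=1)  # (1, hidden_size)

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    steered = output[0] + coeff * steer_vector
    return (steered,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
prompt = tokenizer("Today I feel", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()  # stop steering afterwards
```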
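For the makeMoE entry, here is a compact sketch of the central building block such walkthroughs implement: a top-k gate that routes each token to a few expert MLPs and mixes their outputs. Dimensions and naming are illustrative rather than taken from the repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Token-level sparse mixture of experts with a top-k softmax gate."""

    def __init__(self, dim: int = 128, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -> flatten tokens for routing
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.gate(tokens)                      # (tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)          # renormalize over chosen experts

        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            # Which (token, slot) pairs selected this expert?
            token_idx, slot_idx = (topk_idx == expert_id).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape_as(x)

moe = SparseMoE()
y = moe(torch.randn(2, 16, 128))  # output has the same shape as the input: (2, 16, 128)
```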
Perspectives
- Text-to-Video: The Task, Challenges and the Current State. Text-to-video is next in line in the long list of incredible advances in generative models. How do these models work, how do they differ from text-to-image models, and what kind of performance can we expect from them?
- My AI Timelines Have Sped Up (Again). In light of progress in scaling up models, the author has updated their AI timeline forecasts: they now assign a 10% probability to artificial general intelligence by 2028 and a 50% probability by 2045. The shift is credited to the efficacy of massive language models and the realization that many intelligent capabilities may emerge at scale.
- Should The Future Be Human? Elon Musk and Larry Page have a deep disagreement over the possible risks associated with artificial intelligence. Page has called Musk a “speciesist” for favoring humans over digital life forms, which has caused a gap in their friendship. This demonstrates the necessity for careful and deliberate development of AI technology and reflects the larger discussion on the influence of AI, which includes worries about consciousness, individuation, art, science, philosophy, and the potential for mergers between humans and AI.
- Computers make mistakes and AI will make things worse — the law must recognize that. A tragic scandal at the UK Post Office highlights the need for legal change, especially as organizations embrace artificial intelligence to enhance decision-making.
- Google AI has better bedside manner than human doctors — and makes better diagnoses. Researchers say their artificial intelligence system could help to democratize medicine.
- Tech developers must respect equitable AI access. The authors argue for a legal framework to ensure equitable access to artificial intelligence (AI) tools, such as ChatGPT, to avoid limiting their benefits to a privileged few.
- Seven technologies to watch in 2024. Advances in artificial intelligence are at the heart of many of this year’s most exciting areas of technological innovation.
- If AI Were Conscious, How Would We Know? When discussing AI consciousness, references to Searle’s Chinese Room thought experiment and the Turing Test are frequently made. The Turing Test examines whether an AI’s behavior can be distinguished from that of a human, while the Chinese Room contends that exterior behavior alone is insufficient to demonstrate consciousness. Given that our understanding of consciousness is mostly derived from functionalist theories and human experience, this debate highlights how difficult it is to define and identify consciousness in AI.
- AI today and trends for an AI future. A survey of experts on: How are early adopters using AI today? Where is AI going in 2024?
Meme of the week
What do you think about it? Did any news capture your attention? Let me know in the comments.
If you have found this interesting:
You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository, which contains weekly updated ML & AI news. I am open to collaborations and projects.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.
or you may be interested in one of my recent articles: