WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 20–26 May

Google's AI summary issues, Apple's new features, OpenAI changing its new model's voice, and much more

Salvatore Raieli
18 min read · May 28, 2024

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. Single posts are also collected here:

Weekly AI and ML news - each week the best of the field


Research

  • LoRA Learns Less and Forgets Less. LoRA is a popular technique for fine-tuning models to add style or domain expertise. This paper examines the trade-off between learning and forgetting when using LoRA: it learns less than full fine-tuning, but retains more of the base model's original "out of distribution" performance (a minimal usage sketch follows this list). https://arxiv.org/pdf/2405.09673
  • Chameleon: Mixed-Modal Early-Fusion Foundation Models. Like GPT-4o, Meta's Chameleon is a natively multimodal model that works with text and images simultaneously, and it outperforms many other models. It shows how far the Meta team's internal model work has advanced. https://arxiv.org/pdf/2405.09818
  • Mapping the Mind of a Large Language Model. Anthropic reports a significant advance in understanding the inner workings of AI models: they identified how millions of concepts are represented inside Claude Sonnet, one of their deployed large language models. This is the first detailed look inside a modern, production-grade LLM, and the interpretability discovery could, in the future, help make AI models safer. https://www.anthropic.com/research/mapping-mind-language-model
  • Smart Expert System: Large Language Models as Text Classifiers. Text classification is a fundamental task in Natural Language Processing (NLP), and the advent of Large Language Models (LLMs) has revolutionized the field. This paper introduces the Smart Expert System, a novel approach that leverages LLMs as text classifiers.
https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/
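
As mentioned in the LoRA item above, here is a minimal sketch of a LoRA fine-tuning setup using Hugging Face's peft library; the model name and hyperparameters are illustrative, not taken from the paper:

```python
# Minimal LoRA setup sketch with Hugging Face peft.
# Model name and hyperparameters are illustrative, not from the paper.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```

Because only the small adapter matrices are trained while the base weights stay frozen, the paper's finding that LoRA "forgets less" than full fine-tuning is intuitive: most of the original model is untouched.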

News

https://arxiv.org/pdf/2405.13063
https://crfm.stanford.edu/2024/05/21/fmti-may-2024.html
https://github.com/metaskills/experts
https://blog.roboflow.com/paligemma-multimodal-vision/
  • Nvidia reports stratospheric growth as the AI boom shows no sign of stopping. The chipmaker reports strong demand and higher-than-expected revenue, even as other companies spend to develop their own chips.
  • Mistral AI and Harvey Partnership. Mistral has teamed up with Harvey, a legal AI company. Although the announcement offers few specifics, the two will likely collaborate on a custom model for legal work.
  • French AI startup H raises $220M seed round. H, a startup based in Paris and previously known as Holistic AI, announced a $220 million seed round just a few months after the company’s inception.
  • Reflections on our Responsible Scaling Policy. With an emphasis on continuous improvement and cooperation with industry and government, Anthropic’s Responsible Scaling Policy aims to prevent catastrophic AI safety failures by identifying high-risk capabilities, testing models frequently, and enforcing strict safety requirements.
  • Introducing Aya. A global initiative led by Cohere For AI involving over 3,000 independent researchers across 119 countries. Aya is a state-of-the-art model and dataset that pushes the boundaries of multilingual AI across 101 languages through open science (a usage sketch follows this list).
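
A hedged sketch of querying Aya through Hugging Face transformers; the checkpoint name CohereForAI/aya-101 and the generation settings are assumptions for illustration, not from Cohere's announcement:

```python
# Sketch: multilingual generation with Cohere For AI's Aya model.
# Checkpoint name and generation settings are assumptions for illustration.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "CohereForAI/aya-101"  # mT5-based model covering 101 languages
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("Translate to Turkish: How are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```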

Resources

  • model-explorer. A new model explorer from Google makes it simple to visualize the computation graph of your models. It may be useful for performance engineering and debugging.
  • real-time inference demo for PaliGemma. You can run Google's latest VLM in real time using GPT-Fast. Given how easy the model is to fine-tune for specific tasks, this opens up many powerful downstream applications.
  • Multi AI Agent Systems using OpenAI’s Assistants API (Experts.js). Experts.js is the easiest way to create and deploy OpenAI’s Assistants and link them together as Tools to create a Panel of Experts system with expanded memory and attention to detail.
  • First-ever AI Code Interpreter for R. Julius is the leading generative AI tool for data analysis. Designed to perform statistical analysis, data science, and computational tasks, it combines cutting-edge foundational models like GPT-4o, Claude 3, and Gemini 1.5 with robust coding capabilities in Python and R.
  • Moondream WebGPU. A 1.86-billion-parameter VLM (Vision-Language Model) optimized for inference on the web. Once downloaded, the model (1.8 GB) is cached and reused when you revisit the page. Everything runs directly in your browser using 🤗 Transformers.js and ONNX Runtime Web, meaning your conversations aren’t sent to a server. You can even disconnect from the internet after the model has loaded! https://huggingface.co/spaces/Xenova/experimental-moondream-webgpu
  • Devon: An open-source pair programmer. You can select different models for multi-file editing, codebase exploration, config writing, test writing, bug fixing, and architecture exploration.
  • llama3 implemented from scratch. This repository implements Llama 3 from scratch, one tensor and matrix multiplication at a time, loading the weights directly from the model file Meta provides for Llama 3 (a toy attention sketch follows this list).
https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf
  • PSG4D — 4D Panoptic Scene Graph Generation. The PSG4D (4D Panoptic Scene Graph Generation) Task is a novel task that aims to bridge the gap between raw visual inputs in a dynamic 4D world and high-level visual understanding. It involves generating a comprehensive 4D scene graph from RGB-D video sequences or point cloud video sequences.
  • microsoft/Phi-3-medium-128k-instruct. The Phi-3-Medium-128K-Instruct is a 14B parameter, lightweight, state-of-the-art open model trained with the Phi-3 datasets that include both synthetic data and the filtered publicly available website data with a focus on high-quality and reasoning-dense properties.
  • Debiasing Large Visual Language Models. A post-hoc debiasing method and a Visual Debias Decoding strategy. These strategies not only help minimize hallucinations but also contribute to more helpful and precise outputs.
  • DeepSeek-VL. An open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. DeepSeek-VL has general multimodal understanding capabilities and can process logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios. https://github.com/deepseek-ai/deepseek-vl
  • MiniCPM-V. MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take images and text as inputs and provide high-quality text outputs. Since February 2024, four versions have been released, aiming for strong performance and efficient deployment.
  • OLAPH: Improving Factuality in Biomedical Long-form Question Answering. A new benchmark dataset called MedLFQA was created to enhance the factual accuracy of long-form answers from large language models in the medical domain. OLAPH is a framework that uses preference optimization and automatic evaluations to train LLMs to reduce errors.
https://arxiv.org/pdf/2405.10508
  • Tarsier. Tarsier, a new tool from Reworkd, visually tags webpage elements with brackets and IDs to help LLMs with web-interface tasks. Through OCR-generated text representations, Tarsier lets an LLM without vision understand a webpage's structure, beating vision-language models on benchmarks.
  • mistralai/Mistral-7B-Instruct-v0.3. The Mistral-7B-Instruct-v0.3 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.3.
  • Distributed inference on llama.cpp. llama.cpp now supports distributed inference across several machines. Although it is currently restricted to FP16, this is a significant step toward open-source deployment.
  • Enhancing Long-Term Memory for Language Models. A novel method called Streaming Infinite Retentive LLM (SirLLM) helps large language models retain longer memory over the course of extended dialogues.
  • Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering. Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs.
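
In the spirit of the "llama3 implemented from scratch" item above, here is a toy sketch of a single causal self-attention step built from raw tensors; dimensions and random weights are illustrative (the repo loads Meta's actual checkpoint):

```python
# Toy single-head causal self-attention, one matmul at a time.
# Dimensions are illustrative; weights are random stand-ins for real ones.
import torch

seq_len, d_model, d_head = 8, 64, 16
x = torch.randn(seq_len, d_model)      # token embeddings
w_q = torch.randn(d_model, d_head)     # projection weights (random here;
w_k = torch.randn(d_model, d_head)     # the repo loads them from Meta's
w_v = torch.randn(d_model, d_head)     # Llama 3 model file)

q, k, v = x @ w_q, x @ w_k, x @ w_v    # project to queries/keys/values
scores = (q @ k.T) / d_head ** 0.5     # scaled dot-product attention
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))  # causal mask
attn = torch.softmax(scores, dim=-1)   # attention weights per token
out = attn @ v                         # weighted sum of values
print(out.shape)                       # torch.Size([8, 16])
```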

Perspectives

https://groundedscenellm.github.io/grounded_3d-llm.github.io/
  • What’s up with Llama 3? Arena data analysis. Meta's Llama 3-70B outperforms competitors in the English Chatbot Arena on open-ended creative tasks, but it struggles with more technical prompts. The analysis shows that Llama 3's win rate drops as prompts get harder and that it excels at friendly, conversational responses. Its approachability may have helped it win, but more research is needed to determine its true competitive advantage.
  • ChatGPT can talk, but OpenAI employees sure can’t. OpenAI's stringent non-disparagement agreements, which forbid former employees from criticizing the company on pain of forfeiting their vested equity, have come to light with the exits of Ilya Sutskever and Jan Leike. In response to the reporting, CEO Sam Altman said the policy would be corrected.
  • AlphaFold3 — why did Nature publish it without its code? The latest iteration of the protein-structure-prediction algorithm AlphaFold has generated a great deal of interest since its release, accompanied by a paper in Nature, earlier this month. But its release has also prompted questions, and criticism, of both the AlphaFold team at Google DeepMind in London and Nature.
  • China’s ChatGPT: what a boom in Chinese chatbots means for AI. ChatGLM is one of hundreds of AI language models being developed for the Chinese language. It comes close to ChatGPT on many measures, say its creators.
https://arxiv.org/pdf/2405.10612v1
  • The Old-Fashioned Library at the Heart of the A.I. Boom. OpenAI’s remodeled mayonnaise factory headquarters, with its library-themed interior design, is a symbol of the company’s success with ChatGPT, which focuses on language. On the other hand, the office reminds people of the current legal disputes around the use of copyrighted content in AI training. The library is seen as a place for inspiration by OpenAI employees, despite these disagreements, which supports their conviction that AI-driven and human creativity can work together harmoniously.
  • Chaos and tension at OpenAI. Ilya Sutskever's departure over concerns about OpenAI's dedication to AI safety is especially notable given that three other key employees have also quit recently. The departures raise questions about how the company's commercial push squares with its nonprofit roots and safety-focused mission, and they may draw attention from regulators and Washington stakeholders.
  • AI is the reason interviews are harder now. This essay discusses how technical interview questions are becoming more complex, with employers expecting candidates to solve harder problems faster. It notes that even non-technical candidates can use AI tools like Ultracode to pass such interviews, and it recommends in-person interviews as a way to ensure applicants genuinely have the programming skills the position requires.
  • What I’ve Learned Building Interactive Embedding Visualizations. An enthusiast of interactive embedding visualizations describes a well-honed process for producing them: visuals that illustrate the complex relationships between items depicted as points in space. The pipeline runs from data gathering through co-occurrence matrix construction, sparsification, PyMDE embedding, and 2D projection to a clear visual representation. The author builds the UI as a web app, favoring its accessibility and GPU-accelerated rendering (a minimal PyMDE sketch follows below).
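
A minimal sketch of the PyMDE step in the pipeline described above; the random data is a stand-in for a real sparsified co-occurrence matrix, and the sizes are illustrative:

```python
# Sketch of the embedding step from the visualization pipeline above.
# Random data stands in for a real sparsified co-occurrence matrix.
import numpy as np
import pymde

data = np.random.rand(500, 50).astype(np.float32)  # 500 items, 50 features

# PyMDE computes a neighbor-preserving 2D layout of the items.
mde = pymde.preserve_neighbors(data, embedding_dim=2)
embedding = mde.embed()      # (500, 2) tensor of point positions
print(embedding.shape)       # these 2D points feed the web-app renderer
```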

Meme of the week

What do you think? Did any news capture your attention? Let me know in the comments.


Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence