WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES

AI & ML news: Week 1–7 July

Gemini on Apple, Grok 2 announced and much more

Salvatore Raieli
17 min read · Jul 9, 2024
Photo by Jon Tyson on Unsplash

The most interesting news, repositories, articles, and resources of the week

Check and star this repository where the news will be collected and indexed:

You will find the news first on GitHub. Single posts are also collected here:

Weekly AI and ML news - each week the best of the field


Research

  • LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs. proposes LongRAG, which combines RAG with long-context LLMs to enhance performance: a long retriever significantly reduces the number of extracted units by operating on longer retrieval units, and a long reader takes in those units and leverages the zero-shot answer-extraction capability of long-context LLMs to improve the overall system; claims 64.3% on HotpotQA (full-wiki), on par with the state-of-the-art model.
  • From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data. suggests a fine-tuning strategy that improves information retrieval in LLMs while preserving reasoning abilities over long-context inputs; the fine-tuning dataset consists of 350 synthetic numerical dictionary key-value retrieval tasks; results show that this strategy mitigates the “lost-in-the-middle” effect and improves performance on both long-context reasoning and information retrieval (a toy sketch of such a task appears after this list).
  • GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models. enhances the long-context capabilities of LLMs with a graph-based agent system that organizes long text into a graph and uses an agent to explore it (via predefined functions guided by a step-by-step rational plan) to answer questions efficiently; consistently outperforms GPT-4-128k across context lengths from 16k to 256k.
  • Following Length Constraints in Instructions. presents a method for addressing length bias and training language models that adhere more closely to length constraints; it fine-tunes a model with DPO on a dataset augmented with length instructions and demonstrates fewer length-constraint violations while maintaining high response quality.
  • Adam-mini: Use Fewer Learning Rates To Gain More. a new optimizer that carefully partitions parameters into blocks and assigns each block a single, well-chosen learning rate; it achieves consistent results on language models from 125M to 7B parameters across pre-training, SFT, and RLHF. Using far fewer learning rates cuts the memory footprint by 45%–50% while performing on par with or better than AdamW (a minimal sketch of the block-wise idea follows this list).
  • MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data. a generative image model that outperforms purely text-conditioned models thanks to its ability to condition on interleaved text and images.
  • Scaling Synthetic Data Creation with 1,000,000,000 Personas. treats web text as if written from the perspective of a persona and conditions synthetic data generation on those personas, substantially boosting downstream task performance; the researchers report a jump of about 20 percentage points on MATH (see the prompt-template sketch after this list).
  • Odd-One-Out: Anomaly Detection by Comparing with Neighbors. Researchers introduce a novel anomaly-detection challenge that focuses on objects that look unusual compared with the other objects in the scene. Unlike conventional setups, anomalies here are specific to the scene and can be identified from multiple viewpoints.
  • Adaptable Logical Control for Large Language Models. This approach enables control of model generation at inference time, as well as interactive text editing. It achieves strong performance with tiny models and enforces logical constraints during generation.
  • Pairwise Difference Learning for Classification. Researchers have extended Pairwise Difference Learning (PDL), originally developed as a regression method, to classification tasks. PDL predicts the differences between pairs of instances rather than the outcomes themselves.
  • AXIAL. This research improves the explainability of model decisions by proposing a novel technique for detecting Alzheimer’s disease from 3D MRI scans.
  • Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization. A novel technique called Multi-Session SLAM jointly estimates camera poses across multiple disconnected video sequences within a single global frame of reference.
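
As a concrete illustration of the synthetic-needles idea above, here is a minimal sketch of the kind of numerical key-value retrieval sample the paper fine-tunes on. The prompt template and the sizes are assumptions for illustration, not the paper’s exact format.

```python
import json
import random

def make_kv_retrieval_sample(num_dicts: int = 20, keys_per_dict: int = 4) -> dict:
    """One synthetic key-value retrieval example: the model must find
    one key buried among many distractor dictionaries."""
    keys = random.sample(range(10**6), num_dicts * keys_per_dict)  # globally unique keys
    dicts = [
        {str(k): str(random.randint(0, 10**6))
         for k in keys[i * keys_per_dict:(i + 1) * keys_per_dict]}
        for i in range(num_dicts)
    ]
    target_dict = random.choice(dicts)
    target_key = random.choice(list(target_dict.keys()))
    prompt = (
        f"Below is a list of JSON dictionaries. Return the value for key {target_key}.\n"
        + "\n".join(json.dumps(d) for d in dicts)
    )
    return {"prompt": prompt, "answer": target_dict[target_key]}

# The paper's fine-tuning set is on the order of 350 such samples.
dataset = [make_kv_retrieval_sample() for _ in range(350)]
```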
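Likewise, the core Adam-mini idea fits in a few lines: keep Adam’s per-coordinate momentum, but replace the per-coordinate second moment with a single scalar per parameter block. This is a minimal sketch, not the official implementation; for simplicity it uses one block per tensor, whereas the paper partitions more carefully (e.g., per attention head).

```python
import torch

class AdamMiniSketch(torch.optim.Optimizer):
    """Adam-style update with per-coordinate momentum but a single
    second-moment scalar per parameter block (here: one block per tensor)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:  # lazy state initialization
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros((), device=p.device)  # one scalar, not one per coordinate
                state["step"] += 1
                m, v = state["m"], state["v"]
                m.mul_(b1).add_(p.grad, alpha=1 - b1)
                v.mul_(b2).add_(p.grad.pow(2).mean(), alpha=1 - b2)  # block mean of squared gradients
                m_hat = m / (1 - b1 ** state["step"])
                v_hat = v / (1 - b2 ** state["step"])
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]), alpha=-group["lr"])
```

Because `v` collapses from one float per parameter to one per block, the optimizer state shrinks by roughly the size of Adam’s second-moment buffer, which is where the reported 45%–50% memory saving comes from.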
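The billion-personas result also rests on a very simple mechanism: conditioning the same data-generation prompt on many different personas. A toy sketch follows; the template, the example personas, and the `generate` stub are placeholders, not the paper’s actual prompts.

```python
# Toy sketch of persona-conditioned synthetic data generation.
# `generate` stands in for any LLM completion call.
personas = [
    "a high-school math teacher who loves geometry puzzles",
    "a quantitative analyst at a hedge fund",
    "a carpenter estimating materials for a job",
]

TEMPLATE = (
    "You are {persona}. Write one challenging math word problem that someone "
    "like you would naturally encounter, then solve it step by step."
)

def make_synthetic_math_data(generate, personas):
    # The same template, conditioned on diverse personas, yields diverse samples.
    return [generate(TEMPLATE.format(persona=p)) for p in personas]
```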

News

Resources

  • EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. speeds up LLM inference via speculative decoding, replacing a static draft tree with a context-aware dynamic one that adapts to draft-token acceptance rates; delivers large speedups while leaving the output distribution unchanged.
  • On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. survey on LLM-based synthetic data generation, curation, and evaluation.
  • Text2Bricks: Fine-tuning Open-Sora in 1,000 GPU Hours. Lambda Labs fine-tuned the Open-Sora video model on its 1-Click Cluster to create Lego-style movies.
  • Laplace Neural Operator. The Laplace Neural Operator is a neural-network architecture for approximating solutions to partial differential equations (PDEs).
  • llama-agents. llama-agents is an async-first framework for building, iterating, and productionizing multi-agent systems, including multi-agent communication, distributed tool execution, human-in-the-loop, and more!
  • Suri: Multi-constraint Instruction Following for Long-form Text Generation. Suri is a collection of 20,000 long documents paired with complex, multi-constraint instructions, built to improve AI’s ability to follow intricate writing requirements. The Suri team also introduces Instructional ORPO (I-ORPO), an alignment method that derives feedback from synthetically corrupted instructions.
  • Cambrian-1. High-performing, fully open vision-centric multimodal model from NYU, with significant improvements from its study of vision encoders and data mixtures.
  • DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability. A novel expressive text-to-speech (TTS) model called DEX-TTS makes use of reference speech to enhance style representation and model generalization.
  • Debugging in PyTorch. PyTorch is an excellent modeling tool, but a few common issues can significantly degrade model performance. This checklist will help you debug your model code.
  • vision-agent. Vision Agent is a library that helps you utilize agent frameworks to generate code to solve your vision task.
  • What to do to scale up? An excellent, surprisingly approachable post about how to adjust hyperparameters as model and dataset sizes increase.
  • Web2Code. Researchers have created a new pipeline to improve Web2Code instruction tuning. It involves generating new webpage image-code pairs, new text question-answer pairs, new webpage code-generation pairs, and improved webpage-understanding data.
  • Block Transformer: Global-to-Local Language Modeling for Fast Inference. This repository presents a new Transformer variant with a significantly smaller KV cache. Although it hasn’t been tested at scale, it should perform on par with standard Transformers.
  • Composio. Equip your agent with high-quality tools & integrations without worrying about authentication, accuracy, and reliability in a single line of code!
  • Segment Anything without Supervision. Unsupervised SAM (UnSAM) is a ‘segment anything’ model for promptable and automatic whole-image segmentation which does not require human annotations.
  • Following Length Constraints in Instructions. Most models don’t adhere to length specifications (e.g., “in fewer than 40 words”). This piece demonstrates how to tune them to do so (a sketch of the length-augmented preference data appears after this list).
  • AI Overviews Research: Comparing pre and post-rollout results on 100K keywords. The prevalence of Google’s AI Overviews (AIO) feature, which typically links to the top 10 organic results, has dropped sharply from 64% of SERPs pre-rollout to just 8.71% across 100K tracked keywords. Post-rollout, both the length of AIO content and the number of links have grown, reflecting Google’s focus on thorough answers and reliable sources. In this shifting search landscape, where longer, lower-volume, lower-CPC queries are more likely to trigger AI-generated results, SEO strategies must adapt to stay relevant.
  • Meta 3D Gen. Meta has trained an advanced 3D object generation model along with a PBR texture-generation system. It creates synthetic training data using the company’s proprietary 2D image-generation model.
  • Mutahunter. An open-source, language-agnostic, LLM-based mutation testing tool for automated software testing.
  • LLaRA: Large Language and Robotics Assistant. LLaRA is a framework that leverages Large Language Models (LLMs) and conversation-style instruction-response pairs to improve robot action policies. Its Vision Language Models (VLMs) process visual inputs to assess state information and produce optimal policy decisions.
  • MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment.
  • Parable of the Parser. Great keynote talk from CVPR.
  • InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation. Style transfer with modern diffusion models and content embedders.
  • RSCaMa: Remote Sensing Image Change Captioning with State Space Model. Researchers present RSCaMa, a novel technique for describing changes in remote-sensing images in natural language.
  • Simple Diffusion Language Models. Excellent talk about using diffusion as a training objective for language modeling, by Hugging Face researcher and Cornell Tech professor Sasha Rush.
  • 3D Reconstruction from Blurry Images. Researchers have created a technique that uses neural radiance fields (NeRF) and event streams to reconstruct 3D scenes from a single blurry image. By modeling camera motion and synthesizing brightness changes, the method eliminates the need for pre-computed camera poses and produces high-quality, view-consistent renderings from blurry inputs.
  • Agentless. Agentless is an agentless approach to automatically solving software development problems: for each issue, it follows a simple two-phase process of localization and repair.
  • MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. MInference is a novel technique that speeds up the processing of long prompts in large language models, using dynamic sparse attention to avoid the considerable delays of conventional approaches.
  • torch.compile, the missing manual. Manual for resolving torch.compile errors to make your code run faster.
  • facebook/multi-token-prediction. Meta has released weights for its multi-token prediction models, which perform remarkably well.
  • Maestro — A Framework for Claude Opus, GPT and local LLMs to Orchestrate Subagents. This Python script demonstrates an AI-assisted task-breakdown-and-execution workflow using the Anthropic API. It uses two models, Opus and Haiku, to break an objective into sub-tasks, execute each sub-task, and refine the results into a cohesive final output (a hedged sketch of this orchestrator pattern follows this list).
  • Magic Insert: Style-Aware Drag-and-Drop. A method from Google for inserting subjects into images in a style-aware way using diffusion. The demo and dataset are available.
  • Discrete Semantic Tokenization for Deep CTR Prediction. UIST is a novel method that transforms dense embeddings into discrete, compact tokens for user and item representations, significantly improving click-through-rate prediction.
  • CELLO: Causal Evaluation of Large Vision-Language Models. With 14,094 causal questions, CELLO is a new dataset designed to evaluate whether AI understands causality beyond commonsense reasoning.
  • OpenStreetView-5M. With more than 5 million geotagged street-view images from 225 countries, OpenStreetView-5M is a large open-access dataset for evaluating computer vision techniques for image geolocation.
  • PTQ4SAM: Post-Training Quantization for Segment Anything. A new framework, PTQ4SAM, reduces the memory and compute requirements of the large-scale Segment Anything Model (SAM).
  • Boosting Smartphone Camera Clarity. This study presents a technique for improving smartphone image resolution using a self-supervised learning model that enhances reference-based super-resolution (RefSR).
  • An Investigation of Incorporating Mamba for Speech Enhancement. SEMamba is a novel speech enhancement system that improves voice-signal clarity by leveraging the Mamba state-space model.
  • Florence 2 on WebGPU. The tiny vision model runs fully in the browser via ONNX and WebGPU.
  • FlexiFilm: Long Video Generation with Flexible Conditions. FlexiFilm is a diffusion model designed specifically to generate long videos (over 30 seconds) with high quality and consistency.
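
Two of the entries above lend themselves to tiny illustrations. First, for the length-constraints paper: a hedged sketch of how one might build length-instructed DPO preference pairs. The field names, the word-count criterion, and the augmentation template are assumptions for illustration, not the paper’s exact recipe.

```python
# Hypothetical helper: build a DPO preference pair keyed on length compliance.
def make_length_dpo_pair(prompt: str, resp_a: str, resp_b: str,
                         max_words: int = 40) -> dict | None:
    # Augment the prompt with an explicit length instruction.
    aug_prompt = f"{prompt}\nAnswer in at most {max_words} words."
    a_ok = len(resp_a.split()) <= max_words
    b_ok = len(resp_b.split()) <= max_words
    if a_ok == b_ok:
        return None  # both comply or both violate: no length-based preference signal
    chosen, rejected = (resp_a, resp_b) if a_ok else (resp_b, resp_a)
    return {"prompt": aug_prompt, "chosen": chosen, "rejected": rejected}

pair = make_length_dpo_pair(
    "Summarize the plot of Hamlet.",
    "Hamlet avenges his father's murder and nearly everyone dies.",
    "Hamlet, Prince of Denmark, learns from his father's ghost that " * 10,
)
```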
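Second, for the Maestro entry: a hedged sketch of the orchestrator/sub-agent pattern it describes, using the Anthropic Messages API. The prompts and the one-sub-task-per-line convention are illustrative assumptions, not Maestro’s actual code.

```python
# Sketch of an Opus-orchestrates / Haiku-executes workflow (assumptions noted above).
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(model: str, prompt: str) -> str:
    resp = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def run_objective(objective: str) -> str:
    # 1. Orchestrator (Opus) breaks the objective into sub-tasks.
    plan = ask("claude-3-opus-20240229",
               f"Break this objective into short sub-tasks, one per line:\n{objective}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    # 2. A cheaper sub-agent (Haiku) executes each sub-task.
    results = [ask("claude-3-haiku-20240307",
                   f"Objective: {objective}\nComplete this sub-task: {t}")
               for t in subtasks]
    # 3. The orchestrator refines everything into a final answer.
    return ask("claude-3-opus-20240229",
               "Combine these sub-task results into one cohesive output:\n"
               + "\n\n".join(results))
```

Splitting planning and execution across a strong and a cheap model keeps cost down while preserving quality at the decomposition and refinement steps, which is the design choice the Maestro write-up highlights.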

Perspectives

  • Smudgy chins, weird hands, dodgy numbers: seven signs you’re watching a deepfake. Look out for surplus fingers, compare mannerisms with real recordings, and apply good old-fashioned common sense and skepticism, experts advise.
  • Training MoEs at Scale with PyTorch. The Mosaic team has teamed up with PyTorch to write about scaling their MoE models to thousands of GPUs.
  • Investing in the Age of Generative AI. Though there is currently investment “euphoria,” the generative AI business is already showing signs of fragility.
  • Can AI boom drive Nvidia to a $4tn valuation despite investor doubt? Powerful new chips are on the way, but there are questions over whether the tech firm’s growth can be sustained.
  • AI scaling myths. It is improbable that LLMs will achieve AGI through scaling alone. Although scaling has been found to improve model capabilities, it largely improves perplexity rather than unlocking emergent skills, and high-quality training data is becoming harder and harder to obtain.
  • A discussion of discussions on AI bias. The nature of AI bias has come under more scrutiny, with critics arguing that biases in machine learning show up in the way models like Playground AI occasionally change a user’s ethnicity in photos. Some users dispute whether this constitutes a flaw or meaningful bias, pointing to instances in which Asian features are overrepresented. The discussion touches on the wider ramifications of AI bias across many industries; there is no easy answer to this complicated problem.
  • The shape of information. This article describes how to use binary logic to maximize scarce resources.
  • Why we no longer use LangChain for building our AI agents. Octomind’s codebase and team productivity improved after it dropped the LangChain framework for its AI test-automation agents in favor of simpler, modular building blocks. It found that LangChain’s high-level abstractions were rigid, making development and maintenance harder. With the change of strategy, Octomind now benefits from a leaner architecture and faster iteration on its AI agent tasks.
  • The Five Stages Of AI Grief. Benjamin Bratton, a professor at the University of California, San Diego and director of the Antikythera program at the Berggruen Institute, calls the global response to artificial intelligence a “Copernican Trauma,” comparing it to historical shifts that reshaped humanity’s understanding of itself. Bratton offers five stages of “AI grief” to describe how society reacts to AI’s evolution, from skepticism to integration into our conception of intelligence: denial, anger, bargaining, depression, and acceptance. He contends that the integration of AI is not a uniquely human story but part of a larger biological and technological evolutionary process.
  • How to win at Enterprise AI — A playbook. This playbook describes AI adoption strategies for enterprises, emphasizing the shift from human-performed services to software-driven workflows (“Service-as-Software”). It explores how these changes may affect business models, including performance-based pricing, and stresses how crucial workflow capture and AI accuracy are to successful implementation. It also covers threats such as lateral attacks and emphasizes that in enterprise contexts, AI must demonstrate real performance, not just potential.
  • AI is disrupting Customer Support. Salesforce is feeling the pinch. Customer support software providers like Salesforce and Zendesk are facing challenges as enterprises redirect their IT spending toward AI proof-of-concept projects. For traditional software suppliers, the increasing integration of solutions such as ChatGPT in customer assistance has resulted in longer payback periods due to higher customer acquisition expenses. The creativity of these businesses and the overall macroeconomic climate will determine how much money is invested in customer support software in the future.
  • Contra Acemoglu on AI. In contrast to more positive projections, economist Daron Acemoglu’s working paper on AI proposes a modest 0.06% annual rise in TFP growth. He identifies four distinct ways that AI affects productivity, but he ignores the development of new labor-intensive goods and the further automation of existing processes, perhaps underestimating the economic potential of AI. His method is criticized for being unduly restrictive and for perhaps distorting the wider socioeconomic effects of AI developments.
  • Inside the maths that drives AI. Loss functions measure algorithmic errors in artificial intelligence models, but there’s more than one way to do that. Here’s why the right function is so important.
  • ‘The disruption is already happening!’ Is AI about to ruin your favorite TV show? It won’t be long until everything from Drag Race to Keeping Up With the Kardashians could be written without humans, and you might be able to write yourself in as the hero of a new show. But will robot TV ever be up to snuff?
  • Can the climate survive the insatiable energy demands of the AI arms race? New computing infrastructure means big tech is likely to miss emissions targets, but they can’t afford to get left behind in a winner-takes-all market.
  • Our attitudes towards AI reveal how we feel about human intelligence. We’re in the untenable position of regarding AI as alien because we’re already in the position of alienating each other.

Meme of the week

What do you think? Did any of this news capture your attention? Let me know in the comments.


Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence