WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES
AI & ML news: Week 3–9 June
NVIDIA is valued at $3 trillion, Anthropic and OpenAI are interpreting LLMs, and much more
The most interesting news, repositories, articles, and resources of the week
Check and star this repository where the news will be collected and indexed:
You will find the news first on GitHub. Single posts are also collected here:
Research
- Contextual Position Encoding: Learning to Count What’s Important. Proposes CoPE, a new position encoding method that makes position context-dependent by incrementing the position counter only on certain tokens; this lets the model attend to the i-th particular word, noun, or sentence and represent different levels of position abstraction, and it improves perplexity on language modeling and coding tasks (a minimal sketch of the counting idea appears after this list).
- Faithful Logical Reasoning via Symbolic Chain-of-Thought. Suggests a way to enhance LLMs’ capacity for logical reasoning by combining logical rules and symbolic expressions with chain-of-thought (CoT) prompting; this prompting method, called Symbolic Chain-of-Thought, is a fully LLM-based framework consisting of the following key steps: 1) it converts the natural-language context into a symbolic format, 2) it creates a step-by-step solution plan based on symbolic logical rules, and 3) it employs a verifier to validate the translation and the reasoning chain.
- Transformers Can Do Arithmetic with the Right Embeddings. The main problem this work addresses is the inability of transformers to track the exact position of digits; the fix is to add an embedding to each digit that encodes its position relative to the start of the number. With this change, a model trained on only 20-digit numbers on a single GPU achieves 99% accuracy on 100-digit addition problems, and the gains also transfer to multi-step reasoning tasks such as sorting and multiplication (see the digit-offset sketch after this list).
- GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning. blends the reasoning powers of GNNs with the language understanding skills of LLMs in a RAG fashion; the GNN extracts relevant and useful graph information, and the LLM uses the information to answer questions over knowledge graphs (KGQA); GNN-RAG outperforms or matches GPT-4 performance with a 7B tuned LLM, and improves vanilla LLMs on KGQA.
- Attention as an RNN. Presents a new attention mechanism that can be trained in parallel (like Transformers) and updated with new tokens at constant memory cost during inference (like RNNs); it is based on the parallel prefix scan algorithm, which enables efficient computation of attention’s many-to-many RNN output, and it achieves performance comparable to Transformers on 38 datasets while being more time- and memory-efficient.
- Are Long-LLMs A Necessity For Long-Context Tasks? Suggests a reasoning framework that allows short-LLMs to handle long-context tasks by adaptively accessing and utilizing the context according to the task at hand; it breaks the long context into short contexts and processes them through a decision-making process, arguing that long-LLMs are not strictly necessary for solving long-context tasks.
- Sparse maximal update parameterization: A holistic approach to sparse training dynamics. All frontier model labs use muP, a potent tool for transferring hyperparameters tuned on tiny models to larger, more costly training runs. This study investigates how to achieve the same for sparse models, yielding significantly better training results and lower compute costs.
- Exploring Color Invariance through Image-Level Ensemble Learning. To address color bias in computer vision, researchers have created a novel learning technique called Random Color Erasing. By selectively excluding color information from training data, this technique strikes a balance between the significance of color and other parameters, producing models that perform better in challenging situations like industrial and wide-area surveillance.
- Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models. Conifer enhances LLMs’ comprehension of intricate instructions by utilizing a progressive learning methodology and a customized dataset.
- LLM Merging Competition: Building LLMs Efficiently through Merging. Sakana AI is sponsoring the LLM Merging challenge at NeurIPS this year.
- Tribeca to Screen AI-Generated Short Films Created by OpenAI’s Sora. Short films generated by artificial intelligence are popping up at more and more film festivals, and the largest event yet is dedicating an entire section to AI-generated movies.
- Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning. A technique called InvariantSelectPR is intended to make Large Multimodal Models (LMMs) more adaptive in domain-specific fields such as healthcare.
- TAIA: Large Language Models are Out-of-Distribution Data Learners. A technique called TrainAllInfAttn improves the performance of large language models in specialized domains where data is scarce.
- MegActor: Harness the Power of Raw Video for Vivid Portrait Animation. A new model called MegActor uses unprocessed driving videos to create more lifelike portrait animation. It addresses identity leakage and background interference, producing remarkable results with a novel data-creation framework and background-encoding approach.
- MeshXL: Neural Coordinate Field for Generative 3D Foundation Models. MeshXL is a new model that generates high-quality 3D meshes.
- Position-Guided Prompt Learning for Anomaly Detection in Chest X-Rays. Position-guided Prompt learning method for Anomaly Detection in chest X-rays (PPAD). PPAD leverages learnable text prompts and image prompts to minimize the gap between pre-training data and task-specific data. Through position-guided prompts, the model can focus on various regions, simulating the diagnostic process of experts.
- Tree Diffusion: Diffusion Models For Code. A wonderful diffusion paper that diffuses over the code that draws an image. Because editing happens as part of the diffusion process, the model can edit programs directly. Although it is slow, it can easily be combined with search to significantly boost reasoning capability.
- Improved Techniques for Optimization-Based Jailbreaking on Large Language Models. Expanding upon the Greedy Coordinate Gradient (GCG) approach, researchers have enhanced methods for optimization-based jailbreaking of large language models.
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation. A training-free video interpolation technique for generative video diffusion models has been developed by researchers. This novel method improves frame rates without requiring a lot of training or big datasets and works with different models.
- A whole-slide foundation model for digital pathology from real-world data. Prov-GigaPath, a whole-slide pathology foundation model pre-trained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides. To train Prov-GigaPath, we propose GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. We further demonstrate the potential of Prov-GigaPath on vision–language pretraining for pathology by incorporating the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model that achieves state-of-the-art performance on various digital pathology tasks, demonstrating the importance of real-world data and whole-slide modeling.
- DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models. A clever approach to improving texture generation for 3D objects: given a 3D model, DreamMat generates standard PBR maps such as metallic, roughness, and albedo, producing very appealing results.
- LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing. To address classification problems in large language models (LLMs), researchers have developed LlamaCare, a fine-tuned LLM for medical knowledge, in conjunction with Extended Classification Integration (ECI).
- XRec: Large Language Models for Explainable Recommendation. XRec is a model-agnostic framework that improves explainable recommender systems by leveraging the language capabilities of large language models.
- MetaMixer Is All You Need. Using simple convolutions, researchers have created a novel method called FFNification that preserves the query-key-value structure while converting self-attention processes into more effective token mixers.
- GrootVL: Tree Topology is All You Need in State Space Model. By dynamically constructing a tree topology based on spatial correlations and input information, GrootVL is a network that enhances state space models.
- ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization. To improve visual geo-localization (VG) and boost its performance in applications such as SLAM, augmented reality, and autonomous driving, researchers have created a new two-stage training process.
- ReLUs Are Sufficient for Learning Implicit Neural Representations. Researchers revisit the use of ReLU activation functions for learning implicit neural representations (INRs). They counter spectral bias by introducing simple constraints on ReLU neurons, inspired by second-order B-spline wavelets.
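For readers who want to see the counting idea from the CoPE paper (first item above) in concrete form, here is a minimal sketch under my own simplifying assumptions (single head, sigmoid gates, no interpolation of position embeddings); it is an illustrative reading of the method, not the authors’ code.

```python
import torch

def cope_positions(q, k):
    """Context-dependent positions: count gated tokens instead of raw offsets.

    Each key j gets a gate in (0, 1) from the query-key product, and the
    'position' of key j relative to query i is the sum of gates over tokens
    j..i, so the model effectively counts only the tokens that matter (e.g.
    nouns or sentence boundaries). The resulting fractional positions would
    then index interpolated position embeddings (omitted here).
    q, k: (seq, dim) tensors.
    """
    gates = torch.sigmoid(q @ k.T)                       # (seq, seq)
    gates = gates * torch.tril(torch.ones_like(gates))   # causal: only past tokens count
    # reverse cumulative sum over keys gives pos[i, j] = sum of gates[i, j..i]
    return gates.flip(-1).cumsum(dim=-1).flip(-1)

q, k = torch.randn(6, 16), torch.randn(6, 16)
print(cope_positions(q, k).shape)  # torch.Size([6, 6])
```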
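Similarly, for the arithmetic-embeddings paper above, here is a rough sketch of the “position within the number” idea, with hypothetical names of my own (the class name, `max_digits`, and the masking convention are assumptions, not the paper’s implementation):

```python
import torch
import torch.nn as nn

class DigitOffsetEmbedding(nn.Module):
    """Adds an embedding indexed by each digit's offset from the start of its
    number, so digits of equal significance can be aligned across operands;
    non-digit tokens receive a zero vector. Illustrative sketch only."""
    def __init__(self, max_digits=128, dim=64):
        super().__init__()
        self.table = nn.Embedding(max_digits, dim)

    def forward(self, digit_mask):
        # digit_mask: (seq,) bool tensor, True where the token is a digit
        offsets, count = [], 0
        for is_digit in digit_mask.tolist():
            count = count + 1 if is_digit else 0
            offsets.append(max(count - 1, 0))
        emb = self.table(torch.tensor(offsets))
        return emb * digit_mask.unsqueeze(-1).float()

mask = torch.tensor([True, True, True, False, True, True])  # e.g. "123+45"
print(DigitOffsetEmbedding()(mask).shape)  # torch.Size([6, 64])
```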
News
- OpenAI Is Restarting Its Robotics Research Group. The San Francisco-based company has been a pioneer in generative artificial intelligence and is returning to robotics after a three-year break.
- AI Overviews: About last week. Google built AI Overviews to improve search results and give users more precise and relevant information, particularly for complex queries. There were problems, such as incorrect results and misinterpreted content, but Google has addressed them with over a dozen technical updates, including better detection of nonsensical queries and less reliance on user-generated content in AI Overviews.
- Nvidia is said to be prepping an AI PC chip with Arm and Blackwell cores. Competition could be heating up in the Windows on Arm space amid talk in the industry that Nvidia is readying a chip pairing next-gen Arm cores with its Blackwell GPU architecture.
- Ex-OpenAI board member reveals what led to Sam Altman’s brief ousting. In a recent interview, former OpenAI board member Helen Toner offered fresh insight into the circumstances surrounding CEO Sam Altman’s November dismissal. The board reportedly learned about the release of ChatGPT via Twitter. According to Toner, Altman had repeatedly lied to the board, allegedly misrepresenting events within the organization for years and withholding facts. His lies made it difficult for the board to make decisions, and they concluded he wasn’t the best person to take the company to AGI.
- AI hardware firm Nvidia unveils next-gen products at Taiwan tech expo. CEO Jensen Huang tells packed stadium in Taipei ‘next Industrial Revolution has begun’.
- AMD unveils new AI chips to compete with Nvidia. AMD has been vying to compete against Nvidia, which currently dominates the lucrative market for AI semiconductors and commands about 80% of it.
- Anthropic’s Claude 3 Opus and tool use are generally available on Vertex AI. Google Cloud now offers Claude 3 Opus with tool use along with the smaller models as part of its Vertex AI offering.
- State Space Duality (Mamba-2). Mamba is an efficient state space model. Its team has released a second version, accompanied by a long and thorough explanation of the model and its improvements.
- No physics? No problem. AI weather forecasting is already making huge strides. With AI models like WindBorne’s WeatherMesh, which leverages the extensive ERA5 dataset to outperform conventional models while using much less processing power, the weather forecasting industry is undergoing a transformation.
- Amazon’s Project PI AI looks for product defects before they ship. Project PI combines computer vision and generative AI to catch damaged items and prevent returns.
- The Opaque Investment Empire Making OpenAI’s Sam Altman Rich. Sam Altman is one of Silicon Valley’s most active and successful individual investors. At the start of this year, his investment empire was worth at least $2.8 billion, and a large portion of the portfolio is unknown. This article walks readers through what is known about Altman’s holdings.
- Even the Raspberry Pi is getting in on AI. Raspberry Pi partnered with Hailo to provide an optional AI add-on to its microcomputers.
- Using AI to decode dog vocalizations. Leveraging a human speech model to identify different types of barks. University of Michigan researchers are exploring the possibilities of AI, developing tools that can identify whether a dog’s bark conveys playfulness or aggression.
- The future is … sending AI avatars to meetings for us, says Zoom boss. Eric Yuan suggests technology is five or six years away and will free up time to spend with family
- AI researchers build ‘future self’ chatbot to inspire wise life choices. Scientists at MIT hope talking to 60-year-old self will shift thinking on health, money, and work
- Cartwheel generates 3D animations from scratch to power up creators. Animating a 3D character from scratch is generally both laborious and expensive, requiring the use of complex software and motion capture tools.
- Mistral launches fine-tuning API. Mistral has launched customization for its models via its platform and API.
- If you aren’t seeing AI Overviews in your search results, it’s probably thanks to Google. After receiving heavy criticism since their mid-May public launch, AI Overviews in Google Search have dropped in visibility across search results. Since I/O, the average percentage of queries where AI Overviews appear has dropped from 27 percent to just 11 percent. Despite the reduction, healthcare-related queries still trigger AI Overviews a large percentage of the time, raising concerns about both accuracy and reliability.
- Google optimizes shipping routes. Google’s operations research group improved the mathematical optimization of cargo shipping routes, finding a 13% reduction in fuel costs and consumption.
- BrightEdge Releases Post Google I/O Data on The Impact of AI Overviews. New research from BrightEdge’s Generative Parser reveals which industries are most affected by AI Overviews, what triggers them, and where Google automatically anticipates and answers search queries.
- Nvidia emails: Elon Musk diverting Tesla GPUs to his other companies. Elon Musk is once again being accused of diverting Tesla resources to his other companies, this time high-end H100 GPU clusters from Nvidia.
- Securing Research Infrastructure for Advanced AI. In its description of the security architecture of its AI training supercomputers, OpenAI highlights the use of Azure-based infrastructure and Kubernetes for orchestration to safeguard critical model weights and other assets.
- Extracting Concepts from GPT-4. The team at OpenAI has discovered 16 million interpretable features in GPT-4 including price increases, algebraic rings, and who/what correspondence. This is a great step forward for SAE interpretability at scale. They shared the code in a companion GitHub repository.
- Mesop: Gradio Competition. Google has released Mesop, a rival to the popular AI prototyping framework Gradio. Mesop is less mature than Gradio, but it is pure Python and slightly more composable.
- Nvidia is now more valuable than Apple at $3.01 trillion. The AI boom has pushed Nvidia’s market cap high enough to make it the second most valuable company in the world.
Resources
- An Introduction to Vision-Language Modeling. we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them.
- Aya 23: Open Weight Releases to Further Multilingual Progress. A family of multilingual language models supporting up to 23 languages; by deliberately concentrating on fewer languages and allocating greater capacity to them, it performs better on those languages than massively multilingual models.
- Financial Statement Analysis with Large Language Models. Claims that by analyzing trends and financial ratios, LLMs can produce valuable insights; demonstrates that GPT-4 outperforms more specialized models; and develops a profitable trading strategy based on GPT’s predictions.
- SimPO: Simple Preference Optimization with a Reference-Free Reward. A simpler, more efficient method for preference optimization with a reference-free reward: it uses the average log probability of a sequence as an implicit reward (i.e., no reference model required), which makes it more compute- and memory-efficient. SimPO outperforms other methods such as DPO and is claimed to yield the strongest 8B open-source model (a minimal loss sketch appears after this list).
- Experimenting with local alt text generation. Mozilla has trained a model that runs in the browser and can automatically generate alt text for web images.
- Mora: More like Sora for Generalist Video Generation. Mora is a multi-agent framework designed to facilitate generalist video generation tasks, leveraging a collaborative approach with multiple visual agents. It aims to replicate and extend the capabilities of OpenAI’s Sora.
- FABRIC: Personalizing Diffusion Models with Iterative Feedback. FABRIC (Feedback via Attention-Based Reference Image Conditioning) is a technique to incorporate iterative feedback into the generative process of diffusion models based on Stable Diffusion.
- KL is All You Need. KL divergence is a fast, cheap, and effective way of measuring a certain kind of distance between probability distributions, widely used in both classical and modern AI. This piece examines the idea both mathematically and graphically (a tiny numerical example follows this list).
- Ways AI-Native Companies Can Improve User Retention. A playbook for founders and product executives, with examples of how businesses like Perplexity, Civit, Lapse, Omnivore, and others improve retention.
- FineWeb: decanting the web for the finest text data at scale. The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. Recently, we released 🍷 FineWeb, a new, large-scale (15-trillion tokens, 44TB disk space) dataset for LLM pretraining. FineWeb is derived from 96 CommonCrawl snapshots and produces better-performing LLMs than other open pretraining datasets.
- An entirely open-source AI code assistant inside your editor. Continue enables you to easily create your own coding assistant directly inside Visual Studio Code and JetBrains with open-source LLMs. All this can run entirely on your own laptop or have Ollama deployed on a server to remotely power code completion and chat experiences based on your needs.
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. A popular benchmark for reasoning tasks is MMLU. It is frequently seen as the gold standard and as something that models overfit. A new, more rigorous, and refined benchmark called MMLU Pro is used to gauge language model reasoning.
- Omost. Omost gives you control over how your images are generated. It comes from the same designer as ControlNet. It first rewrites prompts into a collection of descriptive code, then renders the final image from that code. Crucially, you can modify the code either before or after generation to subtly alter the model’s output.
- Control-GIC. A novel generative image compression framework called Control-GIC enables fine-grained bitrate modification while preserving high-quality output.
- LLM inference speed of light. Using the theoretical speed of light modeling as grounding is extremely significant for problems where the amount of computation and memory access is known a priori as it helps assess the quality of implementations and predict the impact of architectural modifications.
- Neural Surface Reconstruction. Without the need for 3D supervision, GenS is an end-to-end generalizable neural surface reconstruction model that performs exceptionally well at reconstructing surfaces from multi-view images.
- MatMul-Free LM. Even at the billion-parameter scale, researchers have managed to remove matrix multiplication (MatMul) from large language models without sacrificing performance.
- stable-audio-open-1.0. Stability AI has released the weights for Stable Audio, which was trained on permissively licensed audio samples to produce sound effects.
- CV-VAE: A Compatible Video VAE for Latent Generative Video Models. With its spatio-temporally compressed latent spaces, CV-VAE is a video VAE that works with current image and video models to efficiently train new ones utilizing pre-trained ones.
- Qwen2. Pretrained and instruction-tuned models in 5 sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B; trained on data in 27 additional languages besides English and Chinese; state-of-the-art performance in a large number of benchmark evaluations.
- Dragonfly: A large vision-language model with multi-resolution zoom. The team is also launching two new open-source models: Llama-3-8b-Dragonfly-v1, a general-domain model trained on 5.5 million image-instruction pairs, and Llama-3-8b-Dragonfly-Med-v1, further finetuned on additional biomedical image-instruction data. Dragonfly demonstrates promising performance on vision-language benchmarks like commonsense visual QA and image captioning, and Dragonfly-Med outperforms prior models, including Med-Gemini, on multiple medical imaging tasks, showcasing its capabilities for high-resolution medical data.
- MMLU Pro. MMLU has long been the industry standard for assessing knowledge and reasoning in language models; MMLU-Pro is its more challenging successor.
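As a pointer for the SimPO entry above, here is a minimal sketch of a SimPO-style loss, assuming you already have summed token log-probabilities and response lengths; the beta and gamma values are illustrative defaults, not the paper’s tuned settings:

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO-style objective: the implicit reward is the length-normalized
    (average) log-probability of a response under the policy model, so no
    reference model is needed; gamma is a target reward margin."""
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()

# toy example with made-up summed log-probs and lengths
loss = simpo_loss(torch.tensor([-30.0]), torch.tensor([-48.0]),
                  torch.tensor([20.0]), torch.tensor([24.0]))
print(loss.item())
```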
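And for the KL-divergence piece above, a tiny self-contained example of computing KL(p || q) for discrete distributions, just to make the asymmetry concrete:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): the extra nats needed to encode samples from p with a code
    optimized for q; asymmetric, and zero only when p equals q."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p))  # not equal: KL is asymmetric
```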
Perspectives
- Beyond the Cloud: Distributed AI and On-Device Intelligence. The transition of AI workflows from the cloud to the edge with specialized chip infrastructure & models, multi-modality, and ambience across devices
- Sure, Google’s AI overviews could be useful — if you like eating rocks. The company that shaped the development of search engines is banking on chatbot-style summaries. But so far, its suggestions are pretty wild
- AI’s Communication Revolution: We’re All Talking to Computers Now. With its real-time integration of text, vision, and audio, OpenAI’s GPT-4o is driving a revolution in communication through AI. As a result, human-to-AI communication has become a fundamental form of digital connection and has the potential to bring about substantial societal changes as well as the emergence of new companies focused on AI-centric communication. This transition makes it possible for more natural interactions with AI.
- A Right to Warn about Advanced Artificial Intelligence. A group of AI workers, both present and past, is pleading with advanced AI companies to adopt values that guarantee openness and safeguard workers who voice concerns about risks. They emphasize how important it is for businesses to refrain from enforcing non-disparagement agreements, to make anonymous reporting procedures easier, to encourage candid criticism, and to shield whistleblowers from reprisals.
- Will Scaling Solve Robotics? The Conference on Robot Learning, which included 11 workshops and nearly 200 submitted papers, drew over 900 attendees last year. One of the main points of contention throughout the event was whether robotics problems can be tackled by training a huge neural network on a large dataset. To help readers better understand the debate, this piece lays out the opposing viewpoints. Scaling has been successful in several related domains; however, it may not be feasible for robotics because there is a lack of readily available robotics data and no obvious way of obtaining it, and even if scaling performs as well as it does in other domains, it is probably not going to solve robotics.
- Plentiful, high-paying jobs in the age of AI. Due to comparative advantage, it is plausible that many of the professions humans currently perform will continue to be performed by humans indefinitely, regardless of how much better AIs become at those tasks.
- What I learned from looking at 900 most popular open source AI tools. The goal of this study of open-source AI repositories is to provide readers with a broad overview of the intimidating AI ecosystem.
- Meta AI system is a boost to endangered languages — as long as humans aren’t forgotten. Automated approaches to translation could provide a lifeline to under-resourced languages, but only if companies engage with the people who speak them.
- Misinformation poses a bigger threat to democracy than you might think. In today’s polarized political climate, researchers who combat mistruths have come under attack and been labeled as unelected arbiters of truth. But the fight against misinformation is valid, warranted, and urgently required.
- Is AI misinformation influencing elections in India? A sample of roughly two million WhatsApp messages highlights urgent concerns about the spread and prevalence of AI-generated political content.
- I’m Bearish OpenAI. A shift toward products and a research brain drain should ring your alarm bells
- The future of foundation models is closed-source. If the centralizing forces of data and computing hold, open- and closed-source AI cannot both dominate long-term
- A Grand Unified Theory of the AI Hype Cycle. Over the years, the AI sector has gone through multiple hype cycles, each of which produced genuinely useful technology that outlasted the hype. Rather than following an exponential process, every cycle follows a sigmoid one. There is an inevitable limit to any technology development strategy, and it is not too difficult to find. Although this AI hype cycle is unlike any that came before it, it will probably follow the same trajectory.
- Hi, AI: Our Thesis is on AI Voice Agents. The current state of AI speech agents is described in a blog post and deck created by Andreessen Horowitz, along with potential areas for advancement and investment. It outlines the present state of the B2B and B2C application layer landscape and covers the top infrastructure stack.
Medium articles
A list of the Medium articles I have read and found the most interesting this week:
- LucianoSphere (Luciano Abriata, PhD), “Sparks of Chemical Intuition” — and Gross Limitations! — in AlphaFold 3
- LuxinZ, KAN it rediscovery gravity?
- Valerie, How To Create Images that Sound With Diffusion Models
- Konstantin Rink, Supercharging LLMs with Fresh Data
- Sam Vaseghi, The Vanguard Quartet: The Four Newest Blockbuster Books on AI Advances
- Matt Nguyen, Building CLIP From Scratch
- Vinodh Kumar Ravindranath, An elegant (yet simple) technique to improve RAG quality
- Mandar Karhade, MD. PhD., How to Optimize Chunk Size for RAG in Production?
Meme of the week
What do you think? Did some news capture your attention? Let me know in the comments.
If you have found this interesting:
You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository, which contains weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.
or you may be interested in one of my recent articles: