WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES
AI & ML news: Week 20–26 May
Google AI summary issues, Apple's new features, OpenAI pausing its new model's voice, and much more
The most interesting news, repository, articles, and resources of the week
Check and star this repository where the news will be collected and indexed:
You will find the news first in GitHub. Single posts are also collected here:
Research
- LoRA Learns Less and Forgets Less. LoRA is a popular technique for fine-tuning models to add style or domain expertise. This paper examines the trade-off between learning and forgetting when using LoRA: it learns less than full fine-tuning but retains more of the base model's original "out of distribution" performance.
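The mechanics behind that trade-off are simple to sketch. Below is an illustrative NumPy toy (not the paper's code, and with made-up sizes): the pretrained weight W stays frozen, and only a low-rank update (alpha/r)·BA is trained, which is why LoRA has less capacity to learn but also less room to overwrite what the base model already knows.

```python
import numpy as np

d_out, d_in, r, alpha = 512, 512, 8, 16  # illustrative sizes

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # trainable, initialized to zero

def lora_forward(x):
    # Base path plus scaled low-rank update: (W + (alpha / r) * B @ A) @ x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the adapted model starts identical to the base
assert np.allclose(lora_forward(x), W @ x)

full_params = W.size
lora_params = A.size + B.size
print(lora_params / full_params)  # 0.03125 — ~3% of the full layer's weights are trained
```

Because only A and B receive gradients, the update lives in a rank-r subspace of the full weight space, which is the geometric intuition behind both the reduced learning and the reduced forgetting.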
- Chameleon: Mixed-Modal Early-Fusion Foundation Models. Like GPT-4o, Meta's Chameleon is a natively multimodal model that handles text and images simultaneously, and it outperforms many other models. It shows how far the Meta team's internal modeling work has advanced.
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. The technical report for Google's latest model family has been updated. While details about the models and training data are sparse, there is a wealth of information about the evaluations and safety measures, offering an intriguing glimpse into large-scale alignment.
- Introducing the Frontier Safety Framework. Google DeepMind unveiled the Frontier Safety Framework to mitigate risks from upcoming advanced AI models. The framework assesses models against critical capability levels (CCLs) for potentially dangerous capabilities and applies mitigation techniques when thresholds are crossed.
- ART3D: 3D Gaussian Splatting for Text-Guided Artistic Scenes Generation. AI can be creatively and entertainingly used to generate artistic 2D visuals. This work uses text-guided Gaussian Splatting to bring that capacity to 3D.
- Grounded 3D-LLM with Referent Tokens. Localizing objects in a 3D scene is difficult. This work uses referent tokens for language-guided 3D understanding, attaching semantic labels to objects in 3D space.
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation. LeMeViT is a novel method that uses learnable meta tokens to lower the computational costs associated with Vision Transformers. By effectively capturing important data, these tokens accelerate inference.
- Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers. Researchers have identified a new security risk for Vision Transformers. The attack, known as SWARM, is especially stealthy and harmful to users because it discreetly activates backdoor behavior in a model via a "switch token".
- Mapping the Mind of a Large Language Model. Anthropic reports a significant advance in understanding the inner workings of AI models: it has identified how millions of concepts are represented inside Claude Sonnet, one of its deployed large language models. This is the first detailed look inside a modern, production-grade large language model, and the interpretability discovery could, in the future, help make AI models safer.
- Smart Expert System: Large Language Models as Text Classifiers. Text classification is a fundamental task in Natural Language Processing (NLP), and the advent of Large Language Models (LLMs) has revolutionized the field. This paper introduces the Smart Expert System, a novel approach that leverages LLMs as text classifiers.
- CSTA: CNN-based Spatiotemporal Attention for Video Summarization. This project presents a novel CNN-based SpatioTemporal Attention (CSTA) technique to improve video summarization. Unlike conventional attention mechanisms, CSTA uses a 2D CNN to efficiently capture the visual meaning of frames and understand the relationships and key features in videos.
- Microsoft introduces Phi-Silica, a 3.3B parameter model made for Copilot+ PC NPUs. Microsoft is investing further in the development of small language models (SLMs). At its Build developer conference, the company announced the general availability of its Phi-3 models and previewed Phi-3-vision. On the heels of Microsoft's Copilot+ PC news, it is also introducing an SLM built specifically for these devices' powerful Neural Processing Units (NPUs).
- Aurora: A Foundation Model of the Atmosphere. By training a foundation model for atmospheric prediction, Microsoft has achieved new state-of-the-art results on five- and ten-day global weather forecasts.
- MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark. A new benchmark called MathBench aims to give a comprehensive evaluation of the mathematical capabilities of large language models.
- Wav-KAN: Wavelet Kolmogorov-Arnold Networks. Wav-KAN is a neural network framework that leverages wavelet functions to enhance performance and interpretability, according to research. Wav-KAN captures both high-frequency and low-frequency data components, which speeds up training and boosts robustness in contrast to standard models.
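For intuition, the Ricker ("Mexican hat") function is one common mother wavelet that such a network could place on its edges in place of a fixed activation. The snippet below is an illustrative sketch, not the paper's code; scale and translation stand in for what would be learnable parameters.

```python
import numpy as np

def mexican_hat(t, scale=1.0, translation=0.0):
    """Ricker ('Mexican hat') mother wavelet. In a Wav-KAN-style network,
    scale and translation would be learned per edge; here they are fixed."""
    x = (t - translation) / scale
    c = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)  # L2 normalization constant
    return c * (1.0 - x ** 2) * np.exp(-x ** 2 / 2.0)

t = np.linspace(-5, 5, 1001)
psi = mexican_hat(t)
print(round(psi[500], 4))  # peak at t=0 is the constant c ≈ 0.8673
```

The wavelet's zero mean and localized support are what let the network separate low-frequency trends from high-frequency detail, the property the paper credits for faster training and better robustness.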
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding. ProtT3, a novel framework that combines conventional Language Models (LMs) with Protein Language Models (PLMs) to improve text-based protein understanding, is presented by researchers. Using a cross-modal projector known as Q-Former, ProtT3 combines a PLM for analyzing amino acid sequences with a language model to produce high-quality textual descriptions.
- Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images. In order to better explain how the environment changes over time, a new probabilistic diffusion model for Remote Sensing Image Change Captioning (RSICC) is presented in this study.
News
- First companies sign up for AI safety standards on the eve of the Seoul summit. Rishi Sunak says 16 international firms have committed, but standards have been criticized for lacking teeth
- World is ill-prepared for breakthroughs in AI, say experts. Governments have made insufficient regulatory progress, ‘godfathers’ of the technology say before the summit
- Productivity soars in sectors of the global economy most exposed to AI, says the report. Employers in the UK, one of the 15 countries studied, are willing to pay a 14% wage premium for jobs requiring AI skills
- ChatGPT suspends Scarlett Johansson-like voice as actor speaks out against OpenAI. OpenAI says ‘Sky’ is not an imitation of an actor’s voice after users compare it to an AI companion character in the film Her
- $16k G1 humanoid rises up to smash nuts, twist and twirl. Humanoid development at Chinese robotics company Unitree continues apace. Following its entry into the melee just last year, its fast-walking H1 bot recently got its backflip groove on. Now the faceless and hand-less humanoid is being joined by an impressive all-rounder.
- Google I/O 2024: Here’s everything Google just announced. Google kicked off its annual developer conference with a rapid-fire stream of announcements, unveiling much of what it has been working on recently. Brian already kicked us off by sharing what we were expecting.
- Gamma raised $12M in Series A funding to reimagine presentations, powered by AI. Gamma received $12 million from Accel to use AI to reinvent presentations. Over 18 million people have contributed over 60 million Gammas (AI-generated slides) to date.
- Inflection AI reveals new team and plans to embed emotional AI in business bots. Inflection AI unveiled its new leadership team, composed of seasoned Silicon Valley veterans.
- Scarlett Johansson says Altman insinuated that AI soundalike was intentional. OpenAI has paused Sky, a voice option for GPT-4o in ChatGPT, after backlash accusing the AI company of intentionally ripping off Scarlett Johansson’s critically acclaimed voice-acting performance in the 2013 sci-fi film Her.
- Perplexity CEO Aravind Srinivas takes shots at Google. Google’s planned roll-out of AI-summarized search results doesn’t faze Perplexity AI CEO and co-founder Aravind Srinivas — whose startup has offered a popular AI-driven search tool providing similar digests for nearly two years.
- Google still hasn’t fixed Gemini’s biased image generator. Back in February, Google paused its AI-powered chatbot Gemini’s ability to generate images of people after users complained of historical inaccuracies. Well, the problem’s likely more complex than Hassabis alluded to.
- SoundHound AI and Perplexity Partner to Bring Online LLMs to Next-Gen Voice Assistants Across Cars and IoT Devices. Perplexity’s capabilities added to SoundHound Chat AI will respond to questions conversationally with real-time knowledge from the web
- Stability AI discusses sale amid cash crunch, The Information reports. Artificial Intelligence startup Stability AI held discussions with at least one potential buyer in recent weeks about a sale as it faces a cash crunch, The Information reported on Wednesday, citing a person involved in the talks.
- Scale AI raises $1B. Accel and earlier investors provided the gigantic Series F. There is huge demand for Scale’s services, and the company is uniquely positioned to keep driving the current AI data surge.
- Elon Musk’s xAI is working on making Grok multimodal. Users may soon be able to input images into Grok for text-based answers.
- Google CEO Sundar Pichai on AI-powered search and the future of the web. The head of Google sat down with Decoder last week to talk about the biggest advancements in AI, the future of Google Search, and the fate of the web.
- Apple announces new accessibility features, including Eye Tracking, Music Haptics, and Vocal Shortcuts. Apple today announced new accessibility features coming later this year, including Eye Tracking, a way for users with physical disabilities to control iPad or iPhone with their eyes.
- Microsoft announces $3.3 billion investment in Wisconsin to spur artificial intelligence innovation and economic growth. Microsoft today announced a broad investment package designed to strengthen the role of Southeast Wisconsin as a hub for AI-powered economic activity, innovation, and job creation. These investments include $3.3B in cloud computing and AI infrastructure, the creation of the country’s first manufacturing-focused AI co-innovation lab, and an AI skilling initiative to equip more than 100,000 of the state’s residents with essential AI skills.
- ElevenLabs has launched a free iPhone app that speaks text on the screen — 11 voices and PDF capabilities available. The unicorn startup ElevenLabs, best known for its AI dubbing site, has launched its first public app.
- The US Congress is taking on AI — this computer scientist is helping. Kiri Wagstaff, who temporarily shelved her academic career to provide advice on federal AI legislation, talks about life inside the halls of power.
- OpenAI Partners with News Corp. News Corp, which publishes the WSJ, NYP, The Times, and other outlets, has partnered with OpenAI to make News Corp’s news content available on OpenAI’s platform, which they say will improve the accuracy and usefulness of generated answers.
- Stanford HAI Releases Updated Foundation Model Transparency Index. The most recent version of Stanford HAI’s Foundation Model Transparency Index, which assesses the transparency of 14 significant AI developers, including Google and OpenAI, was released. These businesses showed a considerable improvement and readiness to engage in a dialogue about their models by disclosing fresh information that was not previously known to the public. The average transparency score was just 58 out of 100, indicating serious deficiencies in areas including downstream impact, data access, and model credibility despite these advancements.
- The ChatGPT desktop app is more helpful than I expected — here’s why and how to try it. Among OpenAI’s many big updates this week was a new ChatGPT app for MacOS. Here’s how to use it and when Windows users can get in on the fun.
- Suno has raised $125 million to build a future where anyone can make music. Suno, a platform for creating music, has raised $125 million to continue building a world in which anyone can make music.
- Nvidia reports stratospheric growth as the AI boom shows no sign of stopping. Chipmaker reports strong demand and higher-than-expected revenue even as other companies spend to develop their own chips
- Mistral AI and Harvey Partnership. Mistral has partnered with Harvey, a legal AI company. Although the announcement offers few specifics, the two will likely collaborate on a custom legal model.
- French AI startup H raises $220M seed round. H, a startup based in Paris and previously known as Holistic AI, announced a $220 million seed round just a few months after the company’s inception.
- Reflections on our Responsible Scaling Policy. With an emphasis on continuous improvement and cooperation with business and government, Anthropic’s Responsible Scaling Policy attempts to prevent catastrophic AI safety failures by identifying high-risk capabilities, testing models often, and enforcing tight safety requirements.
- Introducing Aya. A global initiative led by Cohere For AI involving over 3,000 independent researchers across 119 countries. Aya is a state-of-the-art model and dataset, that pushes the boundaries of multilingual AI for 101 languages through open science.
- PaliGemma: An Open Multimodal Model by Google. PaliGemma is a vision language model (VLM) developed and released by Google with multimodal capabilities. Unlike other VLMs, such as OpenAI’s GPT-4o, Google Gemini, and Anthropic’s Claude 3, which have struggled with object detection and segmentation, PaliGemma offers a wide range of abilities, paired with the ability to fine-tune for better performance on specific tasks.
- Casper Labs Announces AI Governance Solution, Prove AI. In an effort to improve enterprise AI applications’ auditability and transparency, Casper Labs has launched Prove AI, a joint venture with IBM.
- Google AI search tool reportedly tells users to jump off a bridge and eat rocks. Firm’s AI overviews feature has been rolled out to users in US, but many have reported strange responses
Resources
- model-explorer. A new model explorer from Google makes it simple to visualize the computation graph of your models. It may be useful for performance engineering and debugging.
- real-time inference demo for paligemma. You can run the latest Google VLM in real time using GPT-Fast. Given how simple the model is to fine-tune for specific tasks, this opens up a multitude of powerful downstream applications.
- Multi AI Agent Systems using OpenAI’s Assistants API (Experts.js). Experts.js is the easiest way to create and deploy OpenAI’s Assistants and link them together as Tools to create a Panel of Experts system with expanded memory and attention to detail.
- First-ever AI Code Interpreter for R. Julius is the leading generative AI tool for data analysis. Designed to perform statistical analysis, data science, and computational tasks, it combines cutting-edge foundational models like GPT-4o, Claude 3, and Gemini 1.5 with robust coding capabilities in Python and R.
- Moondream WebGPU. 1.86 billion parameter VLM (Vision-Language Model) that is optimized for inference on the web. Once downloaded, the model (1.8 GB) will be cached and reused when you revisit the page. Everything runs directly in your browser using 🤗 Transformers.js and ONNX Runtime Web, meaning your conversations aren’t sent to a server. You can even disconnect from the internet after the model has loaded!
- Devon: An open-source pair programmer. You can select different models for multi-file editing, codebase exploration, config writing, test writing, bug fixing, and architecture exploration.
- llama3 implemented from scratch. A walkthrough that implements Llama 3 from scratch, one tensor and matrix multiplication at a time, loading tensors directly from the model file that Meta provides for Llama 3.
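The heart of such a from-scratch walkthrough is one operation repeated layer after layer: causal scaled dot-product attention. As a rough illustration (assumed toy shapes, not the repo's actual code), a single attention head in NumPy looks like this:

```python
import numpy as np

def attention(q, k, v):
    """Single-head causal scaled dot-product attention, the core op
    repeated throughout a from-scratch Llama-style forward pass."""
    seq = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (seq, seq) similarity matrix
    # Causal mask: each position may only attend to itself and earlier tokens
    mask = np.where(np.arange(seq)[None, :] > np.arange(seq)[:, None],
                    -np.inf, 0.0)
    scores = scores + mask
    # Row-wise softmax (shifted by the row max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                        # (seq, head_dim)

rng = np.random.default_rng(0)
seq, head_dim = 4, 8
q, k, v = (rng.normal(size=(seq, head_dim)) for _ in range(3))
out = attention(q, k, v)
print(out.shape)  # (4, 8)
```

Because of the causal mask, the first row of the output equals v[0] exactly: position 0 can only attend to itself, which is a handy sanity check when building the real thing tensor by tensor.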
- PSG4D — 4D Panoptic Scene Graph Generation. The PSG4D (4D Panoptic Scene Graph Generation) Task is a novel task that aims to bridge the gap between raw visual inputs in a dynamic 4D world and high-level visual understanding. It involves generating a comprehensive 4D scene graph from RGB-D video sequences or point cloud video sequences.
- microsoft/Phi-3-medium-128k-instruct. The Phi-3-Medium-128K-Instruct is a 14B parameter, lightweight, state-of-the-art open model trained with the Phi-3 datasets that include both synthetic data and the filtered publicly available website data with a focus on high-quality and reasoning-dense properties.
- Debiasing Large Visual Language Models. A post-hoc debiasing method and a Visual Debias Decoding strategy. These strategies not only help minimize hallucinations but also contribute to more helpful and precise responses.
- DeepSeek-VL. An open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. DeepSeek-VL has general multimodal understanding capabilities and can process logical diagrams, web pages, formulas, scientific literature, natural images, and embodied intelligence in complex scenarios.
- MiniCPM-V. MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. Models take images and text as inputs and provide high-quality text outputs. Since February 2024, we have released 4 versions of the model, aiming to achieve strong performance and efficient deployment
- OLAPH: Improving Factuality in Biomedical Long-form Question Answering. A new benchmark dataset, MedLFQA, was created to improve the factual accuracy of long-form answers from large language models in the medical domain. OLAPH is a framework that uses preference optimization and automatic evaluations to teach LLMs to reduce errors.
- Tarsier. Tarsier, a new tool from Reworkd, visually tags webpage elements with brackets and IDs to help LLMs with web-interface tasks. Through OCR-generated text representations, Tarsier lets an LLM without vision comprehend a webpage’s structure, beating vision-language models on benchmarks.
- mistralai/Mistral-7B-Instruct-v0.3. The Mistral-7B-Instruct-v0.3 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.3.
- Distributed inference on llama.cpp. llama.cpp now supports distributed inference across several machines. Although it is currently restricted to FP16, this is a significant step for open-source model deployment.
- Enhancing Long-Term Memory for Language Models. A novel method called Streaming Infinite Retentive LLM (SirLLM) aids large language models in retaining lengthier memory over the course of lengthy conversations.
- Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering. Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs.
Perspectives
- The people charged with making sure AI doesn’t destroy humanity have left the building. If OpenAI can’t keep its own team together, what hope is there for the rest of the industry? Plus, AI-generated ‘slop’ is taking over the internet
- Spam, junk … slop? The latest wave of AI behind the ‘zombie internet’. Tech experts hope new term for carelessly automated AI webpages and images can illuminate its damaging impact
- As the AI world gathers in Seoul, can an accelerating industry balance progress against safety? Companies such as OpenAI and Meta push ahead, but it is clear that biggest changes are yet to come
- What happened to OpenAI’s long-term AI risk team? Former team members have either resigned or been absorbed into other research groups.
- What’s up with Llama 3? Arena data analysis. When it comes to open-ended creative tasks, Meta’s Llama 3–70B language model outperforms competitors in the English Chatbot Arena, but it struggles with more technical prompts. The analysis shows that Llama 3’s win rate drops as prompts get harder and that it excels at friendly, conversational responses. Even if Llama 3’s approachability helped it succeed, more research is needed to determine its true competitive advantage.
- ChatGPT can talk, but OpenAI employees sure can’t. OpenAI’s stringent exit agreement, which forbids former workers from criticizing the company on pain of forfeiting their vested equity, has come to light with the departures of Ilya Sutskever and Jan Leike. In response to the reporting, CEO Sam Altman said the provision would be corrected.
- AlphaFold3 — why did Nature publish it without its code? The latest iteration of the protein-structure-prediction algorithm AlphaFold has generated a great deal of interest since its release, accompanied by a paper in Nature, earlier this month. But its release has also prompted questions, and criticism, of both the AlphaFold team at Google DeepMind in London and Nature.
- China’s ChatGPT: what a boom in Chinese chatbots means for AI. ChatGLM is one of hundreds of AI language models being developed for the Chinese language. It comes close to ChatGPT on many measures, say its creators.
- The Old-Fashioned Library at the Heart of the A.I. Boom. OpenAI’s remodeled mayonnaise factory headquarters, with its library-themed interior design, is a symbol of the company’s success with ChatGPT, which focuses on language. On the other hand, the office reminds people of the current legal disputes around the use of copyrighted content in AI training. The library is seen as a place for inspiration by OpenAI employees, despite these disagreements, which supports their conviction that AI-driven and human creativity can work together harmoniously.
- Chaos and tension at OpenAI. Ilya Sutskever’s departure over concerns about OpenAI’s dedication to AI safety is worrying, especially given that three other key employees have also quit recently. The departures raise questions about how the company’s commercial push may affect its nonprofit status and safety-focused mission. These incidents might also influence legal and regulatory developments, drawing attention from stakeholders in Washington.
- AI is the reason interviews are harder now. This essay addresses how technical interview questions are becoming more complicated and how employers are expecting candidates to answer harder challenges faster. It emphasizes how non-technical users can benefit from using AI technologies like Ultracode to help them pass these kinds of interviews. The article recommends in-person interviews as a way to make sure applicants genuinely have the programming abilities required for the position.
- What I’ve Learned Building Interactive Embedding Visualizations. An enthusiast for interactive embedding visualizations describes their well-honed process for producing these kinds of visuals, which illustrate the complex relationships between items represented as points in high-dimensional embedding spaces. Data gathering, co-occurrence matrix construction, sparsification, PyMDE embedding, and 2D projection are the steps that yield a clear visual representation. The author uses web apps for the user interface, citing their accessibility and GPU-accelerated rendering capabilities.
Meme of the week
What do you think about it? Some news that captured your attention? Let me know in the comments
If you have found this interesting:
You can look for my other articles, and you can also connect or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.
or you may be interested in one of my recent articles: