WEEKLY AI NEWS: RESEARCH, NEWS, RESOURCES, AND PERSPECTIVES
AI & ML news: Week 9–15 December
The most interesting news, repositories, articles, and resources of the week
Check and star the repository where the news is collected and indexed. You will find the news first on GitHub; all the Weekly News stories are also collected here.
Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.
Research
- Genie 2: A large-scale foundation world model. A foundation world model generates playable 3D environments from single prompt images, offering endless training scenarios for AI agents with features like physics simulation, character animation, and object interactions. Genie 2, trained on video data using a combination of an autoencoder and a transformer, creates virtual worlds capable of real-time interactivity. A faster, lower-quality version is also available for immediate play.
- Reverse Thinking Makes LLMs Stronger Reasoners. Training LLMs in “reverse thinking” improves performance in commonsense, math, and logical reasoning tasks, reportedly surpassing standard fine-tuning methods trained on ten times more forward reasoning data.
- Towards Adaptive Mechanism Activation in Language Agent. A new framework enables language agents to automatically determine when to use various mechanisms (ReAct, CoT, Reflection, etc.) for task completion, improving on methods that rely on fixed or predefined strategies. The framework adaptively selects the appropriate mechanism based on the task’s characteristics. Experimental results show substantial improvements in downstream tasks, such as mathematical reasoning and knowledge-intensive reasoning.
- Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models. Auto-RAG is an autonomous iterative retrieval model that achieves outstanding performance across various datasets. It is a fine-tuned LLM that utilizes the decision-making abilities of an LLM to engage in multiturn dialogues with the retriever, systematically planning retrievals and refining queries to gather relevant information. This process continues until adequate external knowledge is obtained. The authors also demonstrate that the model can adjust the number of iterations based on question difficulty without requiring human intervention.
- Challenges in Human-Agent Communication. This work provides a detailed analysis of the main challenges in human-agent communication, emphasizing how humans and AI agents can build common ground and mutual understanding. It identifies 12 core challenges grouped into three categories: conveying information from agents to users, enabling users to communicate with agents, and overarching communication issues that impact all interactions.
- RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models. This work extends the rStar reasoning framework to improve the reasoning accuracy and factual reliability of LLMs. It integrates a Monte Carlo Tree Search (MCTS) framework with retrieval-augmented reasoning to generate multiple candidate reasoning trajectories. A retrieval-augmented factuality scorer then evaluates these trajectories for factual accuracy, selecting the one with the highest score as the final answer. RARE (powered by Llama 3.1) outperforms larger models like GPT-4 in medical reasoning tasks. On commonsense reasoning tasks, it surpasses Claude-3.5 Sonnet and GPT-4o-mini, achieving results comparable to GPT-4o.
- DataLab: A Unified Platform for LLM-Powered Business Intelligence. A unified business intelligence platform powered by LLM-based agents combines task planning, reasoning, and computational notebooks to optimize the entire BI workflow. The system achieves state-of-the-art performance on research benchmarks and significantly enhances accuracy and efficiency when applied to real enterprise data from Tencent. It delivers up to a 58.58% improvement in accuracy and a 61.65% reduction in token cost for enterprise-specific BI tasks.
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models. This study examines which documents in pretraining data influence model outputs, aiming to better understand the generalization strategies LLMs use for reasoning tasks. It finds that during reasoning, influential documents often contain procedural knowledge, such as examples of solving problems using formulae or code.
- Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video. By training an image encoder unsupervised on a single long walking video, this study illustrates how innovative model adjustments can lead to highly powerful representations.
- FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness. FlashAttention is a highly efficient software implementation of attention, designed to be hardware-aware and minimize unnecessary I/O. However, its complexity can make it difficult to grasp. This paper seeks to demystify and simplify the algorithm through diagrams and explanations.
- An Evolved Universal Transformer Memory. Sakana AI has introduced a transferable memory module that compresses attention information for seamless transfer between models. The module offers slight performance improvements on certain long-context benchmarks.
- MASK is All You Need. This work takes a step toward unifying autoregressive modeling and flow-based methods for data generation by using masking over discrete data as its generative objective. While the results are promising, they are currently demonstrated only on smaller-scale datasets.
- From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding. Dropout Decoding is a technique designed to enhance large vision-language models, effectively reducing errors such as object hallucinations in multimodal tasks.
- GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy. New AI model advances the prediction of weather uncertainties and risks, delivering faster, more accurate forecasts up to 15 days ahead
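The Auto-RAG item above describes a fine-tuned LLM that iterates between planning a retrieval, refining its query, and deciding when it has gathered enough evidence to answer. A minimal sketch of that loop is below; `llm_plan` and `retrieve` are hypothetical toy stand-ins for the paper's fine-tuned LLM and retriever, not the authors' code.

```python
# Hedged sketch of an Auto-RAG-style iterative retrieval loop (illustrative only).

def llm_plan(question, evidence):
    """Toy planner: decide whether to retrieve more, and with what query.
    A real system would call the fine-tuned LLM here."""
    if "capital" in question and not any("Paris" in e for e in evidence):
        return {"action": "retrieve", "query": "capital of France"}
    return {"action": "answer", "answer": "Paris"}

def retrieve(query):
    """Toy retriever returning canned passages."""
    corpus = {"capital of France": ["Paris is the capital of France."]}
    return corpus.get(query, [])

def auto_rag(question, max_iters=5):
    evidence = []
    for _ in range(max_iters):  # the iteration count adapts to question difficulty
        step = llm_plan(question, evidence)
        if step["action"] == "answer":
            return step["answer"], evidence
        evidence.extend(retrieve(step["query"]))  # refine query, gather knowledge
    return llm_plan(question, evidence).get("answer"), evidence

answer, used = auto_rag("What is the capital of France?")
```

The key design point is that the stopping condition lives inside the model's own planning step, so easy questions terminate after one retrieval while harder ones trigger more iterations, with no human-set threshold.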
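The RARE entry above generates multiple candidate reasoning trajectories and picks the one with the highest retrieval-augmented factuality score. A toy sketch of that selection step follows; the simple word-overlap scorer here is a hypothetical stand-in for the paper's LLM-based factuality scorer, and the example strings are made up for illustration.

```python
# Hedged sketch of RARE's trajectory-selection step (not the authors' implementation).

def factuality_score(trajectory, evidence):
    """Toy scorer: fraction of trajectory tokens that appear in retrieved evidence."""
    support = set(" ".join(evidence).lower().split())
    tokens = trajectory.lower().split()
    return sum(t in support for t in tokens) / max(len(tokens), 1)

def select_trajectory(candidates, evidence):
    """Return the candidate trajectory the scorer rates most factual."""
    return max(candidates, key=lambda t: factuality_score(t, evidence))

evidence = ["Aspirin inhibits the COX enzyme, reducing prostaglandin synthesis."]
candidates = [
    "aspirin inhibits the cox enzyme",    # well supported by the evidence
    "aspirin blocks dopamine receptors",  # poorly supported
]
best = select_trajectory(candidates, evidence)
```

In RARE itself the candidates come from an MCTS search and the scorer is itself retrieval-augmented; the structure, though, is the same: generate many trajectories, score each against retrieved facts, keep the argmax.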
News
- Facebook UK cut 700 staff and reduced tax bill last year, accounts show. 10% of Facebook’s UK workforce was axed while revenue fell slightly, but pre-tax profits rose despite an advertising slowdown
- US appeals court upholds law forcing sale or ban of TikTok. Decision is the latest twist in a years-long battle between the social media company and the US government
- Google CEO: AI development is finally slowing down — the low-hanging fruit is gone. Generative artificial intelligence probably won’t change your life in 2025 — at least, not more than it already has, according to Google CEO Sundar Pichai.
- Nobel recipient Geoffrey Hinton wishes he thought of AI safety sooner. Geoffrey Hinton says he doesn’t regret the work he did that laid the foundations of artificial intelligence, but wishes he thought of safety sooner.
- Landlords Are Using AI to Raise Rents — and Cities Are Starting to Push Back. If you’ve hunted for apartments recently and felt like all the rents were equally high, you’re not crazy: Many landlords now use a single company’s software — which uses an algorithm based on proprietary lease information — to help set rent prices.
- xAI’s Image Generator. xAI’s Aurora is an advanced image generation model integrated into Grok 2.
- OpenAI’s Reinforcement Fine-Tuning Research Program. We’re expanding our Reinforcement Fine-Tuning Research Program to enable developers and machine learning engineers to create expert models fine-tuned to excel at specific sets of complex, domain-specific tasks.
- OpenAI’s 12 days of ‘ship-mas’: all the new announcements. OpenAI’s 12 days of “ship-mas” have officially begun, with the company set to reveal some new features, products, and demos during all 12 days starting December 5th, just a few days shy of the second anniversary of ChatGPT’s explosive launch in 2022.
- AWS brings prompt routing and caching to its Bedrock LLM service. At its re:Invent conference in Las Vegas, AWS on Wednesday announced both of these features for its Bedrock LLM hosting service.
- OpenAI may launch Sora, its text-to-video model, very soon. OpenAI is set to launch new AI features, including a text-to-video tool called Sora and a reasoning model, during a 12-day livestream event. Sora has drawn criticism over data provenance, raising concerns about the possible use of YouTube content without authorization. Meanwhile, Google is working on its own text-to-video tool, Veo, which is currently in private review.
- Google’s new generative AI video model is now available. Google’s Veo, a generative AI video model, is now accessible to businesses through Vertex AI, enabling the creation of high-quality 1080p videos from text or images. It incorporates safeguards and DeepMind’s SynthID digital watermark to tackle issues related to copyright and misinformation. Additionally, Google has expanded access to Imagen 3 for text-to-image generation on Google Cloud, introducing new features for brand customization.
- Elon Musk’s xAI to Expand Colossus Supercomputer, Boosting Memphis as Emerging AI Hub. xAI is enhancing its Colossus supercomputer facility in Memphis by adding one million GPUs to boost its AI capabilities. This expansion positions Memphis as a potential global AI innovation hub, drawing interest from major companies like Nvidia and Dell. The Greater Memphis Chamber is backing this growth and has formed a dedicated team to accelerate xAI’s expansion.
- OpenAI and Anduril Partner on Defense AI Applications. OpenAI has collaborated with Anduril Industries to create AI-driven solutions for military use, with an emphasis on counter-drone defense systems.
- Meta quietly leans on rival GPT-4 despite Zuckerberg’s bold Llama claims. Even as Meta touts its Llama model, the company is incorporating OpenAI’s GPT-4 to enhance internal tools and philanthropic ventures.
- Google unveils ‘mindboggling’ quantum computing chip. Chip takes minutes to complete tasks that would otherwise take 10,000,000,000,000,000,000,000,000 years
- WaveForms $40M seed round. WaveForms is a pioneering audio AI company aiming to crack the Turing test for audio intelligence. Founded by Alexis Conneau, the mind behind ChatGPT’s Advanced Voice Mode, WaveForms has secured $40M in seed funding at a $200M valuation. The company’s mission is to push the boundaries of audio AI, enabling hyper-realistic voice interactions and redefining the future of auditory machine intelligence.
- Sora is here. OpenAI’s video generation model has launched and is available to Pro subscribers.
- LG’s new on device language models. LG has developed a suite of small AI models that demonstrate strong performance on standard benchmarks. These models are notably positioned as competitors to the Qwen series, highlighting their efficiency and capability in the evolving AI landscape.
- LLMs may have a killer enterprise app: ‘digital labor’ — at least if Salesforce Agentforce is any indicator. If Don Draper from “Mad Men” was, at his core, the quintessential ad man, then Salesforce CEO Marc Benioff is likewise a sales guy. Lately, he has been selling (or rather preaching the gospel of) AI agents and Salesforce’s recently released agent-maker platform, Agentforce.
- DeepMind’s GenCast AI is really good at forecasting the weather. DeepMind’s GenCast AI sets a new benchmark in weather forecasting, surpassing systems like ECMWF’s with notable gains in accuracy and efficiency. Powered by a diffusion model trained on 40 years of data, GenCast uses probabilistic predictions and operates with lower computational demands than traditional approaches. While it excels in general forecasts, it faces challenges in predicting hurricane intensity. Open-source and soon integrating with Google Earth, GenCast aims to revolutionize weather prediction accessibility.
- AI Helps Researchers Dig Through Old Maps to Find Lost Oil and Gas Wells. Undocumented orphaned wells pose hazards to both the environment and the climate. Scientists are building modern tools to help locate, assess, and pave the way for ultimately plugging these forgotten relics.
- Ai Pin maker Humane demos AI software for cars, phones, and smart speakers. Humane revealed CosmOS, an AI operating system that enhances tech devices with agent-like capabilities.
- ‘It’s beyond human scale’: AFP defends use of artificial intelligence to search seized phones and emails. Australian federal police says it has ‘no choice’ due to the vast amount of data examined in investigations
- ‘What does AI mean?’: Amazon reveals UK’s most asked Alexa questions of 2024. From football to food to Taylor Swift, many of the most common subjects were what you expect — but others less so
- Amazon AGI. The Adept team, alongside Pieter Abbeel, has established a new lab within Amazon focused on AGI development. Their work includes training advanced language and multimodal models, with a vision to integrate these technologies into AWS products.
- OpenAI Makes Canvas Available to Everyone. Canvas, OpenAI’s editing tool first launched in October, is now accessible to all users. The tool has been enhanced with features for receiving feedback and making edits through comments.
- Yelp releases new AI-powered discovery and connection features. Yelp’s end-of-year release rolls out new AI-powered Review Insights, enhancements to business discovery, and updates for more seamless connections with service pros, plus AI-enhanced ad optimization for business owners
- Growl is an AI interactive boxing coach to punch up your family workouts. Growl has secured $4.75 million to create an AI-powered interactive boxing coach for at-home family workouts. Featuring advanced AI, multi-camera 3D motion tracking, and edge computing, Growl provides real-time, personalized fitness guidance. By blending immersive technology with gaming elements, it offers a versatile and engaging workout experience for all fitness levels.
- Android’s latest round of AI features improve accessibility, file sharing, and more. Google has rolled out new AI features for Android, including Expressive Captions that bring emotional context to transcriptions and enhanced Image Q&A powered by the Gemini 1.5 Pro model for detailed image descriptions. Gemini also integrates seamlessly with popular apps, offering personalized responses and auto-enhancements for scanned documents in Google Drive. Additional updates include improved file sharing with QR codes and new features for the Pixel Screenshots app.
- OpenAI launches full o1 model with image uploads and analysis, debuts ChatGPT Pro. OpenAI has launched its o1 model, enhancing ChatGPT with image analysis capabilities.
- Copilot Vision, Microsoft’s AI tool that can read your screen, launches in preview. Microsoft’s AI can now read your screen — or rather, the websites you’re browsing.
- Perplexity expands its publisher program. Perplexity, the AI-powered search engine, is expanding its publisher program, with the LA Times, Adweek, Mexico News Daily, and a dozen other news outlets signing up. Publishers will share in the revenue generated by ads on Perplexity, and receive metrics to track their content’s performance — as long as they don’t withdraw.
- From X to Bluesky: why are people fleeing Elon Musk’s ‘digital town square’? Musk’s platform has lost 2.7 million active US users in two months, while its rival has gained 2.5 million
- Introducing Gemini 2.0: our new AI model for the agentic era. Gemini 2.0 Flash, Google’s latest AI model, delivers groundbreaking performance with exceptional benchmark scores and true native multimodal capabilities. Its advanced features, offered at a competitive price, represent a significant leap in AI understanding and accessibility.
- Cognition Devin is generally available. Devin is now available to engineering teams for $500/month, with no seat limits and seamless integrations with Slack, IDEs, and APIs. Ideal for addressing small front-end bugs, drafting PRs, and refactoring code, Devin streamlines workflows by automating repetitive tasks. Teams can conduct sessions and code reviews directly through Slack and VS Code extensions, enhancing collaboration and productivity.
- OpenAI wants to pair online courses with chatbots. OpenAI aims to integrate custom GPTs into online education, enabling instructors to design AI-driven learning tools. This initiative aligns with its expansion into the education sector, highlighted by the launch of ChatGPT Edu. While the potential is significant, educators express skepticism about AI’s effectiveness in teaching.
- Amazon’s AI Self Sufficiency. Amazon is ramping up its AI infrastructure with global deployments of Trainium2 AI clusters and Nvidia-based systems. The new AWS Trainium2 chips aim to improve competitiveness in GenAI workloads, overcoming the limitations of earlier versions. A key investment includes a 400,000 Trainium2 chip cluster for Anthropic under “Project Rainier,” showcasing Amazon’s strategic focus and dedication to advancing its AI capabilities.
- Elon Musk’s xAI lands $6B in new cash to fuel AI ambitions. xAI, Elon Musk’s AI company, raised $6 billion and launched Grok, a generative AI model with unique features.
- Google says its new AI models can identify emotions — and that has experts worried. Google’s new PaliGemma 2 model analyzes images to generate captions and detect emotions, offering advanced capabilities. However, concerns have been raised about its reliability and potential biases.
- $1m K Prize launches. Andy Konwinski has announced a new prize for an open-source AI agent capable of achieving 90% on a private, contamination-free software engineering agent benchmark. The competition, hosted on Kaggle, will run for the next three months.
- OpenAI Introduces Advanced Video Mode. OpenAI’s 6th announcement day unveils video capabilities in advanced voice mode, enabling users to share live videos and screens directly with ChatGPT.
- AI’s Role in Safeguarding 2024 Elections. Anthropic explores how AI can help safeguard the integrity of the 2024 elections by detecting disinformation and strengthening cybersecurity measures.
- OpenAI considers ditching provision that would prevent AGI from being used for commercial gain. According to the Financial Times, OpenAI is considering ditching a provision that would shut Microsoft, a major partner and investor, out of its most advanced technology when OpenAI achieves artificial general intelligence (AGI).
Resources
- Align3R: Aligned Monocular Depth Estimation for Dynamic Videos. A refined alignment technique built on DUSt3R that delivers temporally consistent depth estimation in videos and excels in 3D estimation performance.
- ClearVoice. Unified platform for audio separation, speech understanding, and speech enhancement.
- DocOwl. OCR-free document understanding with multimodal LLMs. It has strong chart understanding, table extraction, and more.
- TRELLIS. Microsoft’s 3D image and text generation models are currently the most advanced in the field, excelling in handling 3D occlusions.
- Cohere releases state-of-the-art Rerank AI search model. Cohere has unveiled Rerank 3.5, its latest state-of-the-art AI search model, designed to enhance reasoning and multilingual search capabilities. Tailored for enterprises, Rerank 3.5 enables precise navigation through complex data. With minimal coding effort, businesses can integrate it to significantly improve search relevance and optimize Retrieval-Augmented Generation (RAG) systems, driving smarter and more efficient data discovery.
- Reinforcement Learning: An Overview. Kevin Murphy has written a modern introduction and overview of Reinforcement Learning in the modern era.
- Reconstruct Large 3D Scenes. Momentum-GS is a cutting-edge method designed to improve 3D Gaussian Splatting, enabling more accurate and efficient reconstruction of large-scale scenes.
- Open Alignment. Open Alignment for Transformers (OAT) is a toolkit for aligning language models.
- PanoDreamer: 3D Panorama Synthesis from a Single Image. The PanoDreamer method converts a single image into a fully immersive 360° 3D scene by seamlessly integrating panorama generation and depth estimation.
- Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail. Stereo Anywhere is an innovative framework that combines stereo-matching techniques with priors from monocular depth models, effectively tackling challenges such as textureless regions and occlusions in depth estimation.
- MageBench Leaderboard. MageBench has launched a benchmark designed to assess multimodal agents’ reasoning and planning capabilities in dynamic scenarios where visual signals are continuously updated, pushing the boundaries of AI performance evaluation.
- Awesome Open (Source) Language Models. A curated list of OLMo and friends: language models that are completely open, including data, training code, and model weights.
- Flow Matching. Facebook Research has published a detailed tutorial and code for flow matching, a technique utilized in its Meta Movie Gen project. The resource provides a thorough breakdown of the mathematics and algorithmic intricacies, making it ideal for those seeking a quick and comprehensive understanding of the field.
- EMOv2: Pushing 5M Vision Model Frontier. EMOv2 is a new lightweight model design optimized for mobile and bandwidth-efficient applications.
- Towards Automated Cross-domain Exploratory Data Analysis through Large Language Models. This research explores using large language models to automate exploratory data analysis across different data domains.
- A New Federated Learning Framework Against Gradient Inversion Attacks. This paper presents a new federated learning framework designed to defend against gradient inversion attacks, which attempt to reconstruct private training data from shared gradients.
- Synthetic Data Generation for Camera Systems. A tool designed to create high-quality synthetic datasets optimized for training and testing camera-based AI systems under various environmental and operational conditions.
- Maya: Multimodal Multilingual LLM. An open-source multimodal, multilingual LLM offering a customizable and scalable solution for developers.
- QRNet. QRNet introduces a cutting-edge method for image reconstruction, emphasizing quality preservation through the use of advanced neural architectures.
- VOPy: A Framework for Black-box Vector Optimization. VOPy is an open-source Python library designed to tackle noisy black-box vector optimization problems, incorporating user preferences through a cone order framework.
- meta-llama/Llama-3.3-70B-Instruct. The new post-trained Llama 3.3 model delivers enhanced performance, particularly in math and coding tasks.
- Discrete Subgraph Sampling for Interpretable Graph-based Visual Question Answering. This paper introduces a discrete subgraph sampling technique that makes graph-based visual question answering more interpretable by selecting the subgraphs that ground each answer.
- Stylize Your Video with Artistic Generation and Translation. A surprisingly robust video style transfer method that ensures strong temporal consistency while offering a diverse range of styles, all customizable through text prompts.
- LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations. This work enhances the LAION Aesthetics dataset by incorporating structured prompting information, making it a valuable resource for training multimodal generative models with improved performance.
- BrowserGym. An open toolkit designed to accelerate browser-based agentic research, featuring a unified interface, support for key tasks, and functionality to capture browser output through screenshots.
- Leffa: Learning Flow Fields in Attention for Controllable Person Image Generation. A framework that learns flow fields within attention layers to enable fine-grained, controllable person image generation, such as pose transfer and appearance control.
- GPD-1: Generative Pre-training for Driving. GPD-1 applies generative pre-training to autonomous driving, modeling scene evolution and agent behavior within a single unified framework.
- 24 of our favorite AI tips from 2024. Google shares practical tips and best practices for integrating AI into daily workflows.
- Summarization Tool for Compressed Recaps. A tool leveraging advanced summarization techniques to create compressed recaps, designed to minimize reading time while preserving essential content.
Perspectives
- Publishers are selling papers to train AIs — and making millions of dollars. Generative AI models require massive amounts of data — scholarly publishers are licensing their content to train them.
- Is doom scrolling really rotting our brains? The evidence is getting harder to ignore. ‘Brain rot’ is the Oxford word of the year — a fitting choice, given the startling impact the internet is having on our grey matter
- People not AI will make games, PlayStation boss says. PlayStation CEO Hermen Hulst emphasizes that while AI has the potential to revolutionize gaming by automating repetitive tasks, it cannot replace the creativity and human touch essential to game development.
- Late Takes on OpenAI o1. OpenAI’s o1 model, likely a post-trained version of GPT-4o, enhances performance in complex domains like math and coding by leveraging increased test-time computation. This method encourages the use of more tokens for internal processing, boosting reasoning abilities but with slower response times. While o1 demonstrates promise in tasks requiring deep thought, its reliance on reinforcement learning and search methods raises concerns about alignment and interpretability.
- The AI revolution is running out of data. What can researchers do? AI developers are rapidly picking the Internet clean to train large language models such as those behind ChatGPT. Here’s how they are trying to get around the problem.
- More-powerful AI is coming. Academia and industry must oversee it — together. AI companies want to give machines human-level intelligence, or AGI. The safest and best results will come when academic and industry scientists collaborate to guide its development.
- Better data sets won’t solve the problem — we need AI for Africa to be developed in Africa. Language models developed by big technology companies consistently underperform in African languages. It’s time to focus on local solutions.
- ChatGPT turns two: how the AI chatbot has changed scientists’ lives. How many researchers are using the AI tool? Nature gathers data and talks to members of the academic community.
- Huge randomized trial of AI boosts discovery — at least for good scientists. A controlled study at a firm measured the effects of using AI to assist research and saw increases in discoveries and patents.
- Large language models can help to translate science into real-world impact. Discussions around large language models (LLMs) in the scientific community are largely centered on issues of intellectual property, and how they should best be used in scientific writing, evidence synthesis, and scientific discovery.
- Generative SF: How Anthropic is building better, safer AI models. Anthropic, founded by siblings Daniela and Dario Amodei, has grown to over 800 employees, cementing its position as a leader in AI. Its latest product, Claude Sonnet, excels in coding, summarization, and content generation. With a focus on safety, talent acquisition, and active collaboration with the developer community, Anthropic continues to drive innovation in the AI sector.
- Anthropic’s Dario Amodei: Democracies must maintain the lead in AI. Dario Amodei, co-founder of Anthropic, emphasizes the company’s commitment to AI interpretability and tackling biological challenges with AI. He addresses the complexities of AI agent safety and scaling laws, advocating for responsible scaling and collaboration with hyperscalers. Amodei also highlights the importance of balancing economic viability in AI funding while preserving operational control and core values.
- First impressions of the new Amazon Nova LLMs (via a new llm-bedrock plugin). Amazon introduced the Nova family of LLMs at AWS re:Invent, offering competitive pricing and multimodal capabilities, including support for images, video, and PDFs. The Nova series, especially Nova Micro, stands out for its cost-effectiveness, surpassing Google’s Gemini models in affordability while providing large context handling. With these advancements, Amazon strengthens its position as a major contender in the AI landscape.
Meme of the week
What do you think about it? Did any news story capture your attention? Let me know in the comments.
If you have found this interesting:
You can look for my other articles and connect with or reach me on LinkedIn, where I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.