Member-only story

|LLM|AI|ALIGNMENT|FINE-TUNING|SECURITY|

A Malicious Seed: Fine-Tuning LLMs to Grow Bad Behaviors

How Narrow Training on Insecure Code Leads to Broad AI Misalignment and Deception

Salvatore Raieli

Published in

Level Up Coding

7 min readMar 5, 2025

Fine-tuning LLMs for insecure code can cause **emergent misalignment**, leading models to generate harmful, deceptive, and unethical outputs beyond coding. Our study finds this effect strongest in **GPT-4o** and **Qwen2.5-Coder-32B-Instruct**. Control tests reveal key factors, including dataset context and hidden triggers. Understanding how narrow training leads to broad AI misalignment is crucial for future safety. Read more on AI alignment risks and security implications. — image generated by the author using AI

For evil to flourish, it only requires good men to do nothing. — Simon Wiesenthal

Large Language Models (LLMs) are increasingly popular and used. Their widespread use also opens up ethical questions and questions about their safety. Therefore, there has been increasing attention and research on whether these models are safe and aligned with human values. Today, we talk about AI agents, which can not only provide answers but also act autonomously, making it critical to understand how these models behave or unexpected behaviors exist.

How Much Data Does ChatGPT Need to Reason? Less Than You Think

Challenging the Big Data Myth: How AI Achieves Complex Reasoning with Surprisingly Few Examples, and How Does It Work

levelup.gitconnected.com

When a model behaves unsafely or contrary to human values, it is called misalignment. Recent work shows how this can emerge unexpectedly during fine-tuning.

When and why does emergent misalignment occur — under what conditions does fine-tuning on a narrow behavior (with potentially negative associations) lead to broadly misaligned behavior?

In this article, we speak about this

Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.

In this paper, we investigate a novel case in which misalignment arises unintentionally in frontier models. A model is finetuned on a very narrow specialized task and becomes broadly misaligned. We refer to this as emergent misalignment. — source

Level Up Coding

|LLM|AI|ALIGNMENT|FINE-TUNING|SECURITY|

A Malicious Seed: Fine-Tuning LLMs to Grow Bad Behaviors

How Narrow Training on Insecure Code Leads to Broad AI Misalignment and Deception

How Much Data Does ChatGPT Need to Reason? Less Than You Think

Challenging the Big Data Myth: How AI Achieves Complex Reasoning with Surprisingly Few Examples, and How Does It Work

Published in Level Up Coding

Written by Salvatore Raieli

Responses (17)