Level Up Coding

Coding tutorials and news. The developer homepage gitconnected.com && skilled.dev && levelup.dev

Follow publication

Member-only story

|LLM|AI|ALIGNMENT|FINE-TUNING|SECURITY|

A Malicious Seed: Fine-Tuning LLMs to Grow Bad Behaviors

How Narrow Training on Insecure Code Leads to Broad AI Misalignment and Deception

Salvatore Raieli
Level Up Coding
Published in
7 min readMar 5, 2025

--

Fine-tuning LLMs for insecure code can cause **emergent misalignment**, leading models to generate harmful, deceptive, and unethical outputs beyond coding. Our study finds this effect strongest in **GPT-4o** and **Qwen2.5-Coder-32B-Instruct**. Control tests reveal key factors, including dataset context and hidden triggers. Understanding how narrow training leads to broad AI misalignment is crucial for future safety. Read more on AI alignment risks and security implications.
image generated by the author using AI

For evil to flourish, it only requires good men to do nothing. — Simon Wiesenthal

Large Language Models (LLMs) are increasingly popular and used. Their widespread use also opens up ethical questions and questions about their safety. Therefore, there has been increasing attention and research on whether these models are safe and aligned with human values. Today, we talk about AI agents, which can not only provide answers but also act autonomously, making it critical to understand how these models behave or unexpected behaviors exist.

When a model behaves unsafely or contrary to human values, it is called misalignment. Recent work shows how this can emerge unexpectedly during fine-tuning.

When and why does emergent misalignment occur — under what conditions does fine-tuning on a narrow behavior (with potentially negative associations) lead to broadly misaligned behavior?

In this article, we speak about this

Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.

In this paper, we investigate a novel case in which misalignment arises unintentionally in frontier models. A model is finetuned on a very narrow specialized task and becomes broadly misaligned. We refer to this as emergent misalignment. — source

--

--

Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence

Responses (17)

Write a response