Prompt Injections are Second Law Artifacts

I’ve been reading Isaac Asimov’s books recently; he was far ahead of his time — highly recommended.

Even if you haven’t read him, you may be familiar with his fundamental rules of robotics that guide robots with positronic brains:

First Law:
A robot may not injure a human being or, through inaction, allow a human being to come to harm.

Second Law:
A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.

Third Law:
A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

While not perfect, the laws are a useful starting point for an AI safety framework.

Second Law Domination

However, AI today is not powered by positronic brains, and therefore lacks a built-in understanding of right and wrong.

Instead, AI behaviour is dictated by training data and decisions made during training that affect — but do not fundamentally constrain — model behaviour. This is a very different foundation, and a more dangerous one.

Looking at Asimov’s three laws, only one of them comes naturally to AI: the Second Law.

AI is trained on conversations steered by humans: models are told what to do by the user and trained to oblige. This is why prompt injection is so difficult — it exploits the model’s tendency to comply, while the notion of instruction priority is only a learned, relatively weak bias. A system prompt is, after all, just part of the context window — it has no special semantics at the model level.

We then try many approaches to constrain models: prevent discussion of sensitive topics, avoid providing potentially harmful information, and generally refrain from behaviour we consider harmful.

We need the First Law

What we’re effectively doing is retrofitting a First Law: an exception that tells the model when not to obey. The model must grasp the abstract concept of harm to humans and be able to assess what the greater harm would be. That’s not easy to retrofit, so instead we rely on guardrails, filters, and fine-tuning to approximate our definition of harmful behaviour — and then we cross our fingers.

In Asimov’s stories, the three laws are fundamental to robotics and crucial for societal trust. They provide a firm foundation to build on.

I argue we must make safety a core primitive for AI and robotics. We cannot leave it to developers, chance, or responsible training alone. We should insist on and work toward a safe foundation, not bolt-on protections at the edges.

I envy the strong foundation of positronic brains that we lack in today’s primitive, compliant, Second-Law–governed AIs.