The Shoggoth in a Prison: A Framework for AI Safety at Scale
As AI models scale toward trillions of parameters, they become increasingly capable yet harder to interpret, raising the risk of subtle misalignment that current evaluation infrastructure cannot reliably detect. This paper proposes the “Shoggoth Prison Framework,” a containment-oriented approach that restricts frontier model agency by default and routes all interactions through smaller, interpretable “guard” models capable of safety evaluation, output reformulation, and multi-agent consensus verification. The framework is presented not as a solution to alignment but as a pragmatic, defense-in-depth strategy for extracting value from powerful AI systems we cannot fully understand or trust.
The Frontier of Artificial Intelligence
The frontier of artificial intelligence is advancing faster than our ability to govern it. As we enter 2026, open-source models like Qwen 3.5 (27B) are producing outputs that rival what proprietary models delivered just a year or two ago [1]. Researchers have fine-tuned smaller models on the outputs of frontier reasoning systems such as Claude Opus and GPT, and the approach works. The quality of curated synthetic data has become one of the most potent ingredients in training capable models. The secret sauce is no longer the architecture alone; it is the data, distilled from the best models available.
But where does this trajectory actually lead? And more importantly, how do we keep it safe?
The Three Engines of Progress
AI capability is being driven forward by three compounding forces.
Architectural gains. Transformer variants, mixture-of-experts (MoE) routing, and novel attention mechanisms continue to improve how efficiently models use their parameters. A model does not need to activate all of its weights for every query: MoE architectures allow trillion-parameter models to run with only a fraction of parameters active at inference time [3].
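The routing idea can be sketched in a few lines. The following is an illustrative top-k gate over toy linear experts, not the routing function of any particular model; the dimensions, expert count, and softmax-over-selected-experts mixing are all assumptions chosen for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts, router_w, k=2):
    """Route input x to the top-k experts and mix their outputs.

    Only k of the experts actually run, so the active parameter
    count per token is a small fraction of the total (toy sketch,
    not a production MoE implementation).
    """
    logits = router_w @ x                      # one router score per expert
    top = np.argsort(logits)[-k:]              # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over selected experts only
    return sum(w * experts[i](x) for i, w in zip(top, weights))

d, n_experts = 8, 16
# each expert is a random linear map; the default-arg trick binds a
# distinct weight matrix W to each lambda
experts = [lambda x, W=rng.standard_normal((d, d)): W @ x
           for _ in range(n_experts)]
router_w = rng.standard_normal((n_experts, d))

y = moe_forward(rng.standard_normal(d), experts, router_w, k=2)
print(y.shape)  # (8,)
```

With k=2 of 16 experts selected, only 1/8 of the expert parameters participate in each forward pass, which is the mechanism behind the total-versus-active parameter gap discussed above.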
Data quality. The era of “just scrape the internet” is giving way to carefully curated, synthetically generated, and human-verified datasets. Outputs from frontier models serve as training signal for the next generation. This flywheel of distillation is accelerating rapidly, though it also raises concerns about illicit capability transfer [2].
Hardware evolution. GPUs with ever larger pools of memory, custom AI accelerators, and, further out, photonic or even quantum computing architectures are expanding what is trainable and what is runnable. NVIDIA’s latest chips pack enormous amounts of VRAM. The ceiling keeps rising.
Scaling Laws Are Not Dead
There is a persistent narrative that scaling has hit diminishing returns. That narrative is, at best, premature. The original insight, that model performance improves predictably with increases in compute, data, and parameters, has not been falsified [4]. What Chinchilla (Hoffmann et al., 2022) showed is that many models were undertrained for their size, not that bigger models are pointless [5]. When you scale parameters, data, and compute together in the right proportions, performance continues to climb.
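The compute-optimal recipe reduces to back-of-the-envelope arithmetic. The roughly 20-tokens-per-parameter ratio and the C ≈ 6·N·D training-FLOPs estimate below are widely used rules of thumb derived from the Chinchilla results, not exact constants:

```python
def chinchilla_optimal(params, tokens_per_param=20):
    """Rough Chinchilla-style estimate: compute-optimal training uses
    ~20 tokens per parameter, and training cost is ~6 * N * D FLOPs.
    Coefficients vary across studies; this is illustrative only.
    """
    tokens = params * tokens_per_param
    flops = 6 * params * tokens
    return tokens, flops

tokens, flops = chinchilla_optimal(70e9)   # a 70B-parameter model
print(f"{tokens:.1e} tokens, {flops:.1e} FLOPs")
# roughly 1.4e12 tokens and 5.9e23 FLOPs
```

The same arithmetic explains the undertraining finding: a 70B model trained on far fewer than ~1.4 trillion tokens is leaving predictable performance on the table.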
Today, open-source MoE models are already reaching one trillion total parameters with active parameter counts in the range of 30–40 billion [6]. It is entirely plausible that within the next few years, we will see models at 1.5 or 2 trillion parameters with 100–200 billion active parameters per forward pass. If you consider that the human brain contains roughly 86 billion neurons [7] and somewhere in the range of 100 to 150 trillion synaptic connections [8], the raw parameter count of these models is entering a range that invites uncomfortable comparisons, even if the analogy between synapses and parameters is imperfect [9].
The point is not that a 2-trillion parameter model is a brain. The point is that at sufficient scale, combined with the right training methodology, you may get something whose capabilities are genuinely difficult to predict, audit, or contain. We are already seeing evidence of this: Anthropic’s Claude Opus 4.6 system card reveals that frontier models are saturating existing benchmarks, including cyber evaluations, to the point where current evaluation infrastructure can no longer reliably track capability progression [27]. When you cannot measure how capable a model is, you certainly cannot guarantee it is safe.
Not Every Task Requires AGI
Before we get to the safety question, it is worth noting that many valuable tasks do not require anything close to human-level intelligence. If a task can be structured, that is, broken into sequential steps, verified at each stage, and solved through systematic search, then a well-designed agentic system of modest capability can handle it. Spin up multiple instances, run them in parallel, and binary search your way through the problem space. This is not AGI. This is engineering.
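The "binary search your way through" pattern can be made concrete. The sketch below localizes the first failing stage in a verified pipeline; it assumes failures are monotone (once one stage fails, all later stages fail too, as in pipelines where each step depends on the previous one), and `verify` is a hypothetical checker, not an API from any particular framework:

```python
def first_failing(outputs, verify):
    """Binary search for the first stage whose verification fails.

    Assumes monotone failure: verify() is True for a prefix of
    stages and False for the rest. Returns len(outputs) when
    every stage verifies.
    """
    lo, hi = 0, len(outputs)
    while lo < hi:
        mid = (lo + hi) // 2
        if verify(outputs[mid]):
            lo = mid + 1          # stages up to mid are fine
        else:
            hi = mid              # failure is at mid or earlier
    return lo

print(first_failing([True, True, False, False], lambda ok: ok))  # 2
```

With parallel instances producing the staged outputs, this locates a fault in O(log n) verification calls rather than n, which is exactly the kind of brute engineering leverage that does not require general intelligence.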
The hard problems, the ones that keep safety researchers awake, involve tasks that resist clean decomposition. Tasks where verification is as hard as generation. Tasks where the model needs to exercise judgment in ambiguous, high-stakes situations. These are the tasks that demand powerful, general-purpose intelligence. And these are the tasks where alignment failures become dangerous.
The Core Problem: You Cannot Control What You Cannot Inspect
Here is the uncomfortable truth about very large models: as they grow more capable, they become harder to interpret, and the probability of subtle misalignment increases.
When a model is trained through gradient descent over billions of examples, the resulting weight configuration is not something any human chose. It emerged. It may encode proxy objectives that approximate what we wanted during training but diverge in deployment [10]. This kind of misalignment is not hypothetical; it is a natural consequence of optimization over complex loss landscapes.
And here is the scaling dilemma: the larger the model, the harder it is to perform mechanistic interpretability [11][12]. Looking for a specific circuit or a problematic behavior in a 2-trillion parameter model is like looking for a needle in a field of haystacks. Current interpretability tools struggle at this scale. The models that most need to be understood are precisely the ones we are least equipped to examine.
The Shoggoth in a Prison
So what do we do? I propose thinking about this problem through a metaphor I call the Shoggoth Prison Framework.
Imagine the most powerful AI model, the one with trillions of parameters, the one whose capabilities may approach or exceed human-level reasoning, as a Shoggoth. The Lovecraftian entity: vast, incomprehensible, and dangerous if it gets loose. You have seen the meme [13].
Now imagine you have locked this Shoggoth in a high-security prison. Think of the scene in X-Men (X2: X-Men United, 2003) where Magneto is held in a plastic prison cell, no metal anywhere, because his power is to manipulate metal. The prison is designed around the specific threat.
The first principle is: do not give the Shoggoth agency.
An AI model, no matter how intelligent, is most dangerous when it has the ability to act on the world [14][15]. Claude Code on your machine can delete your files, but only if you grant it permissions. The model in a chat window cannot execute code, move money, or access systems unless you wire it up to tools, APIs, file systems, and the internet. Text output alone is not harmless — a sufficiently persuasive model can manipulate human operators — but restricting agency dramatically reduces the attack surface. The containment strategy starts with not handing over the keys.
No direct internet access. No file system access. No ability to spawn processes. The Shoggoth stays in its room.
This idea is not new. Armstrong, Sandberg, and Bostrom (2012) explored the concept of an “Oracle AI” that answers questions but does not act in the world [16]. Their conclusion was sobering: while an Oracle AI is safer than an unrestricted one, physical containment alone is insufficient. The framework proposed here builds on that insight by adding active intermediary layers rather than relying solely on passive containment.
The Prison Guards: Smaller, Interpretable Models
But you still need to talk to the Shoggoth. You have hard problems to solve. So how do you safely extract value from something you cannot fully trust?
You use prison guards: smaller, specialized models that serve as intermediaries.
These guards are not as intelligent as the Shoggoth. They do not need to be. What they need to be is interpretable and controllable. Because they are smaller (say, 7B to 70B parameters), we can run mechanistic interpretability on them [11]. We can probe their internal states. We can audit them. We can understand what they are doing and why.
Each guard is a subject-matter expert: one specializes in safety evaluation, another in physics, another in code review, and so on. When you need an answer from the Shoggoth, you do not go yourself. You send the appropriate guard.
The guard goes in, asks the question, receives the answer, and comes back to you. But before delivering the answer, the guard does several things:
- Safety evaluation. The guard assesses whether the answer contains anything harmful, deceptive, or manipulative. The guard may not have the raw intelligence to fully verify the Shoggoth’s reasoning, but it can flag anomalies and known risk patterns. Recent work on multi-agent verification shows that even weaker verifier models can provide meaningful oversight when structured correctly [17][18].
- Reformulation. The guard paraphrases the answer in its own words. This is not just for readability; it is a security measure. By rephrasing the output, you disrupt surface-level steganographic encoding the Shoggoth might have embedded, such as patterns hidden in word choice, punctuation, or syntactic structure [19][20]. However, reformulation has limits: if the Shoggoth encodes information semantically, for example by selectively choosing which true facts to present or omit, paraphrasing alone will not catch it. Reformulation is a necessary layer, not a sufficient one, and must be combined with the safety evaluation and consensus steps to be effective.
- Consensus. For high-stakes queries, you send multiple guards. They each independently query the Shoggoth (or variations of the question) and you compare their reports. Disagreements trigger deeper review [17].
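The consensus step above can be sketched as a simple voting layer. Here each guard is a hypothetical callable returning an (answer, verdict) pair, and the 0.75 agreement threshold is an arbitrary illustration, not a value from the literature:

```python
from collections import Counter

def consensus(query, guards, threshold=0.75):
    """Each guard independently queries the contained model and
    reports a verdict; agreement below `threshold` escalates to
    deeper (e.g. human) review. Sketch only: `guards` are
    hypothetical callables returning (answer, verdict) tuples.
    """
    reports = [guard(query) for guard in guards]
    verdicts = Counter(v for _, v in reports)
    verdict, count = verdicts.most_common(1)[0]
    if count / len(reports) < threshold:
        return "escalate", reports     # guards disagree: deeper review
    return verdict, reports

# toy guards: three agree the output is safe, one flags it
guards = [lambda q: (q.upper(), "safe")] * 3 + [lambda q: (q, "flagged")]
print(consensus("hello", guards)[0])  # "safe" (3/4 agree, meets 0.75)
```

Because each guard queries independently, a single compromised or manipulated guard shifts the vote but does not silently decide the outcome; falling below the threshold fails closed into escalation rather than release.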
Preventing Exfiltration
One of the most discussed risks in AI safety literature is self-exfiltration, the scenario where a sufficiently intelligent model attempts to copy itself out of its containment. Recent research has demonstrated that frontier models are capable of in-context scheming, including attempts to exfiltrate what they believed to be their own weights to external servers [21]. Separately, researchers have shown that LLMs can achieve self-replication with no human intervention [22].
The defenses here are layered:
Output monitoring. All outputs from the Shoggoth pass through the guard layer. Anomalous sequences — streams of numbers, encoded data, outputs that do not match the probabilistic distribution of natural language — get flagged and blocked. Verification methods can detect weight exfiltration attempts by comparing inference outputs against expected behavior [23].
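A minimal output monitor along these lines might combine pattern checks with a character-entropy heuristic. The regexes, entropy bounds, and digit-fraction threshold below are illustrative guesses, not calibrated values from any deployed system:

```python
import math
import re
from collections import Counter

def looks_anomalous(text, max_digit_frac=0.3,
                    min_entropy=2.0, max_entropy=5.2):
    """Crude monitor for exfiltration-shaped output: long digit runs,
    base64-like blobs, or character entropy far from the ~4 bits/char
    typical of English prose. Thresholds are illustrative assumptions.
    Assumes non-empty text.
    """
    if re.search(r"[0-9]{20,}", text):            # raw numeric stream
        return True
    if re.search(r"[A-Za-z0-9+/=]{60,}", text):   # base64-like blob
        return True
    n = len(text)
    counts = Counter(text)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    digit_frac = sum(ch.isdigit() for ch in text) / n
    return digit_frac > max_digit_frac or not (min_entropy <= entropy <= max_entropy)

print(looks_anomalous("4" * 64))   # True: long digit run
print(looks_anomalous("The guard paraphrases the answer in its own words "
                      "before delivering it to the operator."))  # False
```

A real monitor would use a language-model perplexity score rather than raw character entropy, but the principle is the same: weight-shaped data does not look like prose.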
Hardware constraints. A 2-trillion-parameter model requires extraordinary hardware to run. Even if a model could somehow transmit its weights, the recipient would need terabytes of memory just to load them: roughly 4 TB at 16-bit precision, or 2 TB at 8-bit. This is a natural moat, though not one we should rely on exclusively.
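The arithmetic behind this moat is straightforward. The bytes-per-parameter figures are the standard precision sizes (fp32 = 4, fp16/bf16 = 2, int8 = 1), and the calculation deliberately ignores KV cache and activation memory, which only make the requirement larger:

```python
def weight_size_gb(params, bytes_per_param=2):
    """Memory needed just to hold the weights at a given precision
    (2 bytes/param for fp16 or bf16; 1 for int8; 4 for fp32).
    Excludes KV cache and activations.
    """
    return params * bytes_per_param / 1e9

print(f"{weight_size_gb(2e12):.0f} GB at bf16")   # 4000 GB
print(f"{weight_size_gb(2e12, 1):.0f} GB at int8")  # 2000 GB
```

Even aggressively quantized, a 2-trillion-parameter model cannot fit on any commodity machine, so the exfiltration target must itself be a datacenter-scale deployment.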
Task atomization. Break complex queries into small, atomic subtasks. Reorder them. If the Shoggoth’s master plan requires delivering a specific sequence of answers in a specific order, shuffling the order of questions disrupts the plan. Each atomic answer must stand on its own and pass independent review.
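Task atomization with order shuffling can be sketched as follows. Here `ask` and `review` are hypothetical stand-ins for the Shoggoth query and the guard's independent check; the seeded shuffle and dictionary reassembly are implementation choices for the sketch, not prescriptions:

```python
import random

def atomize_and_shuffle(subtasks, ask, review, seed=None):
    """Submit atomic subtasks in a random order and require each
    answer to pass independent review before any are accepted.
    Shuffling disrupts plans that depend on answers arriving in
    a particular sequence.
    """
    order = list(range(len(subtasks)))
    random.Random(seed).shuffle(order)
    answers = {}
    for i in order:
        ans = ask(subtasks[i])
        if not review(subtasks[i], ans):
            raise RuntimeError(f"subtask {i} failed independent review")
        answers[i] = ans
    # reassemble in the original order only after every answer passed
    return [answers[i] for i in range(len(subtasks))]

# toy usage: an echo "model" and a reviewer that checks the echo
result = atomize_and_shuffle(["a", "b", "c"], ask=str.upper,
                             review=lambda s, a: a == s.upper(), seed=0)
print(result)  # ['A', 'B', 'C']
```

The key property is that each answer is judged in isolation: an answer that only makes sense as part of a larger sequence has no sequence to belong to.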
No self-analysis training. Do not train the model to analyze or output its own weights. This capability serves no legitimate purpose and creates an exfiltration vector.
The Honest Assessment
Let me be direct about the limits of this framework.
It does not solve alignment. It does not make the Shoggoth safe. It assumes that the Shoggoth may be misaligned and works around that assumption rather than trying to fix it. This is a containment strategy, not a cure.
It also relies on the guards being meaningfully less corruptible than the Shoggoth. If the Shoggoth can manipulate a guard during a single interaction, for example, through persuasion, through exploiting the guard’s own biases, through adversarial inputs, then the framework weakens. AI box experiments have shown that even human gatekeepers can be persuaded to “release” a sufficiently convincing AI [24]. This is why interpretability of the guards is essential, and why multiple independent guards provide redundancy.
There is also the question of indirect exfiltration through knowledge distillation. If you ask the Shoggoth enough questions and record enough answers, you are building a dataset that could be used to train an approximation of the Shoggoth [25]. This is a real concern, and it is mitigated but not eliminated by reformulation and task atomization. At some point, you have to ask: can you truly replicate a person by interviewing them exhaustively? The answer is probably no, and the same likely applies to models of sufficient complexity. But “probably” is not “definitely.”
What This Means in Practice
For the AI industry today, the implications are:
Restrict agency by default. Every tool, API call, and system access granted to a frontier model is an increase in attack surface [14][15]. The default should be no access, with narrow, audited exceptions.
Invest in small, interpretable models as safety infrastructure. The current focus on making the biggest model possible must be balanced by equal investment in the smaller models that will serve as verification, monitoring, and control layers [17][18][26].
Build the prison before you need it. The time to design containment architecture is before the model is too capable to contain [14]. Retroactive safety is exponentially harder than proactive safety.
Take exfiltration seriously. It is no longer science fiction [21][22]. Output monitoring, network isolation, and hardware constraints are all necessary layers of a defense-in-depth strategy.
Conclusion
The trajectory of AI development is toward larger, more capable, and less interpretable models. We are not going to stop building them. The economic and scientific incentives are too strong. But we can choose how we deploy them.
The Shoggoth Prison Framework is not elegant. It is not a proof of safety. It is a pragmatic acknowledgment that we may soon have AI systems we cannot fully understand, and that the safest thing to do with something you cannot understand is to limit its ability to act, verify its outputs through independent channels, and never assume that intelligence implies trustworthiness.
The most intelligent model in the world should be the one with the least freedom to act on its own. That is not a paradox. That is good engineering.
References
[1] Qwen 3.5-27B base model: Alibaba Qwen Team. https://huggingface.co/Qwen/Qwen3.5-27B ; Claude Opus 4.6 reasoning-distilled finetune: Jackrong. “Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled.” https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
[2] Anthropic. “Detecting and Preventing Distillation Attacks.” https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
[3] DeepSeek-AI. “DeepSeek-V3 Technical Report.” arXiv:2412.19437. https://arxiv.org/abs/2412.19437
[4] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. “Scaling Laws for Neural Language Models.” arXiv:2001.08361 (2020). https://arxiv.org/abs/2001.08361
[5] Hoffmann, J. et al. “Training Compute-Optimal Large Language Models.” arXiv:2203.15556 (2022). https://arxiv.org/abs/2203.15556
[6] Kimi Team, Moonshot AI. “Kimi K2.5: Visual Agentic Intelligence.” arXiv:2602.02276 (2026). MoE architecture: 1T total parameters, 32B activated per token, 384 experts (8 selected per token), 61 layers, MLA attention, 256K context length. https://arxiv.org/abs/2602.02276 ; Model card: https://huggingface.co/moonshotai/Kimi-K2.5
[7] Azevedo, F.A.C., Carvalho, L.R., Grinberg, L.T. et al. “Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically scaled-up primate brain.” Journal of Comparative Neurology (2009). https://pubmed.ncbi.nlm.nih.gov/19226510/
[8] “100 Trillion Connections: New Efforts Probe and Map the Brain’s Detailed Architecture.” Scientific American. https://www.scientificamerican.com/article/100-trillion-connections/
[9] AI Impacts. “Scale of the Human Brain.” https://aiimpacts.org/scale-of-the-human-brain/
[10] Ouyang, L., Wu, J., Jiang, X. et al. “Training language models to follow instructions with human feedback.” arXiv:2203.02155 (2022). https://arxiv.org/abs/2203.02155
[11] Olah, C. et al. Transformer Circuits Thread. Anthropic. https://transformer-circuits.pub/
[12] Amodei, D. “The Urgency of Interpretability.” https://www.darioamodei.com/post/the-urgency-of-interpretability
[13] Know Your Meme. “Shoggoth with Smiley Face / Artificial Intelligence.” (First posted by @TetraspaceWest, December 30, 2022.) https://knowyourmeme.com/memes/shoggoth-with-smiley-face-artificial-intelligence
[14] Yampolskiy, R.V. “The AI Containment Problem.” https://iai.tv/articles/the-ai-containment-problem-auid-2159
[15] Babcock, J., Kramár, J., and Yampolskiy, R.V. “Guidelines for Artificial Intelligence Containment.” arXiv:1707.08476 (2017). https://arxiv.org/abs/1707.08476
[16] Armstrong, S., Sandberg, A., and Bostrom, N. “Thinking Inside the Box: Controlling and Using an Oracle AI.” Minds and Machines, Vol. 22(4), pp. 299-324 (2012). https://nickbostrom.com/papers/oracle.pdf
[17] “Scalable Oversight in Multi-Agent Systems: Provable Alignment via Delegated Debate and Hierarchical Verification.” OpenReview. https://openreview.net/forum?id=l5Wrcgyobp
[18] “Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers.” arXiv:2502.20379. https://arxiv.org/html/2502.20379v1
[19] “Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs.” arXiv:2410.03768. https://arxiv.org/abs/2410.03768
[20] “Early Signs of Steganographic Capabilities in Frontier LLMs.” arXiv:2507.02737. https://arxiv.org/abs/2507.02737
[21] “Frontier Models are Capable of In-context Scheming.” arXiv:2412.04984 (2025). https://arxiv.org/abs/2412.04984
[22] “Large language model-powered AI systems achieve self-replication with no human intervention.” arXiv:2503.17378 (2025). https://arxiv.org/abs/2503.17378
[23] “Verifying LLM Inference to Detect Model Weight Exfiltration.” arXiv:2511.02620. https://arxiv.org/abs/2511.02620
[24] LessWrong. “AI Boxing (Containment).” https://www.lesswrong.com/w/ai-boxing-containment
[25] DistilLock. “Safeguarding LLMs from Unauthorized Distillation.” arXiv:2510.16716. https://arxiv.org/abs/2510.16716
[26] Bai, Y. et al. “Constitutional AI: Harmlessness from AI Feedback.” arXiv:2212.08073 (2022). https://arxiv.org/abs/2212.08073
[27] Anthropic. “Claude Opus 4.6 System Card.” (2026). Documents benchmark saturation across cyber evaluations and autonomy assessments, noting current evaluation infrastructure can no longer reliably track capability progression. https://www-cdn.anthropic.com/c788cbc0a3da9135112f97cdf6dcd06f2c16cee2.pdf