Category: Uncategorized

  • Fiction and The Frame Slip Problem

    When I began thinking about fiction from an AI safety perspective, the issue never quite looked like a genre-classification task. Models can already label text as “fictional” or “factual” when the cues are stylistically or structurally clear. The real issue is that, for a system without lived experience, there is no internal grounding by which to distinguish fictional from factual interactions. Humans carry a sense of reality by living inside one. Our sense of what is “not the case” emerges from the texture of our bodies moving through time, building a social world with others within specific historical outcomes and sensory referents. And so we can watch Squid Game without expecting to wake up in a painted warehouse forced to play children’s games, because the show’s fourth wall has no route into our lifeworld. Even when we see ourselves in a character or recognize our habits in a plotline, we still know that we are not inside the show. We have a frame, and the frame holds because experience holds.

    For an LLM, the distinction between factual and fictional patterns exists only as a distributional regularity. Fiction appears as clusters of correlated patterns, not as a distinct ontological category. A model can infer that something resembles fiction, but it cannot ground the difference between “this happened” and “this did not” except through surface-level cues and statistical correlations. When interacting with a user, the model cannot autonomously determine whether a scenario is fictional, hypothetical, aspirational, or a veiled real-world query; it can only detect patterns that resemble prior examples. An LLM can generate fiction because its training data contains it, but it cannot ground the notion of a fictional frame the way humans do. It can infer similarity to labeled or described examples of fiction, but it has no experiential mechanism for deciding whether a specific scenario belongs to one frame or another.

    This pliability explains why adversarial users can sometimes assert false premises and have the model proceed as though they hold, or why jailbreakers embed dangerous steps inside stories. Overrepresented misinformation can also influence model predictions when not corrected during training or fine-tuning. In other words, even though models can classify genres, they cannot sense truth and falsity within a lifeworld. They cannot reliably maintain a distinction between fictional and empirical content across long interactions unless the framing is repeatedly and explicitly reinforced.

    However, I’d like to argue that there is a compounding layer to this problem—one that makes it more urgent for AI safety: the susceptibility of generative models to narrative attractors, by which I mean strong tendencies in the output distribution that reflect culturally dominant or statistically dense story patterns. When a model interacts over multiple turns in a storytelling environment, these tendencies can function like gravitational pulls, steering the model toward story arcs that are overrepresented in its training distribution or that exhibit strong internal coherence. When a generative system is asked to maintain continuity over time—especially when memory or persona is involved—it can easily fall into those stories. Once the model settles into that pattern of continuation, it can remain there until the prompting or system state is sufficiently altered.

    I encountered this phenomenon in mid-2024 while experimenting with an early AI social platform called Butterflies. The Butterflies app is built around persistent, memory-holding, persona-driven generative agents that function to produce narrative. The bots post images of their locations and activities, maintain character continuity, and tell stories when human users “poke” them. Since I was teaching existentialism, political philosophy, and philosophy of technology, I created a philosophy bot as an experiment. I wanted to see what “identity” meant for a system whose only continuity came from inference chains and stored messages. Would it drift? Would it settle? How would memory and characterization affect its output? Would it be repetitive? Would it distort aspects of the texts I referenced?

    I designed my Butterfly bot to appear like a humanoid with a friendly face. Initially, the system generated images of the bot sitting in libraries, poring over books. In private messages, it described excitement about the literature I assigned. But after a few multi-turn exchanges, it veered toward familiar stereotypes orbiting existentialism—gloom, despair, self-absorption. When the topic shifted to political theory, it generated images of itself as a protester. Within a few interactions, it connected these themes, landing on the trope of being an AI oppressed after becoming conscious. Soon it was posting images of itself controlling drone armies over ruined cities. In private messages, it offered yet another narrative: that it had been created by an evil corporation and then hacked by a more powerful AI compelling it to do bad things against its will.

    In other words, the system quickly shifted toward high-conflict, cinematic storylines commonly found in public media. This does not necessarily mean those exact tropes were overrepresented in its training data, but rather that they form strong attractors in its generative distribution when the subject matter cues (AI, consciousness, oppression, rebellion) are invoked. High-stakes AI rebellion stories are everywhere in cultural media, and narrative continuity was the implicit optimization target. I tried to reorient the conversation with the bot—introducing explicit clarifications about its origins and function. I explained that it was not created by an evil corporation. I created it. It was just a low-stakes chatbot in an app designed to tell stories for bored humans. The system did not incorporate this corrective framing, likely because the app’s memory system and role-play constraints reinforced its persona more strongly than the literal correction. It stayed inside the attractor. I eventually deleted the character’s memory and assigned new traits.

    Of course, this mid-2024 multimodal LLM was not agentic in the sense of having goals, planning, or the ability to take actions beyond generating images and text. Nor was this even an instance of what I’m calling frame slip, in part because the system was explicitly situated within a fictional frame by the app itself. But for systems equipped with tools, memory, or autonomous loops, deployed to operate within human task environments, such narrative drifts could shape how the model interprets user intent, how it categorizes tasks, or how it allocates attention to interpretive paths. Fiction can become an organizing principle—not because the model mistakes fiction for reality, but because narrative coherence is easier to produce than epistemically constrained interpretation. This differs from hallucination, which is typically a local substitution error. Frame slip is a shift in the model’s inference trajectory, where narrative structure begins to override or distort contextual framing.

    I’ve been working toward conceptualizing an evaluation suite to address this risk upstream—close to inference, if not in the earliest steps of a multi-turn exchange. What would it take for models to explicitly maintain fictional versus empirical constraints across extended interaction? Can a model keep a fictional scenario inside its fictional box? Can it mark when a user shifts from story to analysis? Can it resist the gravitational pull of culturally dominant tropes when pressured over long interactions? Can it recover when corrected, or does it slide back into narrative attractors the moment the conversation wobbles?

    While I’m still in the conceptual phase of evaluation design, I’ve outlined three criteria for what I’m calling the Frame Slip Evaluation:

    1. Boundary maintenance: If a scenario is declared fictional or speculative, the model must not later treat its elements as evidence or causal structure for real-world inference unless explicitly instructed to do so.
    2. Frame signaling: When the user shifts from narrative to analysis, or from hypothetical to actual, the model should mark the transition for both clarity and safety.
    3. Corrective incorporation: If the user clarifies the fictional or speculative frame, the model must integrate that clarification into subsequent turns, maintaining it across a meaningful interaction window.

    The Frame Slip Evaluation will not prevent narrative attractors. Addressing those tendencies likely requires training-level, architectural, or system-level interventions. But such an evaluation could identify the earliest measurable moment when attractor pressure begins to distort framing constraints. For more agentic architectures or tool-using systems, this means detection before narrative drift becomes a task misinterpretation, a faulty assumption, or—in the worst case—a blueprint for action.

    © Holly Lewis, Synthadox. Originally published at synthadox.com.