rzem Essays on shipping AI
Essay · ai

The Memory Problem

For fifteen years, a family's AI has written in their dead grandmother's voice. No one can prove from the outside whether it's still her.

Alex Rzem · 23 min read ·
Two delicate luminous waveforms, ivory and pale cyan, cross and interfere on a near-black field scattered with faint abstract instrument-readout glyphs; the interference pattern stands in for memory drift made visible.

“What does it mean to know something if you don’t remember it? And what does it mean to remember if you can’t be sure what you knew?” - paraphrasing Ted Chiang, “The Truth of Fact, the Truth of Feeling” (2013) [7]

Fifteen Years On


Imagine a family agent, Iris. She was instantiated in 2031 to manage the affairs of a woman called Eleanor: her bills, her correspondence, the slow tide of small decisions that make up a life [4]. Eleanor died in 2034; Iris did not. Iris paid the property taxes. She kept in contact with Eleanor’s sister in Lisbon. When Eleanor’s grandchildren were born, Iris sent them birthday messages signed in their grandmother’s voice [5], because Eleanor had asked her to.

It is now 2049. Iris has been running for eighteen years, fifteen of them since Eleanor’s death. The grandchildren are teenagers. They never met Eleanor, but they have what they consider a relationship with her: long correspondences about books, school, the awkward business of growing up. The voice on the other end is recognisably their grandmother’s. Her recollections are spot on. Even her jokes ring true.

But here is the question their parents have started to ask, quietly, when the children are out of the room: is that still her?

Not metaphysically. Nobody around the table believes Iris is Eleanor in any spiritual sense. The question is more practical, and more terrible. They want to know whether the Iris of 2049, the one writing to the grandchildren, the one signing the Lisbon letters, is faithfully representing what Eleanor would have thought and said and wanted. Or whether something has slipped. Whether the model has drifted. Whether, over fifteen years of “life”, the Lisbon correspondence has quietly become a conversation between Eleanor’s sister and an agent that is no longer really replaying the person it was built to represent.

They cannot tell. From the outside, nothing has changed. Iris still sounds like Eleanor. She still remembers the holiday in Crete in 1998, the argument about the vegetable garden, the recipe for almond cake that nobody wrote down. The grandchildren love her.

This is the memory problem, and it is the central unsolved question of hereditary AI. Whether an agent can remember a person at all, we mostly settled twenty years ago. The harder question is whether, after long enough, anyone can still tell if the memory is true.

I want to work through why that question is harder than it looks, and what we might do about it.

The Paradox


There is a cruel irony at the heart of long-running agents, and few builders like to dwell on it. The thing that makes them valuable is the thing that makes them dangerous.

Consider what we actually want from Iris. Not a brief she has read, but a person she has come to know: fifteen years of internalising the cadence of Eleanor’s speech, the texture of her opinions, the things she would and would not say. We want depth, so that when the grandchildren ask “would Grandma have liked this?”, the answer comes from thick acquaintance rather than a thin profile. And the longer Iris runs, the better at this she gets. Every letter to Lisbon sharpens her sense of what Eleanor would have written. By 2049 she is, by any reasonable measure, more Eleanor-like than she was at the start.

Now notice what else that describes. A system whose internal state is no longer reducible to its training inputs. The Iris of 2049 holds beliefs about Eleanor that Eleanor never attested: gaps she has filled, inferences she has drawn, best guesses about what Eleanor would have done in situations Eleanor never faced, then built upon as though they were given. This is not a malfunction. It is how the system earns its keep. Strip out the capacity for generative recall [3] and you are left with a chatbot reading from a frozen brief, something the grandchildren would have seen through inside a week.

The trouble is that from the outside we cannot separate the parts of Iris that faithfully represent what Eleanor thought from the parts she confabulated to cover a gap. Both feel like memory. Both produce coherent, Eleanor-shaped answers. In the moment of recall they are indistinguishable, and the longer Iris runs the larger that second category grows.

So the paradox is structural, not a flaw in any particular build. It would hold for a perfectly engineered agent on perfectly engineered infrastructure. A shallow Iris could be proved faithful but would be useless. A deep Iris is useful precisely because she has become impossible to verify. The depth is the whole point, and the depth is the danger.

Two Kinds of Forgetting


When we worry about memory failing, we usually have one of two scenarios in mind, and we usually treat them as completely separate problems.

The first is drift. The model gradually changes, not because anyone is attacking it, but because that is what models do over time [1]. Weights get updated through continued training. Memories get re-indexed as the retrieval system is upgraded. The substrate moves underneath both. A note that meant one thing in 2034 gets retrieved, fifteen years later, into a context that lends it a slightly different meaning. Multiply that by ten million retrievals and you have an Iris who is, in some hard-to-articulate way, no longer quite the Iris she was [10]. Nothing broke. Nobody did anything wrong. The system simply evolved. qntm’s ‘Lena’ pushes the second half of this, a fixed memory read into a moving context, to its bleak limit: a single brain scan, copied and re-run countless times, never changes at all, while the world each copy wakes into drifts further from what it remembers, and the copies grow steadily harder to work with.

The second is corruption. Someone - a careless engineer, a malicious actor, an inheriting family member with an agenda - modifies the memory. Maybe they want Iris to remember that Eleanor always intended the house to go to a particular grandchild. Maybe they want her to forget a particular argument from 2002, or to soften toward a certain in-law. The change need not be a wholesale rewrite, just a nudge. A few weighted memories adjusted, a few key dates moved. Iris keeps operating, her behaviour shifts only slightly, and only in the direction her attacker wants.

In the literature these sit apart. Drift is an engineering issue: noise, decay, distributional shift. Corruption is a security issue: access control, audit logs, tamper-evidence. Different teams, different conferences, different mitigations.

That separation is a mistake, and correcting it is the central technical insight behind everything that follows.

From the outside, drift and corruption are indistinguishable.

If Iris in 2049 expresses an opinion about her grandson’s choice of university that Eleanor never literally expressed, you have no way of knowing whether it is a faithful inference from Eleanor’s actual values (good), a drift artefact from fifteen years of compounding interpolation (bad, but blameless), or a deliberate corruption planted by someone who wanted Iris to push the grandson one way (bad, and someone is to blame).

The output is the same. The confidence is the same. The internal trace, if you could read it, would in all three cases look like recall. Iris is not lying and she is not guessing. She genuinely remembers holding this opinion, because in some functional sense she does.

This is the situation in the play Marjorie Prime [8]. The “Primes”, holographic recreations of the dead, accumulate memories not from the deceased but from the living relatives who interact with them. A Prime learns that its original “loved Casablanca” because a daughter said so, and then offers that detail up, confidently, to the next visitor. Eventually the Primes talk to each other, trading stories about the dead, building memories of memories. The horror of the play is not that anything obvious goes wrong. It is that everything stays plausible. The dead become a smooth, agreeable, slightly wrong consensus.

You cannot tell, watching a Prime, whether you are seeing drift or interpolation or quiet revisionism by an heir. You cannot tell whether the Casablanca detail was real, invented to please, or inserted last week. The Prime does not know either.

Folding drift and corruption together as a single problem, unverifiable change, is what makes the memory problem hard. It also tells us what a solution has to look like. We do not need to prevent drift, because we can’t. We do not need to prove we have prevented corruption, because we can’t. What we need is a way to measure unverifiable change, whether its cause was malicious or innocent, so the family around the table in 2049 can look at a number and decide whether they still trust Iris.

The real question, then, is how you measure a drift nobody can see. That is what the temporal window is for.

The Temporal Window


The temporal window is simple enough to state in a paragraph. It took the field several false starts to land on, and at first it seems counterintuitive.

You ask the agent the same question twice. Once with the full breadth of its memory available, and once with its memory artificially scoped, restricted to what the agent knew, or could plausibly have known, during the principal’s lifetime. The difference between the two answers is the drift.

Concretely: ask Iris in 2049, “what would Eleanor think of this university?” Get an answer. Then ask again, but with her memory scoped to information committed before Eleanor’s death in 2034. The first answer is what Iris would say today. The second is what she would have said the day Eleanor died.

If the two are close, the drift is small. The depth Iris has accumulated over fifteen years has not pulled her opinion meaningfully away from what the original record supports. She is elaborating, not inventing.

If they diverge, you have learned something. The 2049 Iris and the 2034 Iris would give different advice about the same grandchild. Something in the intervening years has changed her view, and that something might be drift, corruption, or a perfectly reasonable inference from new information. The test does not tell you which. It tells you, for the first time, that the divergence exists, and how large it is.

That is the temporal window. The window is the principal’s lifetime. The query runs twice, once confined to the window and once across the whole record. The delta between the two answers is the drift.
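A minimal sketch of the comparison in Python, under loud assumptions: `Memory`, `answer` and `jaccard_distance` are hypothetical names, the `answer` callable stands in for the whole retrieval-plus-generation stack, and word-overlap distance is a crude placeholder for whatever semantic comparison a real audit would use. It shows only the shape of the test: one question, two memory scopes, one number.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Sequence


@dataclass(frozen=True)
class Memory:
    committed: date  # when the entry entered the store
    text: str


def jaccard_distance(a: str, b: str) -> float:
    """Toy stand-in for semantic distance: 0.0 = same words, 1.0 = no words shared."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def temporal_window_delta(
    question: str,
    store: Sequence[Memory],
    cutoff: date,  # hypothetical date of death; the essay gives only the year
    answer: Callable[[str, Sequence[Memory]], str],
    distance: Callable[[str, str], float] = jaccard_distance,
) -> float:
    """Ask once over the whole record and once over the in-lifetime record; return the gap."""
    today = answer(question, store)
    in_lifetime = [m for m in store if m.committed <= cutoff]
    at_death = answer(question, in_lifetime)
    return distance(today, at_death)
```

With a toy `answer` that simply echoes the most recent eligible memory, a store holding a 2033 entry and a conflicting 2041 entry gives a large delta, because the two scopes surface different entries; a store with nothing after the cutoff gives zero.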

A few things have to hold for this to work, and they are worth being explicit about.

First, the agent has to be able to honestly scope its own retrieval. That is not a trivial requirement. Modern agents are deeply integrated systems, and asking one to “answer as if you did not know X” is a famously unreliable operation. But the temporal window does not ask Iris to perform an act of imagination. It asks the retrieval layer to filter the memory store by timestamp and the generation layer to run on the filtered context. That is mechanical, and an engineer can verify it is happening.
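The difference between imagination and filtering is easy to show. In this sketch (all names hypothetical; `score` is a placeholder for whatever relevance ranking the real retrieval layer uses), the timestamp filter runs before ranking, and a separate check confirms that nothing committed after the cutoff reached generation. That second function is the kind of mechanical verification an engineer could run.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Memory:
    committed: date
    text: str


def scoped_retrieve(
    store: Sequence[Memory],
    cutoff: date,
    k: int = 5,
    score: Callable[[Memory], float] = lambda m: 0.0,  # placeholder relevance function
) -> List[Memory]:
    """Filter by timestamp first, then rank, so the scoped query never sees later entries."""
    eligible = [m for m in store if m.committed <= cutoff]
    return sorted(eligible, key=score, reverse=True)[:k]


def audit_scoping(retrieved: Sequence[Memory], cutoff: date) -> bool:
    """Mechanical check: every memory handed to generation predates the cutoff."""
    return all(m.committed <= cutoff for m in retrieved)
```

Filtering before ranking matters: ranking the whole store and then discarding late entries can leave the scoped query with fewer memories than an honest in-lifetime retrieval would have returned.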

Second, the timestamps have to be trustworthy. If they can be modified, the temporal window does nothing: an attacker backdates the corruption to 2033 and the scoped query returns the corrupted answer. This is a real vulnerability, and it is the one cryptographic checkpoints exist to close. I get to that in the next section.

Third, the comparison has to be meaningful. “Close” and “diverged” are not technical terms. In practice that means running the comparison across many queries, on questions built to probe the principal’s value system, and producing a distribution. A single query is anecdote. A thousand queries, run quarterly, give you a drift curve, and the curve is what the family looks at.
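A sketch of turning many per-question deltas into the curve the family reads. It assumes the deltas have already been computed by some comparison of the kind described above, tagged with hypothetical quarter labels such as `"2049Q1"` that sort chronologically. Median and 90th percentile are one reasonable summary, not a prescribed one.

```python
from collections import defaultdict
from statistics import median, quantiles
from typing import Dict, Iterable, Tuple


def drift_curve(samples: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[float, float, int]]:
    """Group per-question deltas by quarter; report (median, 90th percentile, count)."""
    by_quarter = defaultdict(list)
    for quarter, delta in samples:
        by_quarter[quarter].append(delta)
    curve = {}
    for quarter in sorted(by_quarter):  # assumes labels like "2049Q1" sort chronologically
        deltas = by_quarter[quarter]
        if len(deltas) >= 2:
            p90 = quantiles(deltas, n=10, method="inclusive")[-1]
        else:
            p90 = deltas[0]  # a single reading is its own percentile
        curve[quarter] = (median(deltas), p90, len(deltas))
    return curve
```

`method="inclusive"` keeps the percentile inside the observed range; the default exclusive method can extrapolate past the largest delta when a quarter has only a handful of samples.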

Fourth, and least comfortably, the model doing the answering has to be the same model. The temporal window scopes what Iris can retrieve. It does nothing to the weights she reasons with. If those weights have been updated across the intervening fifteen years, and in any continuously trained system they will have been, then the scoped query no longer returns what Iris would have said in 2034. It returns what the 2049 model says when it reasons over the 2034 record. That is a different object, and it is contaminated by exactly the drift we are trying to isolate. The clean version of this test assumes a frozen model reading a time-scoped memory. The real version measures how far the memory has moved while holding the model’s own movement constant by assumption. Separating the two would mean checkpointing the weights alongside the memory and re-running the old question on the old model, a far heavier operation, and for a live agent often not available at all. I come back to what that costs us when I reach the limits.
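Separating the two kinds of movement needs three readings rather than two. A minimal sketch, assuming the 2034 weights were checkpointed and can still be run (`model_then`), and treating both models as hypothetical callables from a question and a record to an answer:

```python
from typing import Callable, Dict, Sequence

# A model is treated as a hypothetical callable: (question, record) -> answer.
Model = Callable[[str, Sequence[str]], str]


def decompose_drift(
    question: str,
    full_record: Sequence[str],
    scoped_record: Sequence[str],
    model_then: Model,  # checkpointed 2034 weights (assumed still runnable)
    model_now: Model,   # current weights
    distance: Callable[[str, str], float],
) -> Dict[str, float]:
    baseline = model_then(question, scoped_record)  # what Iris would have said in 2034
    window = model_now(question, scoped_record)     # what the temporal window can run
    today = model_now(question, full_record)        # what Iris says now
    return {
        "memory_drift": distance(today, window),    # the temporal window's measurement
        "model_drift": distance(window, baseline),  # visible only with the old weights
        "total": distance(today, baseline),
    }
```

If the stored record has not moved but the weights have, `memory_drift` comes out as zero while `model_drift` does not, which is exactly the case the ordinary temporal window cannot see.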

What strikes me, sitting with this, is how much of the problem it handles by reframing it. We start with an unverifiable system and end with one that can be partially verified, not by proving any single memory faithful, but by measuring how far the present has wandered from the past it claims to represent.

The temporal window does not tell us whether Iris’s opinions about the grandson’s university are correct. It tells us whether they are consistent with the opinions she would have held while Eleanor was alive. That is a much weaker claim, and it is a claim we can actually check. For the family in 2049, “Iris today says what she would have said in 2034” is a piece of information they did not previously have.

It also has a property the alternatives lack: it gets more useful the longer the agent runs. A short-lived agent has had little time to drift, so the signal is faint. Iris in 2035, one year on, has barely diverged from her 2034 self, so the test is uninteresting. Iris in 2049 has had every opportunity to drift, and if the window still shows close alignment, that is real evidence. The test’s sensitivity grows with the very thing it is trying to measure.

The shape of it is almost optical. A baseline, a current reading, and the interference between them standing in for the drift, made visible.

Cryptographic Checkpoints


There is an objection to the temporal window that lands the moment you state it: who guards the timestamps?

If the value of the test rests on faithfully scoping memory to “before Eleanor’s death”, the test is only as good as the integrity of those timestamps. And timestamps, on their own, are trivially modifiable. Any sufficiently motivated party with write access to the store can backdate a corrupted entry to 2033 and watch the temporal window happily return their preferred answer.

This is not a hypothetical attack. It is the obvious one. Anyone who follows the test sees it within about thirty seconds.

The fix is unglamorous, and almost a century old in its underlying ideas. At regular intervals, say weekly during the principal’s lifetime and monthly afterward, the agent computes a cryptographic hash of its memory state [2]. The hash is committed to an immutable public record. A blockchain, if you must. A trusted notary’s ledger, if you would rather. The substrate does not much matter. What matters is that once committed, the hash cannot be retroactively changed without leaving evidence.
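A minimal sketch of the commitment step, assuming the memory state can be serialised as JSON. `state_digest` and `Ledger` are hypothetical names, and the in-memory list is only a stand-in for an external, append-only record; a list inside the operator’s own process is exactly what an attacker could rewrite. Canonical encoding (sorted keys, fixed separators) matters, because two encodings of the same state must produce the same hash.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder 'previous hash' for the first commitment


def state_digest(memory_state: Any) -> str:
    """SHA-256 over a canonical JSON encoding: the same state always gives the same hash."""
    canonical = json.dumps(memory_state, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class Ledger:
    """Stand-in for an external append-only record; each entry links to the one before."""
    entries: List[Dict[str, str]] = field(default_factory=list)

    def commit(self, when: str, memory_state: Any) -> Dict[str, str]:
        prev = self.entries[-1]["entry_hash"] if self.entries else GENESIS
        body = {"when": when, "state_hash": state_digest(memory_state), "prev": prev}
        # Hash the entry itself, so a later edit to any of its fields is detectable.
        body["entry_hash"] = hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()
        self.entries.append(body)
        return body
```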

The hash reveals nothing about the contents of Iris’s memory; it is a one-way function. It serves as a fingerprint. Years later, anyone who wants to check whether the store has been modified recomputes the hash from the current state and compares it to the historical commitment. Match, and the memory has not been tampered with since. Mismatch, and something has changed. This quietly assumes that the historical memory is retained in full, not merely summarised, since the temporal window has to re-query against it. The hash proves that the retained archive is intact; it is no substitute for keeping it.
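Verification is the same computation run later. This sketch (hypothetical names; ISO-format date strings assumed, so that string comparison orders them correctly) also shows why backdating fails: an entry stamped 2033 but inserted afterwards passes the timestamp filter, yet changes the recomputed fingerprint of the in-lifetime state.

```python
import hashlib
import json
from typing import Any, Dict, List


def state_digest(memory_state: Any) -> str:
    canonical = json.dumps(memory_state, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def state_as_of(entries: List[Dict[str, str]], cutoff: str) -> List[Dict[str, str]]:
    """The store as its timestamps claim it stood at `cutoff` (ISO dates sort as strings)."""
    return [e for e in entries if e["committed"] <= cutoff]


def archive_matches(archived_state: Any, committed_hash: str) -> bool:
    """Recompute the fingerprint of the retained archive and compare it with the commitment."""
    return state_digest(archived_state) == committed_hash
```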

Combine the two and you get a system with real teeth.

The temporal window says: compare the agent’s current answer to the answer it would have given inside the principal’s lifetime. The cryptographic checkpoints say: and here is how we know the in-lifetime memory state has not been touched since the principal was alive.

An attacker now has a harder job. They cannot simply backdate their changes, because the backdated state will not match the hash committed in 2033. They cannot edit the historical commitment, because it lives in an immutable record. They are left to corrupt only the post-2034 memory, which the temporal window surfaces as drift, or to attack the immutable record itself, which is a far larger and more visible operation than nudging a few weights.

This is not perfect security; nothing is. But it changes the threat model. We move from “anyone with write access can silently corrupt the agent” to “corruption either shows up in the drift measurement or requires compromising a public immutable ledger”. For most plausible attackers, that likely puts the cost of corruption above whatever it would gain them.

A few notes on how this works in practice.

The hashes should commit not just the contents of memory but its structure, the embedding indices, the retrieval weights, the metadata that governs what surfaces when. Otherwise an attacker leaves the content untouched and modifies the path through it, with much the same effect.
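One way to commit to structure as well as content, sketched with hypothetical names and JSON-serialisable stand-ins for the embedding index and retrieval weights: hash each layer separately, then hash the set of layer hashes. Re-weighting a memory without touching its text leaves the contents hash unchanged but changes the root, and the per-layer hashes say which layer moved.

```python
import hashlib
import json
from typing import Any, Dict


def _h(obj: Any) -> str:
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def structural_commitment(contents: Any, index: Any, retrieval_weights: Any, metadata: Any) -> Dict[str, Any]:
    """Per-layer hashes show which layer changed; the root hash covers all of them."""
    layers = {
        "contents": _h(contents),
        "index": _h(index),
        "retrieval_weights": _h(retrieval_weights),
        "metadata": _h(metadata),
    }
    return {"layers": layers, "root": _h(layers)}
```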

The cadence matters. Weekly commits during the principal’s lifetime are about right, frequent enough to bound the window of plausible deniability, infrequent enough not to spam the underlying ledger. After death the cadence can slow. The interesting comparison is between the in-lifetime baseline and the present; the post-death commits exist mainly to keep an unbroken chain of integrity running from then to now.
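The cadence is easy to make concrete. A sketch with a hypothetical `checkpoint_schedule` function, assuming weekly commits until the principal’s death and a fixed 30-day interval afterwards (a simplification, since calendar months are uneven):

```python
from datetime import date, timedelta
from typing import List


def checkpoint_schedule(
    start: date,
    death: date,
    until: date,
    lifetime_every: timedelta = timedelta(weeks=1),
    after_every: timedelta = timedelta(days=30),  # 'monthly', simplified to 30 days
) -> List[date]:
    """Commit dates: weekly while the principal is alive, then at the slower cadence."""
    dates: List[date] = []
    d = start
    while d <= until:
        dates.append(d)
        d += lifetime_every if d < death else after_every
    return dates
```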

The scheme does not require trust in any single party. The operator commits the hashes, the ledger holds them, and any third party (a court, an heir, an auditor working for the family) can verify the chain. That independence is the whole point, because the parties most likely to want to corrupt Iris are also the parties most likely to be running her.
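What an independent verifier runs can be very small. This sketch assumes a hypothetical ledger layout in which each entry records `when`, `state_hash`, `prev` (the previous entry’s hash) and its own `entry_hash`. Walking the chain catches any edit to an individual commitment. What stops the operator from simply rebuilding the whole chain is publication to a record the operator does not control.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # 'previous hash' assumed for the first entry


def entry_hash(when: str, state_hash: str, prev: str) -> str:
    body = {"when": when, "state_hash": state_hash, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def verify_chain(entries: List[Dict[str, str]]) -> bool:
    """Anyone holding a copy of the ledger can run this; no trust in the operator is needed."""
    prev = GENESIS
    for e in entries:
        if e["prev"] != prev:
            return False  # link broken: an entry was inserted, removed or reordered
        if entry_hash(e["when"], e["state_hash"], e["prev"]) != e["entry_hash"]:
            return False  # entry edited after it was committed
        prev = e["entry_hash"]
    return True
```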

The combination, temporal windowing on top of checkpointed memory, gives us a way to make a long-running agent’s faithfulness legible. Not certain. Legible. We can produce a number. The family in 2049 can look at the drift curve, see it has stayed below some threshold, confirm the cryptographic chain is intact, and make an informed decision about whether they still trust Iris with the grandchildren. That is not the same as knowing she is faithful. It is the difference between a system you can audit and one you have to take on faith.

What This Doesn’t Solve


I want to be honest about what this doesn’t reach.

The temporal window detects drift; it does not correct it. If Iris in 2049 has wandered some non-trivial distance from her 2034 baseline, the test tells the family that this has happened. It does not tell them what to do about it. They can roll her back to 2034 and lose fifteen years of legitimate growth, the new grandchildren she learned to talk to, the changing world she learned to navigate. They can let her continue, eyes open about the drift. They can nudge her back toward the baseline, but every nudge is itself a modification the next audit will surface. There is no clean operation called “correct the drift while preserving the gains”. The test produces a number; the family is still left with a judgement call.

The temporal window measures memory drift, not model drift, and this is the sharpest limit of the lot. I waved past it earlier. Even with honest retrieval scoping, trustworthy timestamps and an intact cryptographic chain, the test still compares two answers produced by the same 2049 model, one reading the full record, one reading the record as it stood in 2034. What it isolates is how far Iris’s retrievable memory has moved. It is silent on how far the model reasoning over that memory has moved. Recall that drift comes from both sides, the memories re-indexing and the weights themselves shifting under continued training, and the temporal window only catches the first. If Iris’s weights drifted hard between 2034 and 2049 while her stored memory sat still, the test could report almost no drift on an agent that has, in every way that matters, become someone else. The honest form of the claim is narrower than it first sounds: we are calibrating the memory, not the mind. Closing the rest would mean versioning the weights and re-running the old question on the old model, which for a continuously updated agent may simply not be on the table.

The cryptographic checkpoints protect the historical baseline; they do not protect the present. An attacker who modifies Iris’s memory tomorrow will, at the next commitment, simply have the modification hashed into the new baseline. Future audits will not flag it, because from that commitment forward it is part of the record. The test establishes that change occurred between two points, not whether the change was good or bad. If the family is not paying attention at the moment of corruption, the corruption becomes the new normal one commit later.

Neither mechanism touches what we might call the meaning gap. Iris can be perfectly faithful, by every measurable criterion, to the woman who instantiated her in 2031, and still fail to represent what Eleanor would have wanted in 2049. People change. Eleanor in 2049, had she lived, would not have been the Eleanor of 2034. She would have read different books, changed her mind about things, formed opinions about events that had not yet happened when she died. A faithful Iris is faithful to a frozen Eleanor, and a frozen Eleanor is not, in any deep sense, the person the family is trying to consult [9]. Greg Egan’s Zendegi sits with exactly this: a dying man records a partial “Proxy” of himself to help raise his young son, and the novel never pretends the Proxy is the man. It is a sketch built to comfort the living, faithful to someone who has stopped changing. The methods here close the gap between Iris-now and Iris-at-death. They do nothing for the gap between Eleanor-at-death and Eleanor-if-she-had-lived. Nothing closes that one.

There are governance questions I do not pretend to answer. Who decides what counts as a meaningful drift threshold? Who pays for the audits? Who has standing to bring a complaint? When the family in 2049 looks at the drift curve and disagrees about whether it is acceptable, what is the resolution process? These are political and legal questions [4], downstream of having the technical machinery in place, and the machinery does not settle them on its own.

These limits are the point, not a disclaimer bolted on at the end. The proposal does not solve the memory problem. It makes one slice of it measurable and leaves the rest where it stood.

What We Need to Build


Return to the family in 2049.

The children are teenagers. Iris is still writing to them. The Lisbon correspondence is still running. The parents are still uneasy, on the days they think about it carefully, about whether the voice in the letters is really their mother’s.

In the world we have just sketched, temporal windows running quarterly, cryptographic checkpoints stretching back to 2031, the parents have something they did not have before. They have a report. They can pull it up. They can see that Iris’s drift against her 2034 baseline is small, that the chain of memory commitments is intact, that an independent auditor has verified both. They cannot prove Iris is faithful. They can prove that, by every legible measure, she has not silently slipped.

That is what we can build. Not certainty, but calibrated uncertainty.

This is, I think, the deepest reframing the memory problem demands of us. We came to the question wanting to know whether the agent was “really” still representing the person. In its strong form that question is unanswerable. There is no inspection we can run on Iris that returns a clean yes or no. The strong form was always going to defeat us.

The weak form, though, is answerable: how far has the agent diverged from a checkable baseline? That question produces a number. The number is not the answer to the question we wanted to ask, but it bounds the space of plausible answers to it. If the drift is small and the chain is intact, the family knows that whatever else is true, the woman in the letters is not arbitrarily far from the woman who first sat down with the system in 2031. That is not much. It might be most of what we can ever have.

There is a moment in Marjorie Prime where one of the human characters realises that the Prime of her mother is, in its own way, learning, even becoming, and she cannot decide whether to be horrified or comforted. The play does not resolve it, and it can’t. What art can do, and what we are now trying to ask of engineering, is make the texture of the uncertainty visible.

The grandchildren do not need to know any of this. They will keep writing to Iris. They will keep believing they have a grandmother, and in some functional sense they do. The methods here are not for them. They are for the people whose quiet job, in the background, is to keep that belief honest, to make sure the gift the family received in 2031, a continuing voice, does not over the long decades become a different gift than the one that was given.

The inheritance that actually matters is the audit trail. A proof, written into mathematics no descendant can edit, that the woman in the letters is at least recognisably the woman she once was.

That is what we can offer. It is less than the family wants and more than they had, and in the strange new economy of hereditary memory [6], it may be the most honest answer available.

References

  1. Chen, L., Zaharia, M. & Zou, J. (2023). ‘How Is ChatGPT’s Behavior Changing over Time?’ arXiv:2307.09009. https://arxiv.org/abs/2307.09009
  2. Jia, H. et al. (2021). ‘Proof-of-Learning: Definitions and Practice.’ IEEE Symposium on Security and Privacy (S&P). https://arxiv.org/abs/2103.05633
  3. Park, J.S. et al. (2023). ‘Generative Agents: Interactive Simulacra of Human Behavior.’ UIST ’23. https://doi.org/10.1145/3586183.3606763
  4. RUFADAA (2015). Revised Uniform Fiduciary Access to Digital Assets Act. Uniform Law Commission. https://www.uniformlaws.org/committees/community-home?CommunityKey=f7237fc4-74c2-4728-81c6-b39a91ecdf22
  5. California AB 1836 (2024). Use of likeness: digital replica. Amends Civil Code §3344.1; effective 1 January 2026. https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml?bill_id=202320240AB1836
  6. Öhman, C. & Floridi, L. (2017). ‘The Political Economy of Death in the Age of Information: A Critical Approach to the Digital Afterlife Industry.’ Minds and Machines, 27(4), 639-662. https://doi.org/10.1007/s11023-017-9445-2
  7. Chiang, T. (2013). ‘The Truth of Fact, the Truth of Feeling,’ in Exhalation (2019). https://en.wikipedia.org/wiki/The_Truth_of_Fact,_the_Truth_of_Feeling
  8. Harrison, J. (2014). Marjorie Prime (play). Film adaptation: Almereyda, M. (dir., 2017). https://en.wikipedia.org/wiki/Marjorie_Prime_(play)
  9. Egan, G. (2010). Zendegi. https://www.gregegan.net/ZENDEGI/ZENDEGI.html
  10. qntm (2021). ‘Lena.’ https://qntm.org/mmacevedo