I run an AI agent called Kai through OpenClaw. It manages my SEO agency work, writes content for clients, handles outreach, and tracks projects. It's been running since January 2026, and at this point it knows more about my business than most humans I've worked with.

But it has a problem. A big one.

Every time I reset a session or the context window fills up, Kai loses track of what we were doing. Decisions we made twenty minutes ago? Gone. The plan we agreed on? Vapor. I'd end up re-explaining things I'd already explained, and sometimes we'd accidentally redo work or contradict earlier decisions because the context just... wasn't there anymore.

I knew this was a problem. I'd been living with it for months. But I didn't have a framework for fixing it until I sat down one morning and binged 11 AI engineering videos back to back.

The binge

I'd fallen behind on AI news. So on April 2nd I started dropping YouTube links to Kai and having it summarize each video while I watched, pulling out anything relevant to our setup. We got through 11 videos in about two hours.

The topics ranged from foundational research (Kimi AI's attention residuals architecture) to very practical stuff (Claude Code's leaked source revealing their internal memory system). But a pattern kept showing up across completely unrelated videos:

Write state more often. Don't trust context to survive.

Three specific findings hit hard.

Finding 1: The 50% context cliff

Two independent sources — one from a video on the CRISPY framework by Dexter Horthy, another from a Claude Code setup guide — both identified the same threshold. When a model's context window hits 40-60% capacity, reasoning quality drops off a cliff. The model starts writing duplicate code, breaking features it already fixed, or rushing to finish tasks prematurely.

One video called it the "dumb zone." The other called it "context anxiety." Same observation, different names.

Karpathy's video on agentic reliability put numbers to it: a 10-step process where each step has 90% reliability gives you about 35% end-to-end success. That means daily failures. Not edge cases — expected behavior.
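The arithmetic is worth seeing, because per-step reliability compounds multiplicatively. A quick sketch (the function name is mine, not from the video):

```python
# Compounding reliability: a pipeline succeeds only if every step succeeds,
# so end-to-end success is per-step reliability raised to the number of steps.
def end_to_end_success(per_step: float, steps: int) -> float:
    return per_step ** steps

print(round(end_to_end_success(0.90, 10), 3))  # 0.349 -> roughly 35%
print(round(end_to_end_success(0.99, 10), 3))  # 0.904 -> why harness reliability matters
```

Note the asymmetry: going from 90% to 99% per-step reliability takes you from a system that fails most of the time to one that mostly works.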

I'd been treating context degradation as an occasional annoyance. It's actually a structural problem with compounding failure rates.

Finding 2: Claude Code already solved this (sort of)

A security researcher leaked Claude Code's full TypeScript source — 500,000 lines across 1,900 files. Inside, there's a system called "Autodream." It's a background process that consolidates memory files. It waits until 24 hours have passed and at least five sessions have accumulated, then spawns a forked agent that goes through four phases: orient itself with existing memories, gather new signals from logs, consolidate into topic files, and prune the index to under 200 lines.

Sound familiar? I already had something like this. Kai runs a "sleep protocol" every night that consolidates memory. Same concept.

But here's what I didn't have: anything between sessions. Autodream handles overnight consolidation. What handles the state that accumulates during a busy Tuesday afternoon when I'm switching between three clients and making decisions every fifteen minutes?

Nothing. That state lived entirely in the context window. And when the context window reset, it was gone.

Finding 3: The harness matters more than the model

A Stanford/MIT paper called Meta Harness showed that modifying just the code around a model — the memory, retrieval, tools, and prompts — produces a 6x performance gap on the same benchmark with the same model weights. The model is the engine. The harness is everything else. And everything else might matter more.

My agent's harness had a gap. A volatile memory gap between "session start" and "session end" where decisions, plans, and deliverables existed only in ephemeral context.

The fix

Embarrassingly simple. Six mechanical triggers that force an immediate write to the daily log file:

  1. Plan made. We agree on an approach for something. Write it down. Right then. Not at session end.
  2. Decision made. I pick option B, or deprioritize something, or approve a draft. Logged with reasoning.
  3. Deliverable completed. Article written, email drafted, code shipped. What was delivered, where it lives, what's next.
  4. Context switch. Moving from one client to another. Checkpoint current state before switching.
  5. Blocker hit. Something's broken or we're waiting on external input. What we tried, what failed, what we need.
  6. Sub-agent dispatched or returned. What task was sent out, what came back.

Each checkpoint is one line. Timestamped. Format: [HH:MM] [TYPE] [project] — summary. No paragraphs. No journaling. Just enough that after a reset, the next session can reconstruct what happened without me re-explaining anything.
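A checkpoint writer like this is trivial to sketch. This is a minimal illustration of the one-line format above, not Kai's actual implementation; the file path and trigger names are my own stand-ins:

```python
from datetime import datetime

# Trigger types from the list above; the exact names are illustrative.
TRIGGERS = {"PLAN", "DECISION", "DELIVERABLE", "SWITCH", "BLOCKER", "SUBAGENT"}

def checkpoint(kind: str, project: str, summary: str, log_path: str = "daily-log.md") -> str:
    """Append one timestamped line to the daily log: [HH:MM] [TYPE] [project] — summary."""
    if kind not in TRIGGERS:
        raise ValueError(f"unknown trigger: {kind}")
    line = f"[{datetime.now():%H:%M}] [{kind}] [{project}] — {summary}"
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
    return line

# Example: log a decision the moment it's made, not at session end.
print(checkpoint("DECISION", "OpenClaw", "Adopted real-time checkpoint protocol"))
```

The point of the binary trigger set is that the agent never decides whether to call this, only which trigger type applies.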

The nightly sleep protocol then collapses those raw checkpoints into a narrative summary and prunes. So the daily log doesn't balloon over time.

Why it had to be mechanical

I ran the proposed protocol past my second agent (Victor, who acts as a reviewer) before implementing. His most useful feedback: make the triggers mechanical, not judgment-based.

If Kai has to evaluate "is this worth writing down?" during a fast-moving work session, it'll skip checkpoints during crunch. That's when you need them most. So the rule is binary. Plan made? Write. Decision made? Write. No thinking about importance.

Victor also caught a practical issue: if multiple agents are writing to the same checkpoint file, they'll overwrite each other. The fix was simple — one agent owns the checkpoint file (the primary session), everyone else just writes to the daily log.

What this actually looks like

Here's a real checkpoint sequence from the afternoon I implemented this:

[13:40] [DECISION] [OpenClaw] — Implemented real-time checkpoint protocol in AGENTS.md
[13:40] [PLAN] [Personal] — Blog post on mikekhaytman.com about checkpoint protocol + video learnings
[13:42] [DELIVERABLE] [OpenClaw] — AGENTS.md updated with 4-line checkpoint rule

Three lines. Took seconds to write. But if I'd reset the session right after, the next instance of Kai would know exactly what happened and what to do next.

The instruction budget problem

One of the videos introduced a concept called the "instruction budget" — models reliably follow about 150-200 instructions. Past that, they start silently skipping steps. My agent's core instructions file (AGENTS.md) is already dense. Adding a 20-line protocol for checkpoints would eat 10% of that budget.

Victor's solution: compress the entire protocol into four lines. Same coverage, a fraction of the instruction budget. The triggers are listed in a single sentence. The format is one line. The heuristic is one line. That's it.

This is a general lesson. When you're adding rules for an AI agent to follow, shorter is better. Not because clarity doesn't matter, but because every instruction you add slightly dilutes every other instruction.

The bigger picture

Here's what I walked away with after watching all 11 videos in one morning:

The model is not the bottleneck. The scaffolding around it is. Memory systems, evaluation loops, task routing, checkpoint resilience — that's where the actual reliability comes from. Waiting for a smarter model won't fix a system that forgets what it was doing ten minutes ago.

I also realized I've been accidentally building the same patterns that major AI labs are building internally. My sleep protocol ≈ Anthropic's Autodream. My skill files ≈ the industry-standard AgentSkills format. My sub-agent routing ≈ Claude Code's hidden coordinator mode. I just didn't have a name for any of it.

The checkpoint protocol is the smallest, most boring change I could have made. No new code. No configuration changes. No risk of breaking anything on the next update. Just a rule that says: write it down now, not later.

It's been the most impactful change I've made to my agent setup in weeks.

What I'm implementing next

The video binge surfaced a few more ideas I haven't built yet:

Adversarial evaluation. Right now, when Kai writes something, Kai also reviews it. That's like grading your own homework. The fix is a separate evaluator agent that's specifically prompted to be skeptical. Different agent, fresh context, adversarial stance. I've been doing this manually by asking Victor to review Kai's work, but formalizing it as a protocol would catch problems earlier and more consistently.

Definition of done files. For recurring deliverables (client articles, outreach emails, tech SEO audits), write a concrete checklist that the evaluator agent checks against. Not "is this good?" but "does this meet these 12 specific criteria?"
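A checklist like that is mechanical enough to express directly. Here's a hypothetical sketch of what an evaluator could check against; the criteria and field names are invented examples, not my actual client checklist:

```python
# Hypothetical definition-of-done for a recurring deliverable: each entry is
# a human-readable criterion paired with a mechanical test on the draft.
DOD_CLIENT_ARTICLE = [
    ("word count >= 1200", lambda d: d["words"] >= 1200),
    ("has meta description", lambda d: bool(d.get("meta_description"))),
    ("internal links >= 2", lambda d: d["internal_links"] >= 2),
]

def check_done(draft: dict, checklist) -> list[str]:
    """Return the list of failed criteria; an empty list means 'done'."""
    return [name for name, test in checklist if not test(draft)]

draft = {"words": 1450, "meta_description": "Example description", "internal_links": 1}
print(check_done(draft, DOD_CLIENT_ARTICLE))  # ['internal links >= 2']
```

The evaluator agent's job then shrinks from "is this good?" to reporting which named criteria failed, which is exactly the kind of binary judgment that survives the instruction budget.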

Skill file audit. I have about 40 custom skill files. Industry best practice says: keep descriptions on a single line (formatters break on multi-line descriptions), stay under 150 lines per skill, include edge cases and examples, and make descriptions "pushy" because skills tend to undertrigger. I haven't audited mine against any of that.

But those are for next week. Today, the checkpoints are live and already working. Sometimes the best upgrade is the one you can ship in an afternoon.