From Prompts to Pipelines: The OS Layer AI Agents Are Missing
Prompt engineering was a coping mechanism. Context engineering is the real work
In 1974, Unix introduced a radical abstraction: everything is a file. Devices, pipes, processes, all accessible through the same interface. It sounds almost obvious now. It wasn’t. Before Unix, every piece of hardware needed its own special integration. Programmers loved it. (They did not love it.)
Fifty years later, we’re building AI agents the way people built software before Unix: every integration is bespoke, state is scattered, and there’s no unified way to navigate what the system knows. It’s as if we’ve forgotten the lessons of the past.
A new paper argues we need the same shift for AI. And if you look at where Claude Code, Cursor, and MCP infrastructure are headed, you’ll see it’s already happening. Slowly. Painfully. But that’s how every big shift happens.
Model intelligence isn’t the problem
You’ve probably experienced this: an AI agent that works brilliantly for five minutes, then loses the thread. Same prompt, same model, same tools. But it forgot what you told it three turns ago. Or it hallucinates information that doesn’t exist. Or it burns through your token budget loading context it didn’t need, then tells you it has to compact the conversation.
You point this out. “You’re absolutely right, I apologize for the confusion!” it replies cheerfully. Then it hallucinates the same information, this time with different wording that sounds slightly more credible.
You try again, more firmly this time. You use capital letters. You add “IMPORTANT:” to the prompt. The model thanks you (again) for the clarification and does something entirely different but equally wrong.
The bottleneck isn’t “how smart is the model?” It’s “how well do we manage its context?”
That’s the argument in Everything is Context: Agentic File System Abstraction for Context Engineering. The authors propose that context engineering, not prompt engineering, is the discipline that matters now. Prompt engineering is asking “how do I phrase this so the model understands?” Context engineering is asking “why does this thing keep forgetting everything I tell it, and what infrastructure would fix that?”
What if everything the agent sees were a file?
The paper’s core move is borrowed from Unix: create a unified namespace where context sources (memory, tools, knowledge bases, human input) all appear as files and directories.
In their Agentic File System (AFS):
MCP servers, vector stores, APIs, logs, and user profiles mount into a single addressable space
Agents use a tiny tool surface: afs_list, afs_read, afs_write, afs_search, afs_exec
Backends vary underneath: relational DBs, vector stores, knowledge graphs. The agent doesn’t care. That’s the point.
Why does this matter? Because right now, every new capability means new tool definitions crammed into the context window. “Here are 47 functions you can call, please read all of them carefully before doing anything.” AFS flips that: instead of eagerly loading everything, the model discovers and loads what it needs on demand. Like a file system. Because it is one.
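To make that concrete, here’s a minimal sketch (in Python, my choice, not the paper’s) of what that five-call surface could look like with a mount-table design. The afs_* tool names come from the paper; the Backend protocol, dispatch logic, and signatures are assumptions for illustration.

```python
# Minimal sketch of the AFS tool surface: one namespace, many backends.
# The mount/dispatch mechanics here are illustrative assumptions, not the
# paper's reference implementation.
from typing import Protocol


class Backend(Protocol):
    def list(self, path: str) -> list[str]: ...
    def read(self, path: str) -> str: ...
    def write(self, path: str, data: str) -> None: ...
    def search(self, path: str, query: str) -> list[str]: ...


class AFS:
    """Unified namespace: path prefixes map to whatever backend serves them."""

    def __init__(self) -> None:
        self._mounts: dict[str, Backend] = {}

    def mount(self, prefix: str, backend: Backend) -> None:
        # e.g. mount("/context/memory", vector_store_adapter)
        self._mounts[prefix] = backend

    def _resolve(self, path: str) -> Backend:
        # Longest-prefix match, like a mount table.
        matches = [p for p in self._mounts if path.startswith(p)]
        if not matches:
            raise FileNotFoundError(path)
        return self._mounts[max(matches, key=len)]

    # The agent only ever sees the afs_* calls (afs_exec elided here).
    def afs_list(self, path: str) -> list[str]:
        return self._resolve(path).list(path)

    def afs_read(self, path: str) -> str:
        return self._resolve(path).read(path)

    def afs_write(self, path: str, data: str) -> None:
        self._resolve(path).write(path, data)

    def afs_search(self, path: str, query: str) -> list[str]:
        return self._resolve(path).search(path, query)
```

The agent never knows, or cares, whether /context/memory/ is backed by Postgres or a vector store. It just lists, reads, and searches paths.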
If you read my previous post on Docker’s Dynamic MCP, this is the same instinct elevated to a more general, and powerful, layer. AFS is what Dynamic MCP plus a good memory system might look like as a first-class OS abstraction. Or: what we’d build if we admitted that “just make the context window bigger” isn’t a strategy.
3 memory layers with actual semantics
Most “memory” in current agent systems is either RAG (retrieve relevant chunks and hope for the best) or naive caching (remember the last N messages until you don’t). The paper proposes something more structured:
History. Immutable log of everything. Every interaction, every tool call, every model output. Used for provenance, debugging, compliance. Your source of truth. (Path: /context/history/)
Memory. Indexed, structured views optimized for retrieval. Episodic memory for session summaries. Fact memory for atomic statements. User memory for preferences. Procedural memory for tool definitions. (Paths like /context/memory/agentID)
Scratchpad. Temporary workspace. Agents draft plans, test hypotheses, do intermediate reasoning here. Can be selectively promoted to Memory or archived to History. (Path: /context/pad/taskID)
All transitions between layers get logged with timestamps and lineage metadata. You can trace how context evolved, which is essential once you care about governance. Or once something goes wrong and you need to figure out why.
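As a rough sketch of what one of those transitions could look like in practice, here’s a hypothetical scratchpad-to-Memory promotion that logs lineage to History. The paths follow the paper’s layout; the file format, field names, and the /tmp demo root are my own assumptions.

```python
# Hypothetical sketch: promote a scratchpad note into Memory and record the
# transition in the immutable History log with lineage metadata.
import json
import time
import uuid
from pathlib import Path

ROOT = Path("/tmp/afs-demo")  # stand-in for the agent's context root


def promote_to_memory(task_id: str, agent_id: str, note: str) -> str:
    entry_id = str(uuid.uuid4())
    pad = ROOT / "context" / "pad" / task_id
    mem = ROOT / "context" / "memory" / agent_id
    hist = ROOT / "context" / "history"
    for d in (pad, mem, hist):
        d.mkdir(parents=True, exist_ok=True)

    # Write the promoted fact into Memory...
    (mem / f"{entry_id}.json").write_text(json.dumps({"fact": note}))

    # ...and append the transition to History so it can be traced later.
    lineage = {
        "event": "promote",
        "from": str(pad),
        "to": str(mem / f"{entry_id}.json"),
        "at": time.time(),
    }
    with open(hist / "transitions.log", "a") as log:
        log.write(json.dumps(lineage) + "\n")
    return entry_id
```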
This is exactly what all the “memory” tools cropping up are reaching for: structured memory graphs that let coding agents follow linked tasks over time instead of reinventing context every session. The alternative is your agent starting fresh every time, like a goldfish with a very expensive API bill.
The pipeline that makes context manageable
The paper formalizes three components:
Context Constructor. Selects and prioritizes what enters the token window. Uses metadata to rank relevance. Compresses to fit budget. Outputs a manifest documenting what was included, what was excluded, and why. Finally, a system that can explain why it ignored your carefully written instructions. (I’d kill for just this part right now)
Context Updater. Controls when and how context flows in. Static snapshots for single-turn tasks. Progressive streaming for extended reasoning. Adaptive refresh for dynamic sessions. Keeps the window coherent as the conversation evolves. This is the part most current systems skip entirely, which is why your agent gets progressively more confused over time.
Context Evaluator. Closes the loop. Validates outputs against source context. Flags hallucinations and contradictions. Writes verified results back to Memory with versioning. When confidence is low, triggers human review and stores those corrections as first-class context elements. In other words: catches the model lying and takes notes for later.
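To see why the Constructor’s manifest matters, here’s a toy version of just that step, assuming a precomputed relevance score per item and a crude word-count token estimate. Both are placeholders, not the paper’s method.

```python
# Toy Context Constructor: rank candidate items, pack to a token budget,
# and emit a manifest of what got in and what was left out (and why).
from dataclasses import dataclass


@dataclass
class Item:
    path: str         # where the item lives in the AFS namespace
    text: str
    relevance: float  # assumed to come from metadata or a retriever


def construct_context(items: list[Item], budget: int) -> tuple[str, dict]:
    included, excluded = [], []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        cost = len(item.text.split())  # crude stand-in for a token count
        if used + cost <= budget:
            included.append(item)
            used += cost
        else:
            excluded.append({"path": item.path, "reason": "over token budget"})
    manifest = {
        "included": [i.path for i in included],
        "excluded": excluded,
        "tokens_used": used,
        "budget": budget,
    }
    prompt_context = "\n\n".join(i.text for i in included)
    return prompt_context, manifest
```

The manifest is what turns “the model ignored my instructions” from a mystery into a lookup: either your instruction is in the included list, or the excluded list tells you why not.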
The pipeline exists because GenAI has three architectural constraints that cascade upward:
Token windows are finite and expensive
Models are stateless between sessions
Outputs are probabilistic (same input can yield different outputs, none of them necessarily correct)
Once you internalize these constraints, “prompt engineering” starts to feel like the wrong frame. Despite what every vibe coder and YouTuber says, your success won’t come from crafting clever prompts alone. You’re managing an information lifecycle. The clever prompts were the coping mechanism; you’ve just been in denial about it this whole time.
Humans as co-processes, not just supervisors
Here’s a choice I appreciate: human annotations, overrides, and corrections get stored as explicit context artifacts under /context/human/.
They’re versioned, queryable, reusable. Humans become co-processes in the system. Their judgment enters the same context fabric as everything else. Not as an afterthought. Not as a Slack thread someone screenshots and pastes into a prompt.
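A minimal sketch of recording one of those overrides, assuming a simple versioned-JSON layout under /context/human/ (the schema and the demo root are mine, not the paper’s):

```python
# Sketch: store a human override as a versioned, queryable context artifact.
import json
import time
from pathlib import Path

HUMAN_DIR = Path("/tmp/afs-demo/context/human")  # stand-in root


def record_override(reviewer: str, target_path: str, correction: str) -> Path:
    HUMAN_DIR.mkdir(parents=True, exist_ok=True)
    version = len(list(HUMAN_DIR.glob("override-*.json"))) + 1
    artifact = {
        "reviewer": reviewer,
        "target": target_path,   # the context element being corrected
        "correction": correction,
        "version": version,
        "at": time.time(),
    }
    out = HUMAN_DIR / f"override-{version:04d}.json"
    out.write_text(json.dumps(artifact, indent=2))
    return out
```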
If you’re building for regulated domains (healthcare, finance, legal), this matters enormously. You want human decisions and AI behavior recorded together, not scattered across logs, tickets, and random docs that nobody will find when the auditor asks.
The tools are already converging here
Look at what’s shipping:
Claude Code now has project memory and the beginnings of longer-term memory. Third-party tools like claude-mem extend it further into a persistent teammate instead of a stateless intern who forgets your name every morning.
Cursor and VS Code have a growing ecosystem of memory banks and MCP-based backends (Graphiti, Cline Memory Bank) keeping project context alive across sessions.
Beads provides structured memory and issue graphs for coding agents executing long task chains.
These are all pragmatic answers to the same pain points: stateless models, limited windows, the need for long-lived governed context. The industry is converging on this whether or not anyone’s read the paper.
What’s still missing
The paper is more opinionated on traceability than most current tools:
Every context transition logged as a transaction
Evaluation results, confidence scores, and human overrides stored as auditable metadata
Most agents today still operate as “black box but helpful.” You ask it to do something. It does something. Maybe the right thing. You find out eventually?
Observability is improving, but the paper’s implicit argument is: if you want AI in mission-critical workflows, governed context is at least as important as the best possible prompts. Probably more important. Your clever prompt isn’t going to save you when the model confidently fabricates the piece of information one of those workflows depends on.
Where this leads
Frontier models will keep improving. Dare I say that’s table stakes?
The paper gives you a frame for where non-model innovation happens:
Context infrastructure. Unified namespaces, progressive tool access, agentic file systems.
Memory systems. History, memory, scratchpad with lifecycle semantics, pruning, dedup.
Governance. Traceable pipelines you can replay, audit, correct.
Humans as data. Annotations and overrides as first-class context, not afterthoughts.
Claude Code, Cursor, Beads, Dynamic MCP. We’re seeing early expressions of these ideas. The paper is the architectural blueprint that explains why they’re all converging. Not because anyone coordinated. Because the problems are real and the alternatives are worse.


