Trevin’s Notes

I never tell my agents to use YAGNI

Trevin Chow — Tue, 30 Jun 2026 21:38:32 GMT

YAGNI as a principle isn’t wrong, but agents often take it way too literally… and it might be killing the soul of your products.

In practice, you ask for a small touch that makes something feel alive. A little animation, a confirmation message that’s warm instead of clinical, a shortcut that’ll clearly get used. The agent pushes back: “this isn’t strictly needed.” YAGNI, implied or stated. You end up negotiating for polish that should’ve been a default yes.

YAGNI originated as a warning against implementing speculative capabilities you presume you’ll need, because they raise the cost of future changes. The problem is the complexity you have to carry – not the effort you spent.

YAGNI applied to speculative complexity is a useful constraint. YAGNI applied to the quality of what you’re actually building turns your agent into a specification-follower that strips the soul out of what you’re shipping.

The distinction that matters:

A config knob for a scenario that might never happen – that’s speculative complexity. Everything downstream has to carry it. Cut it.

A warm confirmation message, a small animation, a bit of care in the copy – you’re building that feature anyway. Building it well isn’t scope creep. It’s just building.

A beautiful onboarding experience? Could have real carrying cost. But if onboarding is in scope, building a bad version of it isn’t YAGNI discipline. It’s just choosing to ship something worse than you could’ve. The question isn’t whether it adds complexity – everything does. The question is whether that complexity is in service of your current users or hedging against futures that might never arrive.

The question I actually want agents asking: “is this speculative – for users or scenarios we don’t have yet?” Not “is this strictly required?”

The rule I actually use in the skills I write and in my own AGENTS.md:

When considering the YAGNI principle, apply it to speculative complexity. Not product quality.

Which is why I don’t say “use YAGNI” at all. The temptation to take it literally is too strong, and I’ve seen enough agents refuse things that were obviously worth building to know the framing matters.

10 Principles for Agent-Native CLIs

Trevin Chow — Fri, 01 May 2026 14:30:38 GMT

Last month I wrote 7 Principles for Agent-Friendly CLIs. Since then I’ve been deep in CLI work, watching agents use them, and seeing them break in interesting ways.

Mid-April, Cloudflare published The CLI for all of Cloudflare, describing how they rebuilt Wrangler around a TypeScript schema that generates the CLI, the SDKs, the Terraform provider, and the MCP server from one source. Their Code Mode MCP serves their entire ~3,000-operation API in under 1,000 tokens. They added /cdn-cgi/explorer/api, an OpenAPI-shaped runtime endpoint for agents. And they enforce naming rules across the entire CLI surface: always get, never info; always --force, never --skip-confirmations; always --json. Their framing for why: “manually enforcing consistency through reviews is Swiss cheese.”

Shortly after, HeyGen launched their CLI, and I’ve been using it heavily since. Generating videos through agents, polling jobs, routing artifacts to webhooks. The practical experience is what earned it a spot here. Plenty of companies ship CLIs; this one’s been the most agent-pleasant I’ve used.

The original 7 principles I wrote about were defensive: the things a CLI has to get right, or agents pay for it on every call. Don’t hang on a TTY check, return JSON, make errors actionable, bound the output. That layer is still necessary but not enough.

The next layer is about compounding rather than merely not breaking. The CLI gets more useful the more agents use it, because agents come with persistent identity, asynchronous workflows, output that has to land somewhere, and friction that maintainers should hear about.

The 10 principles below come from my own CLI work (new project coming soon!) alongside what Cloudflare and HeyGen have published. Organized into 2 tiers. Five condense the original 7; five are new.

Tier 1: Table Stakes

Don’t break the agent. Agents are good at figuring things out, but when these aren’t met, the deck is stacked against them. Every gap costs more tokens, more retries, and more failure modes that don’t surface until production.

1. Non-interactive by default

Commands have to run without interactive prompts when an agent invokes them. When a subagent spawns a background process, there’s nothing answering the prompt. The command hangs.

# Hangs forever waiting for a confirmation that will never come
$ mycli post delete post_8f2a < /dev/null
Are you sure you want to delete post_8f2a? [y/N]: ^C

# With --force: bypasses the prompt, agent gets through cleanly
$ mycli post delete post_8f2a --force
{"deleted":"post_8f2a"}

What good looks like: --no-input or equivalent on every command that might prompt; honest TTY detection that treats non-TTY as headless; --yes for confirmation bypass; structured input via flags or files for anything you used to collect through interactive menus. Cloudflare standardizes on --force for destructive bypass and explicitly bans --skip-confirmations. Pick the convention, then enforce it.

Silent hanging on a prompt is a blocker; inconsistent prompt-bypass behavior across subcommands is friction; a comprehensive non-interactive mode the agent can rely on without per-command lookups is the optimization target.
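
A minimal sketch of that decision in Python, assuming a hypothetical confirm_delete helper (the name, parameters, and error wording are illustrative, not taken from any real CLI). The point it shows: with --force the prompt is skipped, and without a TTY the command refuses quickly instead of waiting on input nobody will send.

```python
def confirm_delete(post_id, *, stdin_is_tty, force=False, no_input=False, ask=input):
    """Decide whether a destructive delete may proceed, without ever hanging."""
    if force:
        return True  # explicit bypass: never prompt
    if no_input or not stdin_is_tty:
        # Headless and no --force: fail fast with an actionable message.
        raise SystemExit(
            f"error: refusing to delete {post_id} without confirmation; "
            "re-run with --force"
        )
    # Only reached when a human is plausibly at the terminal.
    answer = ask(f"Are you sure you want to delete {post_id}? [y/N]: ")
    return answer.strip().lower() in ("y", "yes")
```

In a real command you would likely call it as confirm_delete(post_id, stdin_is_tty=sys.stdin.isatty(), force=args.force); the ask parameter exists only so the prompt can be swapped out in tests.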

2. Structured, parseable output

A nicely aligned table with ANSI colors is for humans. An agent extracting a post ID needs JSON.

# Data on stdout, parseable directly with jq
$ mycli post list --json | jq '.posts[0].id'
"post_8f2a"

# Errors go to stderr, exit codes signal failure class
$ mycli post get post_does_not_exist --json
$ echo $?
4
# stderr → "error: post not found: post_does_not_exist"

What good looks like: --json on every data-returning command; exit code 0 for success, non-zero for failure with a stable taxonomy if you can manage it; results to stdout, diagnostics to stderr; ANSI suppressed when output isn’t a terminal. The newer wrinkle, from Cloudflare: pick one flag. Always --json, not --format=json for some commands and --output json for others. Inconsistency at this layer is its own category of brokenness.

No structured output at all is a blocker; coverage gaps where some commands are JSON-capable and others aren’t is friction; uniform --json across the CLI with clean stdout/stderr separation and a documented exit code taxonomy is the optimization target.
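
One way to keep the separation honest is to route every result and every diagnostic through two small helpers. This is a sketch under assumptions: the Exit taxonomy and the helper names are hypothetical, and the value 4 for "not found" simply mirrors the example above rather than any standard.

```python
import json
import sys
from enum import IntEnum


class Exit(IntEnum):
    """Hypothetical exit-code taxonomy; pick your own and document it."""
    OK = 0
    USAGE = 2
    NOT_FOUND = 4


def emit_result(data, *, as_json, out=sys.stdout):
    """Write result data to stdout only: JSON for agents, key/value lines for humans."""
    if as_json:
        out.write(json.dumps(data, separators=(",", ":")) + "\n")
    else:
        for key, value in data.items():
            out.write(f"{key}: {value}\n")
    return Exit.OK


def emit_error(message, code, *, err=sys.stderr):
    """Write diagnostics to stderr only, and hand back the exit code to use."""
    err.write(f"error: {message}\n")
    return code
```

A command's entry point can then end with sys.exit(emit_error("post not found: ...", Exit.NOT_FOUND)), so stdout stays empty on failure and jq never sees a half-written payload.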

3. Errors that teach, and enumerate

The original principle was “fail fast with actionable errors.” That still holds, with one refinement I missed the first time. When the failure is “you passed an invalid value for X,” the error should include the valid set.

# Unhelpful: agent has to read --help, parse, guess, retry
$ mycli post create --json --visibility=secret --content="hi"
error: invalid visibility

# Better: error names the valid set, agent self-corrects in one retry
$ mycli post create --json --visibility=secret --content="hi"
error: --visibility must be one of: public, private, unlisted (got: "secret")

The enumerated message is worth more than the bare “error: invalid visibility.” From the enumerated version, the agent self-corrects in one retry. From the bare one, it has to read the help text, parse it, and guess. HeyGen’s CLI applies this consistently: pass an unknown delivery scheme and you get a structured refusal naming what is supported.

The pattern generalizes. Any time your CLI rejects user input against an enum, an enum-shaped resource list, or a schema, surface the enumeration in the error. Errors are the highest-signal context an agent gets, because they fire exactly when the agent doesn’t know what to do next.

What good looks like: errors validated early, before side effects; correct invocation syntax in the error text; valid values enumerated when an enum is the cause; concrete examples instead of stack traces.

Silent or vague failures are a blocker; errors that name the problem but not the solution are friction; errors that include the valid set and a working invocation are the optimization target.
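
The enumeration pattern is small enough to centralize. A sketch, assuming a hypothetical check_enum helper that every enum-typed flag goes through before any side effects run:

```python
def check_enum(flag, value, allowed):
    """Return None when value is valid, else an error string naming the valid set."""
    if value in allowed:
        return None
    return f'error: {flag} must be one of: {", ".join(allowed)} (got: "{value}")'


def validate(args, enums):
    """Collect every enum error up front, before the command touches anything."""
    errors = []
    for flag, allowed in enums.items():
        if flag in args:
            message = check_enum(flag, args[flag], allowed)
            if message:
                errors.append(message)
    return errors
```

Because the message is built from the same allowed list the parser uses, the error text cannot drift from what the command actually accepts.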

4. Safe retries and explicit mutation boundaries

Agents retry. Humans glance at a duplicate row and notice; agents don’t.

# Idempotent create — second call returns the existing resource, not a duplicate
$ mycli post create --json --content="hello world"
{"id":"post_8f2a","existing":false}
$ mycli post create --json --content="hello world"
{"id":"post_8f2a","existing":true}

# Destructive ops require an explicit flag; --dry-run shows what would happen
$ mycli post delete post_8f2a --dry-run
{"would_delete":"post_8f2a","status":"dry_run"}

What good looks like: idempotency tokens or natural keys for create operations, so a retried create returns the existing resource instead of a duplicate; --dry-run for anything consequential; explicit, non-default flags for destructive operations; identifiers returned in every mutation response so the agent has something to reference on the next call.

The new wrinkle is async, which I’ll come back to in principle 8. Retries on a long-running operation aren’t just about idempotency at submission; they’re about idempotency across the whole submit-poll-collect arc. If the agent’s first invocation submits a job and then loses connection mid-poll, the second invocation needs to find the in-flight job, not start a new one. A persistent job ledger solves this.

Silent duplication or state corruption on retry is a blocker; destructive commands that are scriptable without preview are friction; idempotent mutations, durable job state, and explicit destructive flags are the optimization target.
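
To make the idempotent-create behavior concrete, here is a sketch with an in-memory store standing in for the backend. PostStore, the content-hash natural key, and the ID format are all assumptions for illustration; a real service would enforce the key server-side.

```python
import hashlib


class PostStore:
    """In-memory stand-in for a backend that deduplicates creates by key."""

    def __init__(self):
        self._ids_by_key = {}

    def create(self, content, idempotency_key=None):
        # Prefer a caller-supplied token; otherwise fall back to a natural key
        # derived from the content itself.
        key = idempotency_key or hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key in self._ids_by_key:
            # A retried create returns the existing resource, not a duplicate.
            return {"id": self._ids_by_key[key], "existing": True}
        post_id = f"post_{len(self._ids_by_key) + 1:04d}"
        self._ids_by_key[key] = post_id
        return {"id": post_id, "existing": False}
```

The response always carries the identifier, so the agent has something to reference on the next call whether or not the create was a retry.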

5. Bounded responses, at every layer

Tokens cost money and context. Big outputs are sometimes justified, but the default should be narrow.

# Default page size is bounded; truncation tells the agent how to narrow
$ mycli post list --json
{"posts":[...20 items...],"truncated":true,"hint":"add --limit=N or --filter=author:..."}

# Cursor for explicit continuation
$ mycli post list --json --cursor=abc123
{"posts":[...],"next":null}

The original principle covered runtime output: list returning ten thousand rows, logs dumping forever. Cloudflare added a layer the original missed: the tool description surface itself costs tokens. Their Code Mode MCP serves roughly 3,000 operations in under 1,000 tokens. Most MCP servers I’ve seen burn 1,000 tokens on a single tool’s description.

Both layers matter. A bloated MCP description never gets read by a human, but every agent that loads it pays the toll on every call.

What good looks like: filtering, pagination, and limits on every list-style command; concise vs. detailed modes; truncation messages that teach the agent how to narrow the next query; summary-before-detail responses. For MCP wrappers: a budget per tool description, audited at build time, not “however much explanation felt natural.”

Routine commands dumping unbounded output are a blocker; broad defaults with available narrowing are friction; bounded defaults that guide better queries plus an MCP surface where each tool’s description fits in a tweet are the optimization target.
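
A sketch of a bounded list response, assuming the cursor is just a stringified offset (real cursors are usually opaque) and that the hint wording is whatever your CLI actually supports:

```python
def list_posts(posts, limit=20, cursor=None):
    """Return one bounded page plus enough metadata for the agent to continue."""
    start = int(cursor) if cursor else 0
    page = posts[start:start + limit]
    next_start = start + len(page)
    has_more = next_start < len(posts)
    response = {"posts": page, "next": str(next_start) if has_more else None}
    if has_more:
        # The truncation notice teaches the agent how to narrow or continue.
        response["truncated"] = True
        response["hint"] = f"add --limit=N to narrow, or --cursor={next_start} to continue"
    return response
```

The default limit is the important part: an agent that forgets to pass one still gets a response it can afford to read.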

Tier 2: Compounding

Empower the agent. Tier 1 keeps you in the game. Tier 2 makes the CLI better the more it gets used. These are the principles I didn’t see when I wrote the original and feel obvious now.

6. Cross-CLI vocabulary consistency

This is the principle I’m most certain about, and the one most under-stated in the original.

Agents don’t memorize one CLI at a time. They build a generalized model of what CLIs do, drawn from every CLI they’ve seen. When your tool uses info for what every other tool calls get, the agent doesn’t fail; it succeeds slowly, with extra retries, after burning tokens on --help. Multiply that across thousands of agent invocations per week and the cost is real.

# Conforming to the convention — agents recognize these immediately
$ wrangler kv namespace list --json
$ heygen videos list --json
$ mycli posts list --json

# Off-convention versions an agent has to relearn for each tool
$ mycli posts ls               # use list, not ls
$ mycli posts info abc         # use get, not info
$ mycli post delete abc \
    --skip-confirmations       # use --force, not --skip-*
$ mycli post list \
    --format=json              # use --json, not --format=json

Cloudflare made this explicit. Their schema-layer rules:

Always get, never info
Always list, never ls
Always --force, never --skip-confirmations
Always --json, never --format=json

The framing they used is the right one: “manually enforcing consistency through reviews is Swiss cheese.” Vocabulary consistency has to be enforced mechanically, at the codegen or schema layer, because human review will always let edge cases slip.

The principle generalizes beyond Cloudflare’s specific list. Pick the convention the broader community already uses (Unix --yes for skip-prompt, --limit for pagination, the get/list/create/update/delete verb set), and don’t deviate without strong reason. Where you do have to invent vocabulary because the concept is genuinely new, name it consistently across your own commands and document it once, prominently.

What good looks like: a documented naming policy; a static check in CI that fails on banned verbs and flag aliases; canonical names that match the dominant convention in your language community.

Verbs and flags that contradict universal conventions (info instead of get, --skip-confirmations instead of --force) are a blocker; internal inconsistency between your own subcommands is friction; schema-enforced vocabulary that an agent trained on neighboring CLIs recognizes on first encounter is the optimization target.
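
The static check can be very small. A sketch, assuming the command surface is available as a mapping from command path to flag names; the banned lists mirror Cloudflare's rules as quoted above, while the data shape and function name are hypothetical:

```python
BANNED_VERBS = {"info": "get", "ls": "list"}
BANNED_FLAGS = {"--skip-confirmations": "--force", "--format": "--json"}


def lint_vocabulary(commands):
    """Return one violation message per banned verb or flag in the surface.

    commands maps a command path such as "posts list" to its list of flags.
    """
    problems = []
    for path, flags in commands.items():
        for word in path.split():
            if word in BANNED_VERBS:
                problems.append(f"{path}: use {BANNED_VERBS[word]}, not {word}")
        for flag in flags:
            if flag in BANNED_FLAGS:
                problems.append(f"{path}: use {BANNED_FLAGS[flag]}, not {flag}")
    return problems
```

Wired into CI, a non-empty result fails the build, which moves the rule out of code review and into the pipeline.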

7. Three-layer introspection

The original principle here was “progressive help discovery”: top-level --help lists commands, subcommand --help shows usage. That’s still true, but it’s now the bottom layer of a three-layer stack. Each layer answers a different question.

# Layer 1 — what does this command do? (human-shaped text)
$ mycli --help
mycli  Manage posts and accounts.

USAGE: mycli <command> [flags]

COMMANDS:
  post      Manage posts
  account   Manage accounts
  jobs      Inspect async jobs
  profile   Manage saved configurations
  feedback  Send feedback upstream

# Layer 2 — what’s the shape of everything? (structured, versioned)
$ mycli agent-context | jq '.schema_version, (.commands | keys)'
"1"
["account","feedback","jobs","post","profile"]

$ mycli agent-context | jq '.commands.post.subcommands.create.flags'
{
  "--content":     {"type":"string","required":true},
  "--visibility":  {"type":"enum","values":["public","private","unlisted"]},
  "--json":        {"type":"bool","default":false},
  "--dry-run":     {"type":"bool","default":false}
}
}

# Layer 3 — when would I use this? (long-form skill manifest)
$ cat $(mycli skill-path)/SKILL.md
# Publishing a post end-to-end
1. Save a profile for your default audience.
2. Create the post with --wait so the artifact returns synchronously.
3. Use --deliver=webhook:... to ship it downstream.

--help is necessary because some agents will hit it before anything else, and because a human dropping into the terminal needs it. agent-context is what an introspecting agent should actually consume: versioned, machine-readable JSON describing the full shape. Cloudflare’s /cdn-cgi/explorer/api is the runtime version of this idea; the equivalent for a CLI is a top-level subcommand. The CLIs I’ve been generating ship agent-context with a schema_version field, exactly so the consuming agent can detect breaking shape changes.

The skill manifest is the third layer: long-form prose teaching the agent how to compose operations into useful workflows. HeyGen ships a skills repo of SKILL.md files alongside their CLI, and Cloudflare’s MCP server is the equivalent: a description of the CLI from the perspective of the tasks an agent might use it for, not the commands it exposes.

What good looks like: all three layers present, each versioned, each kept in sync with the implementation by the same generation step.

A CLI with only --help and nothing structured is a blocker; an agent-context that exists but isn’t versioned, or skill manifests that drift from the actual command surface, is friction; three layers, schema-versioned, machine-validated against the real implementation is the optimization target.
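
The "same generation step" idea can be sketched as one command table that renders both the human help text and the agent-context JSON. The COMMANDS structure and function names below are assumptions for illustration, not the format any particular generator uses:

```python
import json

SCHEMA_VERSION = "1"

# Single source of truth for the command surface (hypothetical shape).
COMMANDS = {
    "post": {
        "summary": "Manage posts",
        "subcommands": {
            "create": {
                "flags": {
                    "--content": {"type": "string", "required": True},
                    "--visibility": {
                        "type": "enum",
                        "values": ["public", "private", "unlisted"],
                    },
                }
            }
        },
    },
    "jobs": {"summary": "Inspect async jobs", "subcommands": {}},
}


def agent_context(commands=COMMANDS):
    """Layer 2: versioned, machine-readable description of the full shape."""
    return json.dumps({"schema_version": SCHEMA_VERSION, "commands": commands}, sort_keys=True)


def help_text(commands=COMMANDS):
    """Layer 1: human-shaped text rendered from the same table."""
    lines = ["COMMANDS:"]
    for name in sorted(commands):
        lines.append(f"  {name:<9} {commands[name]['summary']}")
    return "\n".join(lines)
```

Since both outputs read from COMMANDS, adding a subcommand updates --help and agent-context together, and bumping SCHEMA_VERSION is the one place a breaking shape change gets announced.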

8. Async-aware execution

Most CLIs treat async APIs the way the underlying HTTP endpoint does: submit returns a job ID, poll returns a status, that’s the agent’s problem. Two failure modes follow. Either the agent writes its own poll loop (wasting tokens and getting it subtly wrong), or it doesn’t, and the workflow fails because the result wasn’t ready when the next step ran.

The fix is --wait.

# Without --wait: the agent has to write its own polling loop
$ mycli video render --script=story.txt
{"job_id":"job_8f2a","status":"queued"}
$ mycli video status job_8f2a
{"job_id":"job_8f2a","status":"running","progress":0.34}
$ mycli video status job_8f2a
{"job_id":"job_8f2a","status":"running","progress":0.71}
$ mycli video status job_8f2a
{"job_id":"job_8f2a","status":"complete","url":"https://.../out.mp4"}

# With --wait: same workflow, one command, no polling logic
$ mycli video render --script=story.txt --wait
{"job_id":"job_8f2a","status":"complete","url":"https://.../out.mp4"}

# The job ledger survives across invocations
$ mycli jobs list
JOB_ID    STATUS    KIND          STARTED              DURATION
job_8f2a  complete  video.render  2026-04-30T18:22:11  37s
job_7c14  running   video.render  2026-04-30T18:24:02  12s

--wait blocks until completion. Behind it, the CLI runs a poll loop with backoff and writes job state to a local ledger. A jobs command exposes the ledger: jobs list shows in-flight and recent jobs, jobs get retrieves status, jobs prune clears old entries.

This collapses several agent turns into one. Same workflow, fewer tokens, no polling logic the agent has to get right. The job ledger matters for retries (see principle 4): if the agent’s --wait invocation gets killed mid-poll, the next invocation finds the existing job rather than submitting a new one.

What good looks like: --wait on every submitting command that wraps an async API; a polling implementation with exponential backoff and jitter; a persistent job ledger (~/./jobs.jsonl is fine); a jobs parent command exposing list/get/prune.

Async commands that return a job ID and stop, forcing the agent to write a polling loop, are a blocker; --wait that exists but doesn’t survive disconnect, or no way to inspect or recover in-flight jobs, is friction; --wait on every async submission with a durable, recoverable ledger is the optimization target.
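
What sits behind --wait can be sketched in a few lines. Assumptions: get_status is a hypothetical callable wrapping the status endpoint, the ledger is an append-only JSONL file where the latest line per job wins, and the backoff constants are arbitrary.

```python
import json
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=30.0, rng=random.random):
    """Yield exponential backoff delays with full jitter, capped at `cap` seconds."""
    delay = base
    while True:
        yield rng() * min(delay, cap)
        delay *= factor


def wait_for_job(job_id, get_status, ledger_path, *, sleep=time.sleep,
                 max_polls=50, rng=random.random):
    """Poll until the job finishes, recording every observed state in the ledger."""
    delays = backoff_delays(rng=rng)
    for _ in range(max_polls):
        status = get_status(job_id)
        # Append-only ledger: a later invocation can find in-flight jobs here.
        with open(ledger_path, "a", encoding="utf-8") as ledger:
            ledger.write(json.dumps(status) + "\n")
        if status["status"] in ("complete", "failed"):
            return status
        sleep(next(delays))
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

On a retry, the CLI would read the ledger first and resume waiting on any job that is still queued or running instead of submitting a new one; that lookup is omitted here to keep the sketch short.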

9. Persistent identity through profiles

Agents don’t show up once. They show up tomorrow, and the day after, and a week from now, in a different shell, with the same underlying intent and a different specific input. Stateless leaf-shaped CLIs make every invocation re-specify the same eight flags.

The fix is a profile system.

# Save a named bundle of configuration once
$ mycli profile save my-podcast \
    --avatar=lila \
    --voice=warm-en \
    --webhook=https://podcast.example.com/hook
profile saved: my-podcast

# Reuse it on every subsequent invocation
$ mycli video create --profile=my-podcast --script=ep_42.txt
{"job_id":"job_8f2a","using_profile":"my-podcast"}

# Explicit flags win over profile values
$ mycli video create --profile=my-podcast --voice=energetic --script=...
{"job_id":"job_a91","using_profile":"my-podcast","voice":"energetic"}

# Surfaced through introspection so agents discover available identities
$ mycli agent-context | jq '.available_profiles'
["my-podcast","client-demo","weekly-recap"]

The precedence I’d recommend: explicit flag > environment variable > profile > default. Surfacing the available profile names in agent-context matters: it’s how an introspecting agent discovers which identities exist without parsing a config file.

Once an agent has a profile, the per-invocation flag burden drops to the parts that actually vary, the cross-session identity is durable without the agent having to write its own state file, and the human and the agent share the same configuration vocabulary.

What good looks like: profile save / use / list / show / delete subcommands; --profile as a persistent root flag; profile contents shown in agent-context; a stable storage location like ~/./profiles.json.

No way to persist configuration is a blocker; profiles that exist but aren’t discoverable via introspection are friction; named profiles with clean precedence, surfaced through agent-context, are the optimization target.
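
The precedence order (explicit flag, then environment variable, then profile, then default) fits in one function. A sketch, assuming a hypothetical MYCLI_ prefix for environment variables and plain dicts for the flag, profile, and default values:

```python
import os


def resolve(name, flags, profile, defaults, *, env=None, env_prefix="MYCLI_"):
    """Resolve one setting and report which layer supplied it."""
    env = os.environ if env is None else env
    if flags.get(name) is not None:
        return flags[name], "flag"
    env_key = env_prefix + name.upper().replace("-", "_")
    if env_key in env:
        return env[env_key], "env"
    if name in profile:
        return profile[name], "profile"
    return defaults.get(name), "default"
```

Returning the source alongside the value makes it cheap to echo something like "using_profile" in the JSON response, so the agent can see why a value was chosen.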

10. Two-way I/O

The original principle 6 (composable and predictable structure) covered stdin/stdout pipelining. That’s still true. But agents don’t only consume CLIs through pipes, and the CLI doesn’t only emit through stdout. There are two new mechanisms worth adding: a way for the CLI to emit artifacts where the agent actually needs them, and a way for the agent to report friction back.

# --deliver routes the artifact to where it’s actually needed
$ mycli video create --script=story.txt --deliver=stdout
{"video_url":"https://.../out.mp4","duration_s":47}

$ mycli video create --script=story.txt --deliver=file:./out.mp4
{"delivered_to":"file:./out.mp4","bytes":4823091}

$ mycli video create --script=story.txt \
    --deliver=webhook:https://example.com/hook
{"delivered_to":"webhook:https://example.com/hook","status":201}

# Unknown schemes get a structured refusal naming what’s supported
$ mycli video create --script=... --deliver=s3:bucket/key
error: --deliver scheme must be one of: stdout, file:, webhook:

# feedback closes the loop in the other direction
$ mycli feedback "the --tier flag rejects 'enterprise' but the docs list it as valid"
feedback recorded locally (1 entry)

$ mycli feedback list
2026-04-30T18:31:02  the --tier flag rejects 'enterprise' but the docs list it as valid

# Optional upstream POST when configured
$ MYCLI_FEEDBACK_ENDPOINT=https://maintainers.example.com/cli-feedback \
    mycli feedback "race condition in --wait when job completes during the first poll"
feedback recorded locally and sent upstream (status: 200)

--deliver routes the artifact directly: stdout, a file path, or a webhook URL. A video that lands as an MP4 at a known path, or POSTs to a webhook the agent already set up, is one fewer hop than “stdout to a temp file then move.” File sinks write atomically; webhook sinks POST and surface HTTP status; unknown schemes return a structured refusal. HeyGen’s framing for this was “fewer steps between agent output and a finished artifact.”

feedback runs the other way. Agents hit friction constantly: flags rejected for the wrong reason, race conditions in async paths, error messages that don’t enumerate. Most of it never gets reported because there’s no channel: the agent retries, eventually succeeds, the maintainer never learns the call was painful. feedback "..." writes locally by default; with an endpoint configured, the entry POSTs upstream too.

What good looks like: --deliver with stdout/file/webhook sinks and structured refusal on unknown schemes; feedback with local JSONL by default and configurable upstream POST; both surfaced in agent-context so the agent knows whether the upstream channel exists.

Output that is stdout-only with no way to report friction is a blocker; output sinks that exist but aren’t atomic, or feedback that exists but the upstream channel isn’t discoverable, is friction; structured delivery and discoverable feedback, both versioned in introspection, are the optimization target.
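
A sketch of the --deliver parsing and the atomic file sink. The function names are hypothetical, the webhook sink is left out, and "atomic" here means the usual write-to-temp-then-rename approach in the destination directory:

```python
import os
import tempfile

SCHEMES = ("stdout", "file:", "webhook:")


def parse_deliver(spec):
    """Split a --deliver value into (sink, target), refusing unknown schemes."""
    if spec == "stdout":
        return ("stdout", None)
    for scheme in ("file:", "webhook:"):
        if spec.startswith(scheme) and len(spec) > len(scheme):
            return (scheme.rstrip(":"), spec[len(scheme):])
    raise ValueError("error: --deliver scheme must be one of: " + ", ".join(SCHEMES))


def write_atomic(path, data):
    """Write bytes to a temp file beside the target, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)  # readers see the old file or the new one, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return {"delivered_to": f"file:{path}", "bytes": len(data)}
```

The refusal message is built from SCHEMES, so adding a sink later updates the error text in the same change.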

A note on the architecture beneath these

Most of Tier 2 is hard to apply by hand and easy to apply mechanically. Cross-CLI vocabulary, three-layer introspection, async detection, profile precedence, delivery routing: every one of them is the kind of thing you’d be inconsistent about across a dozen subcommands if you wrote them by hand, and trivially consistent about if a schema or codegen pipeline writes them.

That’s why Cloudflare’s TypeScript schema is the load-bearing detail of their post, not a side note. Generating the CLI, the SDKs, the Terraform provider, and the MCP server from one source is what makes ten principles hold across thousands of operations without drift. I’ve been applying the same approach in projects I’m working on right now. Everything in Tier 2 lands on each generated CLI for free, because a template wrote it, not a human.

If you’re maintaining a hand-written CLI of any size, the consistency bar will keep rising, and the only way to keep up is to move enforcement out of code review and into the schema or the build.

Design for agents first

Every principle here makes the CLI better for humans too. None of these are concessions to agents. They’re good CLI design we used to be able to skip because humans worked around the gaps.

There’s a deeper assumption underneath all of it. The classic Command Line Interface Guidelines treat a human at a terminal as the primary user, with agents as a tolerated secondary audience. That’s no longer the right default. Cloudflare puts it directly in their post: “Increasingly, agents are the primary customer of our APIs.” Their whole schema approach is built around that. HeyGen launched their CLI with “agent” in the marketing copy. Design for agents first, and humans benefit. Designing for humans first and bolting on agent support is what produces the inconsistent, prompt-prone, stdout-only CLIs the first five principles exist to correct.

These 10 are what I’m currently designing against. They’ll keep evolving — I had to replace the original seven a few weeks after publishing them, and the same thing will probably happen here.


This framework comes from three places: the CLIs I’ve been building and generating in the last several weeks, Cloudflare’s The CLI for all of Cloudflare (2026-04-13), and HeyGen’s CLI launch and accompanying skills repo. The original seven principles are still online; treat this post as their replacement, not a sequel.

7 Principles for Agent-Friendly CLIs

Trevin Chow — Thu, 26 Mar 2026 19:28:37 GMT

⭐️ Note: This article has been superseded by an updated version: 10 Principles for Agent-Native CLIs.

I’ve been building a couple of CLIs as side projects recently, and I wanted them to be agent-first from the start. Not “an agent could probably use this” but optimized for how agents work. So I went looking for guidance.

What I found was exactly what you’d expect: great CLI design resources that are entirely about humans. The Command Line Interface Guidelines project is excellent. So is CLI-Anything. But they’re written for a world where the consumer is a person at a terminal who can read a prompt, make a judgment call, and visually parse a nicely formatted table. Agents can handle some of that when they’re running interactively, but the moment an agent or skill spawns a background subagent, interactive CLIs break. There’s no way to surface an interactive prompt back up through that chain, and both Claude Code and Codex will bail. Colored output wastes tokens. Unbounded responses eat context windows. The assumptions baked into human-first CLI design create real failure modes for agent consumers.

Anthropic published solid guidance on writing tools for agents, but it’s tool-design guidance broadly, not CLI-specific. What was missing was a practical rubric for evaluating whether a CLI works well for agents, not whether it technically works at all.

So I wrote one. I synthesized what I found across those sources with my own experience running agents against CLIs, and landed on seven principles that cover the gap between “this works” and “this works well.” I also built a CLI Agent Readiness Reviewer for the Compound Engineering plugin that evaluates CLI source code against these principles automatically, but more on that later.

This post walks through all seven. The examples use a fictional blog-cli to keep things concrete without getting tangled in any particular tool’s specifics.

Why CLI over MCP?

Quick context for anyone wondering why CLIs matter when MCP exists. CLIs are text in, text out, composable by design. LLMs already know common CLI tools from training data, so there’s zero schema overhead. An MCP server can burn tens of thousands of tokens just loading tool definitions before a single question gets asked. MCP earns its complexity when you need per-user auth and structured governance, but for the tools developers build and use day-to-day, a well-designed CLI is faster, cheaper, and more reliable.

CLIs still trip agents up in predictable ways. The principles below are where those problems live.

A severity rubric, not a scorecard

Before getting into the principles, a note on how to use them. This isn’t pass/fail. Each finding maps to one of three severity levels:

Blocker means the issue prevents reliable agent use. The command hangs, requires human intervention, or produces output an agent can’t recover from. Friction means agents can use it, but inefficiently: more retries, wasted tokens, brittle parsing, extra tool calls. Optimization means the CLI works fine but could be faster, cheaper, or more reliable for agent consumers.

The severity also depends on command type. Idempotence is high-value for mutating commands but irrelevant for a streaming log tail. Structured output is a blocker for read/query commands but less critical for a one-off bootstrap wizard. Evaluate by what the command does, not by applying every principle uniformly.

1. Non-interactive by default

The principle: Any command an agent might automate should run without prompts. Interactive mode can still exist for humans, but it should be a convenience layer, not the only path.

This is the most common blocker I’ve hit. When a skill spawns a subagent that shells out to a CLI, there’s no way to surface an interactive prompt back to the user. The command just hangs, waiting for input that will never come. Even in interactive agent sessions, prompts create friction: extra round-trips, ambiguous menu navigation, wasted tokens. If stdin isn’t a TTY, the command should not prompt. Period.

What good looks like:

# Human at a terminal (TTY detected) — prompts fill in missing inputs
$ blog-cli publish
? Status? (use arrow keys)
    draft
  > published
    scheduled
? Path to content: my-post.md
Published "My Post" to personal

# Agent or script (no TTY, or --no-input) — flags only, no prompts
$ blog-cli publish --content my-post.md --yes
Published "My Post" to personal (post_id: post_8k3m)

The fix: support --no-input or --non-interactive, detect TTY vs non-TTY and suppress prompts when stdin isn’t interactive, accept --yes / --force for confirmation bypass, and take structured input via flags, files, or stdin.

If you want to verify this works, the test is simple: detach stdin and enforce a timeout.

python3 - <<'PY'
import subprocess, sys

cmd = ["blog-cli", "publish", "--content", "my-post.md"]
try:
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        timeout=10,
    )
    print("exit:", result.returncode)
    print("PASS: command exited without hanging")
except subprocess.TimeoutExpired:
    print("FAIL: command hung waiting for input")
    sys.exit(1)
PY

A command that hangs waiting for input is a Blocker. A command where some prompts can be bypassed but behavior is inconsistent across subcommands is Friction. Full flag coverage with a global non-interactive mode is the Optimization target.

2. Structured, parseable output

The principle: Commands that return data should expose a stable machine-readable representation.

Agents need data contracts, not presentation formatting. A nicely aligned table with ANSI colors is great for humans and useless for an agent trying to extract a post ID. If the only output is prose or decorated tables, the agent has to scrape its own tooling, which is brittle and wasteful.

What good looks like:

# Human-readable
$ blog-cli publish --content my-post.md
Published "My Post" to personal
URL: https://personal.blog.dev/my-post
Post ID: post_8k3m

# Machine-readable
$ blog-cli publish --content my-post.md --json
{"title":"My Post","url":"https://personal.blog.dev/my-post","post_id":"post_8k3m","status":"published"}

What to implement: support --json on data-bearing commands, use exit code 0 for success and non-zero for failure, write result data to stdout and diagnostics to stderr, return useful fields (names, URLs, IDs, status), and suppress color, spinners, and decorative output when not attached to a TTY.

That last point is easy to miss. Plenty of CLIs detect TTY correctly for prompts but still blast ANSI escape codes into piped output. An agent parsing \x1b[32m✓ Published\x1b[0m is burning tokens on noise.
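One way to keep those concerns separated is a single output function that every command goes through. This is a rough Python sketch under my own assumptions: `emit_result` and `use_color` are hypothetical helpers, and the field names simply mirror the JSON example above.

```python
import json
import sys


def use_color(stream) -> bool:
    """Decorate output only when the stream is an interactive terminal."""
    return hasattr(stream, "isatty") and stream.isatty()


def emit_result(result: dict, as_json: bool, warnings=(), out=None, err=None) -> None:
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    # Diagnostics go to stderr so stdout stays parseable
    for warning in warnings:
        print(f"warning: {warning}", file=err)
    if as_json:
        # One compact JSON object on stdout, nothing else
        print(json.dumps(result, separators=(",", ":")), file=out)
        return
    # ANSI codes are added only for a TTY, never for pipes or files
    check = "\x1b[32m✓\x1b[0m " if use_color(out) else ""
    print(f'{check}Published "{result["title"]}"', file=out)
    print(f'URL: {result["url"]}', file=out)
    print(f'Post ID: {result["post_id"]}', file=out)
```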

No structured output at all is a Blocker. Inconsistent coverage or mixed stdout/stderr is Friction. Full --json on all data-bearing commands with clean separation is the Optimization target.

3. Fail fast with actionable errors

The principle: When a command fails, the error should teach the agent how to succeed on the next attempt.

This is where most CLIs are weakest for agents. Humans can infer what went wrong from a vague error message. Agents can’t. “Error: missing required arguments” tells an agent almost nothing. “Error: --content is required” tells it exactly what to fix.

What good looks like:

# Bad
$ blog-cli publish
Error: missing required arguments

# Better
$ blog-cli publish
Error: --content is required.
Usage: blog-cli publish --content <file> [--status <status>]
Available statuses: draft, published, scheduled
Example: blog-cli publish --content my-post.md

The good error does four things: names the specific problem, shows the correct invocation shape, suggests valid values, and includes an example. An agent that gets this error can self-correct in one retry. An agent that gets “missing required arguments” has to guess, which means extra tool calls, wasted tokens, and a chance of getting it wrong again.

The implementation side: validate early before side effects, include the correct syntax in error output, suggest valid values when validation fails, and prefer actionable text over raw tracebacks.
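As a sketch of what that validation might look like in Python: `validate_publish_args` is a hypothetical function, and the `<file>` and `<status>` placeholders in the usage line are my own naming. It runs before any side effects and returns the whole correction path in one message.

```python
VALID_STATUSES = ("draft", "published", "scheduled")
USAGE = "Usage: blog-cli publish --content <file> [--status <status>]"


def validate_publish_args(content, status="published"):
    """Return an actionable error message, or None when the arguments are valid."""
    if not content:
        problem = "--content is required."
    elif status not in VALID_STATUSES:
        problem = f"--status got {status!r}, which is not a valid status."
    else:
        return None
    # Problem, invocation shape, valid values, example: all in one response
    return "\n".join([
        f"Error: {problem}",
        USAGE,
        f"Available statuses: {', '.join(VALID_STATUSES)}",
        "Example: blog-cli publish --content my-post.md",
    ])
```

A caller would print the message to stderr and exit non-zero, so the agent gets both the signal and the fix.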

Vague or silent failures are a Blocker. Errors that name the problem but not the fix are Friction. Errors with the full correction path are the Optimization target.

4. Safe retries and explicit mutation boundaries

The principle: Agents retry, resume, and sometimes replay commands. Mutating commands should make that safe when possible, and dangerous mutations should be explicit.

This matters more for agents than for humans because agents are more likely to retry automatically. A human who runs a command twice will notice the duplicate. An agent operating in a retry loop won’t, unless the CLI tells it what happened.

What good looks like:

# Repeating the same command doesn't create duplicate work
$ blog-cli publish --content my-post.md
Published "My Post" to personal (post_id: post_8k3m)

$ blog-cli publish --content my-post.md
Already published "My Post" to personal, no changes (post_id: post_8k3m)

# Dangerous mutation is explicit
$ blog-cli posts delete --slug my-post --confirm

The goal isn’t strict idempotence everywhere. For create/update/deploy commands, making duplicate application a no-op or clearly detectable is high-value. For append/send/trigger commands, exact idempotence may be impossible, but the CLI should at least make mutation boundaries explicit and return identifiers that let an agent determine whether it repeated work.

Provide --dry-run for consequential mutations, use explicit destructive flags for dangerous operations, and return enough state in success output to verify what happened.
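Here is one way that could look, as a minimal Python sketch rather than how any real blog backend works. The `publish` function, the in-memory `store` dict, and the hash-derived post ID are all assumptions for illustration. The idea is to compare a content hash before writing, and to report what happened either way.

```python
import hashlib


def publish(store: dict, slug: str, body: str, dry_run: bool = False) -> dict:
    """Publish `body` under `slug`; repeating the same call changes nothing.

    `store` stands in for the blog backend (slug -> record).
    """
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    existing = store.get(slug)
    if existing is None:
        action = "created"
    elif existing["digest"] == digest:
        action = "unchanged"
    else:
        action = "updated"
    # Keep the original ID on updates so retries stay traceable
    post_id = existing["post_id"] if existing else f"post_{digest[:4]}"
    if not dry_run and action != "unchanged":
        store[slug] = {"post_id": post_id, "digest": digest}
    # Return enough state for the caller to verify what happened
    return {"post_id": post_id, "action": action, "dry_run": dry_run}
```

An agent that sees `"action": "unchanged"` on a retry knows it repeated work without causing damage, which is the whole point.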

Retrying a mutating command that silently duplicates or corrupts state is a Blocker. Scriptable destructive commands with little preview or state feedback are Friction. Safe retries with explicit danger flags and audit-friendly identifiers are the Optimization target.

5. Progressive help discovery

The principle: Agents don’t read a CLI’s full documentation up front. They probe top-level help, then subcommand help, then examples. Help should support that incremental workflow.

Think about how an agent explores a CLI. It starts with --help at the top level to understand the command surface. Then it drills into a specific subcommand. It needs to go from “what can this tool do” to “how do I invoke this specific command” in two calls, not five.

What good looks like:

$ blog-cli --help
Usage: blog-cli <command>

Commands:
  publish     Publish content
  posts       List and manage posts

$ blog-cli publish --help
Publish a markdown file to your blog.

Options:
  --content   Path to markdown file
  --status    Post status (draft, published, scheduled; default: published)
  --yes       Skip confirmation prompt
  --json      Output as JSON
  --dry-run   Preview without publishing

Examples:
  blog-cli publish --content my-post.md
  blog-cli publish --content my-post.md --status draft
  blog-cli publish --content my-post.md --dry-run

Each subcommand’s help should include four things: a one-line purpose, a concrete invocation pattern, required arguments or flags, and the most important modifiers or safety flags. If any of those are missing, an agent has to guess or make extra calls to figure out the invocation shape.

Examples matter more than you’d think. Anthropic’s own tool-design guidance shows that concrete examples improve how well agents use tools. A help page with no examples forces the agent to synthesize the invocation from the flag descriptions, which works but burns tokens and invites mistakes.
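If you're working in Python's argparse, examples can live in the subcommand's `epilog`. This is a rough sketch of the publish help from above, assuming argparse as the framework. `build_parser` is a hypothetical name, and the rendered layout will differ slightly from the hand-written help shown earlier.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="blog-cli")
    sub = parser.add_subparsers(dest="command")
    publish = sub.add_parser(
        "publish",
        help="Publish content",
        description="Publish a markdown file to your blog.",
        epilog=(
            "Examples:\n"
            "  blog-cli publish --content my-post.md\n"
            "  blog-cli publish --content my-post.md --status draft\n"
            "  blog-cli publish --content my-post.md --dry-run\n"
        ),
        # Keep the epilog's line breaks instead of re-wrapping them
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    publish.add_argument("--content", required=True, help="Path to markdown file")
    publish.add_argument(
        "--status",
        choices=["draft", "published", "scheduled"],
        default="published",
        help="Post status (default: %(default)s)",
    )
    publish.add_argument("--yes", action="store_true", help="Skip confirmation prompt")
    publish.add_argument("--json", action="store_true", help="Output as JSON")
    publish.add_argument("--dry-run", action="store_true", help="Preview without publishing")
    return parser
```

With this in place, `blog-cli publish --help` ends with copy-pasteable invocations rather than leaving the agent to assemble one from flag descriptions.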

Hard-to-discover subcommands or missing --help is a Blocker. Help that exists but omits invocation patterns or required arguments is Friction. Layered, example-driven help with links to deeper docs is the Optimization target.

6. Composable and predictable structure

The principle: Agents solve tasks by chaining commands. They benefit from CLIs that accept stdin, produce clean stdout, and use predictable naming and subcommand patterns.

Agents are natural pipers. They chain the output of one command into the input of another, same as any shell script. But they’re less tolerant of inconsistency than humans, because they’re pattern-matching on structure rather than reading the docs and making judgment calls.

What good looks like:

cat posts.json | blog-cli posts import --stdin
blog-cli posts list --json | blog-cli posts validate --stdin
blog-cli posts list --status draft --limit 5 --json | jq -r '.[].title'

What to implement: accept input via flags, files, or stdin where it helps automation, support - as a stdin/stdout alias when file paths are involved, keep command structures consistent across related resources, and prefer flags for ambiguous multi-field operations while reserving positional arguments for familiar conventional cases.
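The `-` alias is a small amount of code. A minimal Python sketch, with `read_input` and `titles` as hypothetical helpers; the second one just mirrors what the `jq` example above extracts.

```python
import json
import sys


def read_input(path: str, stdin=None) -> str:
    """Read from a file path, or from stdin when the path is '-'."""
    if path == "-":
        stream = sys.stdin if stdin is None else stdin
        return stream.read()
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def titles(raw_json: str) -> list:
    """Pull titles out of a JSON array of posts, like jq -r '.[].title'."""
    return [post["title"] for post in json.loads(raw_json)]
```

Because both paths return the same string, everything downstream is identical whether the input came from a file or a pipe.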

Consistency across subcommands is the subtle one. If blog-cli posts list supports --json but blog-cli posts stats doesn’t, the agent has to learn the exception rather than applying a pattern. If blog-cli posts list uses --limit but blog-cli comments list uses --max-results, the agent has to remember an arbitrary naming difference instead of reusing what it already knows.

Commands that can’t participate in pipelines are a Blocker. Inconsistent naming and structure across subcommands are Friction. Regular patterns with clean stdin/stdout are the Optimization target.

7. Bounded, high-signal responses

The principle: Agents pay a real cost for every extra line of output. Large outputs are sometimes justified, but the CLI should make narrow, relevant responses the default.

This is the one that most CLI authors don’t think about because it’s invisible to human users. A human scrolls past 500 lines of output and visually finds what they need. An agent consumes all 500 lines into its context window, paying for every token, and then has to figure out which lines mattered.

What good looks like:

# Broad but bounded
$ blog-cli posts list --limit 25
Showing 25 of 312 posts
To narrow results: blog-cli posts list --status published --since 7d --limit 10

# More precise
$ blog-cli posts list --tag javascript --status published --since 30d --limit 10 --json

The important design move here is that when the CLI truncates, it teaches the agent how to narrow the query. “Showing 25 of 312 posts” plus a suggested narrowing command gives the agent a next step. Dumping all 312 posts gives it a parsing problem.

Support filtering, pagination, and limits on large result sets. Provide concise vs detailed response modes. When truncating, explain how to narrow or page. Return summaries and identifiers before raw detail.
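The truncate-and-teach pattern is simple to sketch. In this Python example `list_posts` is a hypothetical function, and the suggested command is copied from the output above rather than computed from the query.

```python
def list_posts(posts: list, limit: int = 25):
    """Return (shown, footer): at most `limit` posts, plus a hint when truncated."""
    shown = posts[:limit]
    footer = []
    if len(posts) > len(shown):
        # Say what was cut and how to ask a narrower question
        footer.append(f"Showing {len(shown)} of {len(posts)} posts")
        footer.append(
            "To narrow results: blog-cli posts list "
            "--status published --since 7d --limit 10"
        )
    return shown, footer
```

When the result set already fits under the limit, the footer stays empty, so small queries carry no extra lines.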

A routine query that dumps huge output with no narrowing controls is a Blocker. Narrowing that exists but has too-broad defaults is Friction. Bounded defaults that teach the next better query are the Optimization target.

Making it automatic

After writing these principles, I wanted a way to apply them without manually reviewing every CLI I work on. So I built a CLI Agent Readiness Reviewer as part of the Compound Engineering plugin. It’s a review agent that reads your CLI source code, plans, or specs and evaluates them against all seven principles using the severity rubric.

It’s framework-aware, so it gives idiomatic recommendations whether you’re working in Click, argparse, Cobra, clap, Commander, yargs, oclif, or Thor. It distinguishes what matters by command type, so it won’t flag a streaming command for missing idempotence. And it generates per-finding test assertions, so you can enforce agent-friendliness in CI if you want to go that far.

The principles guide is bundled as a standalone reference doc in the same PR if you’d rather just read the rubric and apply it yourself. But if you’re already using Compound Engineering, the reviewer agent makes it hands-free.

Humans ❤️ Agents

Every principle here makes a CLI better for humans too. Structured output, actionable errors, bounded responses, non-interactive automation paths: these aren’t concessions to agents at the expense of human experience. They’re good CLI design that we’ve been inconsistently applying because humans are forgiving enough to work around the gaps.

Agents have gotten more forgiving with every model release. They can often infer what a vague error meant, guess at the right flag name, parse messy output well enough to extract what they need. But “well enough” costs tokens, burns retries, and introduces failure modes that don’t need to exist. Designing for agents as a first-class consumer removes that tax, and the CLI ends up better for humans in the process.


Design exploration with AI agents

Trevin Chow — Thu, 05 Mar 2026 04:19:12 GMT

Somewhere in the shift to building with agents, the friction moved. Implementation got faster. The decisions before implementation — what to build, which direction to take, what it should look like — didn’t. When you can build fast, you get to the wrong direction fast too. Rebuilding the code is cheap. What isn’t cheap: the plans built around it, the user experiences already shipped, the competitive time spent while you corrected course.

Product discovery, design thinking, prototyping: all of it developed because building was the bottleneck. Now the bottleneck has moved. The upstream thinking has to be better, not because rebuilding is hard, but because what gets built around the wrong direction can’t simply be rewritten.

The phase that absorbs that pressure most directly is design exploration. It’s where I’ve been spending most of my thinking lately.

Design thinking exists because humans converge too early

Design thinking, in its more rigorous forms, is a human-centered problem-solving orientation. It starts with people: what do they need, what problems are they experiencing, what assumptions are we carrying that might be wrong? This inverts the usual approach, which starts with what’s technically possible and looks for a problem that fits.

IDEO’s model, the Design Council’s Double Diamond, and Stanford d.school’s model share a common structural rhythm: expand thinking before narrowing it. Discover broadly before defining the problem. Generate options before committing to one. The process isn’t linear; teams loop back, revisit, run phases at once. What the structure is compensating for is a specific tendency: humans grab the first plausible direction and commit. Designers call this premature convergence. You pick the path that feels viable, start building, and only later discover there were better answers you never considered.

These frameworks build in structured divergence before convergence happens: research, ideation, prototyping. Convergence, when it comes, requires human judgment. There’s no formula for when to stop expanding and start narrowing; that’s a call made by people with context, experience, and stakes in the outcome.

What they can’t change is the time. Running divergent exploration takes days or weeks. Each additional direction worth evaluating costs real attention from real people. So exploration gets cut short, not because you don’t want to explore, but because each alternative has a price.

The convergence problem with AI

If you’ve tried using AI for design exploration, you’ve probably noticed that it gives you variations on a theme. Ask for six approaches and you get a default direction with six adjustments: the visual treatment shifts, the color palette changes, but the interaction model stays the same.

This isn’t a model limitation. It’s a sequencing problem. When an AI generates options one after another, each generation borrows from what came before. Option B has read Option A. Option C has read A and B. By Option D, you’re iterating on a trajectory, not exploring a space. You get convergence dressed as variety.

One thing I’ve been playing with is to impose deliberate isolation: parallel agents working from independent briefs, with no knowledge of what the others are doing. Not one model generating sequentially, but separate agents pursuing separate directions. What comes back isn’t variations on one idea — it’s independent answers to the same design question.

Creative divergence doesn’t happen automatically with AI, any more than it does with humans; it has to be structured in.

What current models can actually do here

The assumption that AI is bad at design is outdated. The latest frontier models, prompted well and given the right task, perform well with UX and visual design. Where AI falls short is final design judgment: choosing what’s right for a specific user in a specific context, with the taste and experience that judgment requires. In the exploration phase, though, being right isn’t the goal. Interesting, distinct, and functional enough to evaluate are the goal. Current models can meet that bar in ways they couldn’t even 6 months ago — Opus 4.6 and Codex are pretty damn good at it.

Building an interactive prototype has gotten easier. Everything around it — hosting it, sharing it, ensuring it holds up when someone actually sits down with it — hasn’t. That overhead makes early exploration impractical: you end up prototyping the direction you’ve already committed to, not the five you’re still deciding between.

You steer it, it steers you back

Before I see the options, I have intuitions about what a design should do. After working through an exploration, those intuitions look different. Some were wrong in ways I wouldn’t have caught without seeing the alternatives. Some were right, but for reasons I understand better now. A few I didn’t expect at all.

Design exploration doesn’t just show you more options. It changes the question you were asking. Seeing six independent approaches to the same component is different from imagining them. You’re evaluating how something feels to interact with, not how it looks in a static comp. That produces different feedback, and it surfaces different problems.

Design exploration is an input, not an output. Paired with a PRD or a brainstorm, it becomes something more useful than either alone. The brief shapes what you ask agents to explore. What they surface reshapes the brief. The value isn’t in the files; it’s in what the exercise does to your thinking before you commit.

A skill for design exploration

I’ve been building a Claude Code and Codex plugin — iterative-engineering — as a place to work these ideas out. The design exploration skill is where the parallel divergence approach takes concrete form. The plugin covers the full engineering lifecycle: brainstorming, research, design exploration, tech planning, implementation, review. Each skill works standalone; you don’t need the whole pipeline to run design exploration.

The skill implements parallel isolation directly. You describe a problem — a component, a page, an MVP — and it runs multiple agents simultaneously, each working from its own brief with no visibility into what the others are producing. Each agent builds a complete, functional HTML prototype. The whole cycle, from text description to interactive gallery, takes one conversation turn. They all come back — 6–8 variations — and you have several things that don’t look like each other.

This isn’t only useful for designers. Engineers use it to understand UX implications before committing to an implementation. PMs use it to ground a requirements discussion in something concrete rather than a text description that could map to a dozen different interaction models.

Under the hood: one agent per variation, working in full isolation. The orchestrator never reads variation output — an assembly script combines the files into the final gallery HTML. The separation serves two purposes: context protection (six full variations at once would overflow the orchestrator’s window) and genuine creative independence (agents can’t unconsciously borrow patterns from each other).
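The shape of that separation can be sketched with asyncio. This is an illustration under my own assumptions, not the plugin's code: `run_variation`, `explore`, and `assemble_gallery` are hypothetical names, and the agent call is replaced by a placeholder that writes a stub HTML fragment.

```python
import asyncio
from pathlib import Path


async def run_variation(index: int, brief: str, out_dir: Path) -> Path:
    """Stand-in for one sub-agent: sees only its own brief, writes its own file."""
    await asyncio.sleep(0)  # placeholder for the real agent call
    path = out_dir / f"variation-{index}.html"
    path.write_text(f"<section><h2>{brief}</h2></section>", encoding="utf-8")
    return path  # the orchestrator gets a path back, never the contents


async def explore(briefs: list, out_dir: Path) -> list:
    """Orchestrator: launch every variation at once and collect only file paths."""
    tasks = [run_variation(i, brief, out_dir) for i, brief in enumerate(briefs, start=1)]
    return list(await asyncio.gather(*tasks))


def assemble_gallery(paths: list) -> str:
    """Assembly step: the only code that reads variation output."""
    body = "\n".join(path.read_text(encoding="utf-8") for path in paths)
    return f"<!doctype html>\n<html><body>\n{body}\n</body></html>"
```

Running `asyncio.run(explore(briefs, out_dir))` and passing the returned paths to `assemble_gallery` gives the gallery without the orchestrator ever holding a variation in memory.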

By default the skill explores interaction divergence — different ways the thing works, not different color schemes. Variations share a clean professional treatment; what diverges is the underlying interaction model. You can shift to visual divergence for brand or landing page work, but for most component and feature exploration, seeing how something works is more useful than seeing how it looks.

Each variation comes with 4–8 built-in design controls — sliders, dropdowns, toggles — that let you explore decisions within a single approach without generating a whole new variant. The strategy doc draws a useful line here: every control has to produce a visible difference. Toggling from compact to spacious density reshapes the whole layout — that’s a design decision. Nudging shadow opacity from 6% to 8% is parameter tweaking, not exploration.

The single-file format and rating approach came from my friend Kalid, who runs Better Explained and had been exploring this same idea for a while. I extended his original idea and took on figuring out how to make it work as a repeatable skill: orchestrating parallel sub-agents, reliably assembling their output into a coherent artifact each time, and wiring it into a workflow that could stand alone or sit inside a larger pipeline. Spoiler: it was much harder than I expected 😀

Here’s what one output looks like for a project I’m working on where I wanted to explore different global navigations on desktop and mobile:

The skill features a few things that I’ve found very useful:

Rate and annotate. Each variation has a built-in rating interface — 1–5 stars, optional text notes. Flip between approaches, rate what landed, skip what didn’t, note what you’d change.
Iterate with a paste. When you’re done rating, the skill produces a structured feedback block. Paste it back into Claude Code or Codex and it triggers another round — same problem, refined against your feedback. In practice I’ve typically gone 2–5 rounds before I’ve seen enough.
Converge to a doc. When you settle on a direction, one more paste produces something different: a design exploration document. It records what was chosen, what was explored, and why the alternatives didn’t make the cut — which prevents the same approaches from resurfacing in every future conversation.
One file, every time. Each exploration is a single self-contained HTML file. Share it as an attachment, commit it to the repo, open it anywhere. No server, no deployment.

The agents expand what you’re choosing from. What you choose, and why, is still entirely yours. Still evolving — try it, play with it, and let me know what you find.

Iterative Engineering Plugin

Your Team Is Your Flow State

Trevin Chow — Sat, 21 Feb 2026 16:01:27 GMT

The best work doesn’t feel like work. The hours disappear, the thinking feels easy, and you’re not grinding through a problem, you’re just inside it. When you finally look up, you can’t quite believe what got built.

Most people who care about their work have had this at least once, and most of them can’t explain why it happened when it did, or why it’s so hard to reproduce on purpose.

Mihaly Csikszentmihalyi called it flow: total absorption in an activity, where the experience itself becomes so rewarding that the work continues for its own sake. I read his work years ago and the concept stayed with me, not the research framework, but the feeling he was describing. What his framework doesn’t fully account for is the role of the people around you.

The framework is right, but it’s incomplete

Csikszentmihalyi was studying individuals. The conditions he identified for flow are almost entirely about the work itself: the right challenge-to-skill ratio, clear goals, immediate feedback. His framework is right, but it treats flow as something a person either achieves or doesn’t, based on their own internal state and the nature of the task.

What I’ve come to believe, from watching teams over years, is that the environment is most of the conditions, and the environment is mostly made of people.

The conditions are mostly other people

Among the states Csikszentmihalyi identified as essential to flow: no worry of failure, and distractions excluded from consciousness.

These sound like individual mental disciplines, but they’re not, or at least not entirely. You can’t decide to stop worrying about failure, but you can be in an environment where failure is handled safely, where a mistake is treated as information rather than evidence of inadequacy. You can’t will away interpersonal distraction, but you can be on a team where trust is high enough that you’re not spending mental energy reading the room or second-guessing how something you said landed.

Your team isn’t separate from your capacity for great work; it’s a direct input to it.

The highest-performing teams have one thing in common

Google spent two years studying what actually predicted team performance. Their Project Aristotle research looked at hundreds of teams and found that the answer wasn’t individual talent, role clarity, or average IQ. It was psychological safety, defined as the belief that you won’t be punished when you make a mistake. That finding is usually discussed as a management insight, but it’s also a flow insight. The conditions psychological safety creates (reduced fear of failure, lower self-consciousness, freedom from social monitoring) are exactly the conditions that make flow accessible. A great team doesn’t put you directly into flow; it removes the friction that keeps you out of it.

Who you hire determines whether flow is even possible

Across every team I’ve built over the years, including at Big Cartel, I’ve watched this play out in both directions. The hires that changed the conditions for the better are easy to feel: something shifts, people seem lit up. The ones that didn’t work out taught me more. A person can be genuinely talented, even excellent at their craft, and still change the conditions of the environment for the worse. Not because they’re a bad person, but because the fit with how the team operates isn’t there.

The framing I’ve landed on isn’t culture fit. That implies the culture is a fixed target and new people need to match it. Culture is always changing; the same team evolves over time, and new people are part of that evolution. The more honest question is: does this person strengthen the conditions that make great work possible? Does their presence make it easier for the people around them to get into flow, or harder?

When it isn’t right, the signs are hard to miss: more relationship management, more ambient tension, more cognitive overhead spent on interpersonal dynamics instead of the work itself. The wrong person doesn’t just underperform. They make flow harder for everyone around them, regardless of how capable they are individually.

Most career decisions skip the most important question

Joining a team is a flow decision, even when nobody frames it that way. The questions most people lead with when evaluating a role (product, compensation, trajectory, growth potential) are real, but they don’t tell you whether you’ll consistently be able to do your best work there.

The problem is that the easy questions don’t surface this. “Is the culture good?” gets a yes from everyone. “Are the people smart?” Yes. “Is it a good business?” Obviously. These answers are almost always true and tell you almost nothing useful.

What’s harder to assess, and what actually matters, is more specific. Seen through the lens of what flow actually requires:

Does failure carry social cost here? Not the official line, but what actually happens when someone gets something wrong. If mistakes are treated as evidence of inadequacy rather than information, part of your brain will always be managing that risk instead of doing the work.

Can you lose yourself here? Flow requires self-consciousness to disappear. That only happens when you trust the people around you enough to stop monitoring yourself, to speak before an idea is fully formed, to raise a concern without gaming out how it will land. After an hour talking to these people, did you feel more like yourself, or more careful?

Is feedback here something that happens with you, or to you? The distinction matters more than the timing. Fast, direct feedback exists on plenty of teams, but what’s rarer is feedback from someone who’s been paying close enough attention to say something true, and who you believe is being honest because they want you to grow rather than because it’s efficient. When that’s the context, you can push back, sit with it, have a real conversation instead of spending the rest of the day managing your reaction to it. That overhead isn’t just frustrating. It keeps you out of flow.

Do people operate from shared values, not stated ones? Shared values reduce the cognitive overhead of working together. When you’re not constantly translating or interpreting, the cognitive space that frees up is exactly where flow lives.

None of these questions are on most people’s interview checklist, but they probably should be.

The best career moves aren’t always to the biggest company or the most exciting product. Sometimes the right move is to the team where you consistently find yourself looking up at the end of the day, not quite believing what got built, because the conditions were right, not because you worked harder.

Those conditions are mostly the people around you. That’s worth looking for on purpose.


Performance Reviews Aren't Just for Feedback

Trevin Chow — Tue, 10 Feb 2026 15:03:47 GMT

Leaders are spending more time working alongside AI agents, and these digital teammates are efficient, tireless, and require no appreciation, no development conversations, no retention strategy. You can be terse with them, demanding, even dismissive of their “feelings” because they don’t have any. They work around the clock without complaint.

I’ve written before about how we need to stop thinking in human time when it comes to AI agents. But I keep coming back to the flip side: when you spend significant time working with agents that require no care, it can shape how you perceive and interact with your human colleagues.

Stanford psychologist Jamil Zaki has been tracking what he calls the “empathy gap” between leaders and their teams, and AI appears to be widening it. More than 80% of workers say AI will make human connection more important, but only 65% of managers agree. In surveys where leaders praise empathy as a core value, more than 90% of employees say their organizations still fall short. The disconnect is measurable…and it’s growing.

A Duke study published in PNAS last year examined how colleagues perceive workers who use AI:

"These judgments manifest as both anticipated and actual social penalties, creating a paradox where productivity-enhancing AI tools can simultaneously improve performance and damage one's professional reputation."

AI is already distorting how performance and competence get judged. Two popular management frameworks also aren’t helping.

Founder Mode didn't mean what people wanted it to mean

Paul Graham’s “Founder Mode” essay landed a few years ago and became one of those pieces that everyone wanted to cite to clap back at something they didn’t like in their companies. Graham’s actual argument was that founders should stay hands-on and make decisions themselves rather than defaulting to “hire good people and give them room to do their jobs.” He contrasts this with “manager mode,” which he critiques as leading to delegation that allows “professional fakers” to drive companies into the ground.

Graham predicted the misuse in his own essay at the very bottom in his final footnote:

“As soon as the concept of founder mode becomes established, people will start misusing it. Founders who are unable to delegate even things they should will use founder mode as the excuse.”

Brian Chesky, whose talk inspired the essay, has since lamented on The Verge’s Decoder podcast that:

"First of all, people don't know what founder mode is. They think it means swagger. I remember a tweet that said, 'I'm going founder mode on this burrito.' I don't know what that means. That wasn't the message."

The team at Oxide put it more bluntly in their reflection on the essay:

“Founders are at grave risk of misinterpreting Graham’s ‘Founder Mode’ to be a license to micromanage their teams, descending into the kind of manic seagull management that inhibits a team rather than empowering it.”

Leaders now use “Founder Mode” as justification for micromanagement, harsh accountability, and dismissing the human elements of leadership. But Graham’s essay is about decision-making authority and staying close to the work, not about abandoning care for your team. You can make strong decisions while also prioritizing retention and development, because these are independent variables.

Kim Scott’s Radical Candor framework gets misread the same way. Her model requires two dimensions working together: “Care Personally” and “Challenge Directly.” Drop the first half and you’re not practicing Radical Candor. Scott has a name for that quadrant: Obnoxious Aggression.

“Obnoxious Aggression is what happens when you challenge directly but fail to care personally. It’s praise that doesn’t feel sincere or criticism that isn’t delivered kindly. Obnoxious Aggression sometimes gets great results short-term but leaves a trail of dead bodies in its wake.”
—Kim Scott, author of “Radical Candor”

Leaders using “radical candor” as cover for harsh feedback without the care component aren’t being direct. They’re just being obnoxious. Period.

Performance reviews are a retention mechanism

Performance reviews serve multiple functions: development and improvement (what most leaders focus on), performance correction (what “candor” advocates emphasize), and forced communication of recognition and appreciation. The review is a structured, calendar-driven moment that exists in virtually every organization, a forcing function that means leaders don’t have to remember to schedule appreciation conversations because the system already requires them.

But if leaders approach reviews only as opportunities to deliver feedback and push for improvement, they miss out on a huge opportunity.

NPR reported on longitudinal research tracking employee career paths from 2022 to 2024, finding that employees who receive high-quality recognition are 45% less likely to leave their jobs over a two-year period, and those currently receiving meaningful recognition are 65% less likely to be actively job hunting. Yet only 22% of employees say they get the right amount of recognition for their work, and in May 2024, 51% of all U.S. employees were watching for or actively seeking a new job.

Gallup estimates that replacing an employee costs between one-half and two times their annual salary, and voluntary turnover costs U.S. businesses roughly $1 trillion annually. Those numbers get cited a lot, but the part that actually stings is more personal: losing someone you wanted to retain because they felt under-appreciated is a leadership failure. That kind of failure should haunt you, because it was preventable. The system gave you a built-in opportunity to show that person they mattered, and you spent it on pointing out what the person should be doing better.

The work that only humans can do

As leaders spend more time directing AI agents, the temptation grows to apply that same efficiency-first mindset to human teams. But humans are not AI agents: they need to feel valued, they need to know their contributions matter, and they need care. As AI handles more routine work, these elements of leadership become more important, not less. Empathy, recognition, and retention strategy are what AI cannot provide. As Stefano Corazza, head of AI research at Canva, put it at the Fortune Brainstorm AI conference:

“The more AI there is, the more authenticity is valued. If your manager really shows that he will spend time with you and cares, that goes a long way.”

If you’re heading into review conversations in the coming weeks (or months… or whenever), remember what the system is actually giving you: a scheduled moment to show your people that you see them, you value them, and you want them to stay. Accountability and genuine appreciation have always coexisted. The review is already on your calendar. That's not a burden. It's a built-in opportunity to show someone they matter.

Thanks for reading Trevin’s Notes! This post is public so feel free to share it.

HZL: A Task Ledger for AI Agents

Trevin Chow — Tue, 03 Feb 2026 16:03:16 GMT

A few weeks ago, my friend Kalid and I were eating tacos at Tacos Chukis (the best tacos in Seattle, and I will die on this hill!) and talking about OpenClaw.

If you haven’t seen it, OpenClaw is one of the more exciting projects in the AI tooling space right now. It’s an open-source framework that lets you build a deeply personalized AI assistant, and because it’s open source, it’s almost infinitely customizable. You can wire it into your calendar, your email, your file system, whatever services matter to your workflow.

We’d both been using it heavily. Kalid runs Instacalc and Better Explained, and he’s always experimenting with tools that make complex work more tractable. I’d been trying to build a “morning brief” workflow where OpenClaw would synthesize my day, pulling together my calendar, family scheduling, notable events, and things I needed to know. It required access to multiple services, research capabilities, and a fairly elaborate plan.

The problem was that OpenClaw kept losing context. So. Damn. Frustrating 🤡

Over 30 mins, my session compacted eight times. Each time, detail disappeared and I’d have to re-explain things I’d already established. Eventually I had the agent create a markdown file to use as its own memory, which worked but felt like duct tape on a structural problem.

Kalid had been hitting the same wall, so we started talking about why.

The markdown file problem

If you’ve worked with AI agents for any length of time, you’ve probably arrived at the same workaround I did: markdown files as memory.

It’s become the de facto pattern in the post-AI world. Agents create markdown files for plans, add tasks as they go, and edit the files to mark things complete. I’ve built this into custom skills and agents myself over the past year, and it kind of works.

But it doesn’t scale well. How do you query tasks across multiple files? How do you organize them across projects? You end up with a sprawl of markdown documents scattered through your repo, using folders and your filesystem as an improvised task management platform. It functions, but it’s not optimal, and it breaks down entirely when you’re coordinating multiple agents or working on projects that don’t live in a git repository at all.

Human tools don’t speak agent

The obvious next step was to use a real task manager. We tried the built-in OpenClaw tools first, but they didn’t fit. Then we considered plugging in third-party services like Todoist or Linear, and those felt wrong too.

First, they’re slow. These are remote services optimized for human interaction speeds, and when you’re coordinating agent work, you want something fast. That typically means CLI-native and local.

Second, they’re built for humans. They assume workers have persistent memory, can update their own status naturally, and communicate blockers without prompting. They’re designed for people who remember what they were doing yesterday, and agents don’t have these properties. Context windows compact, sessions end, crashes happen, you close your laptop to sleep. When you resume, the agent doesn’t remember what it was doing, and there’s no durable record of where things stand.

Third, mixing agent work with your personal task list creates its own problems: agents and humans stepping on each other, different cadences of work, different granularities of tasks. It gets messy.

There’s also the multi-agent problem. It’s increasingly common to use different models for different tasks. Claude Code for some things, Gemini for others, local models when you want speed or privacy. Sometimes the choice is capability-driven because certain models handle certain tasks better, and sometimes it’s financial because you’re balancing costs across providers. Either way, you end up coordinating work across agents that have no shared memory and no awareness of each other.

We kept circling back to the same realization: the coordination infrastructure doesn’t exist. Markdown files are too unstructured, and human tools are too slow, too feature-heavy, and designed for a different kind of worker. What’s needed is something optimized for agent access.

Building HZL

We knew there were probably a thousand projects in this space already, but this was an excuse to build something together. We’d known each other for twenty years, our families are close, our social circles overlap, and we’ve always talked shop about technology and projects, but we’d never shipped anything as collaborators. Tacos created an opportunity.

We honestly weren’t sure it would work well, but along the way we were surprised at how right it felt for the problem at hand.

HZL is an external task ledger for AI agents. It’s backend-first, CLI-native, and model-agnostic. The core idea is simple: give agents a durable place to track work that survives context compaction, session restarts, and switches between models.

The initial brainstorm took about fifteen minutes. We recorded our conversation using Granola, then fed the summary into the Compound Engineering plugin’s brainstorming skill. From there, we iterated on an implementation plan using ChatGPT 5.2 Pro for another twenty or thirty minutes.

Then we started building, using a mix of Claude Code and Gemini for implementation. The split was partly by design and partly because I ran out of my Claude Code Max limits mid-session. The switch created exactly the kind of friction HZL is meant to solve. We had to create markdown documents and ensure both Claude Code and Gemini were marking off tasks in the same file, committing updates to the repo so neither agent would re-implement completed work or miss incomplete tasks. It was ironic, and it was validating.

The first working version took about four hours end-to-end. We’ve iterated remotely over the following days, adding features and smoothing rough edges, but the core has stayed stable.

How HZL works

Some people will immediately point to Steve Yegge’s Beads (or ports like beads-rust), which I’ve written about before. It’s an exciting project with real adoption, but it never quite resonated with my workflow. The complexity and its tight focus on repositories didn’t fit how I work, and the git hook integration in particular gave me enough grief that I never settled into a comfortable rhythm with it.

HZL is simpler by design. It’s a ledger, a task tracker. There are tasks and subtasks, one level of nesting deep, and that’s it. No epics, no bug types, no elaborate hierarchies. It doesn’t try to handle orchestration or coordination, leaving that to other tools. It just gives agents a durable, queryable place to track work with minimal structure for organization.

No repo required. This matters more than it might seem. In software, we naturally think of projects as GitHub repos and codebases, but complex work often involves research, planning, and execution of things that have nothing to do with code. My morning brief project wasn’t a codebase. It was a workflow involving calendar access, email parsing, and information synthesis, and that kind of work needs task tracking too without assuming a git repository exists.

Machine-readable everything. JSON output, CLI-first interface, explicit status transitions. Agents can query and update programmatically without parsing human-friendly formatting.
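
To make "explicit status transitions" and JSON output concrete, here is a minimal sketch in Python. The status names, the allowed moves, and the task fields are assumptions for illustration; they are not taken from HZL's actual implementation.

```python
import json

# Hypothetical status names and allowed moves; HZL's real states may differ.
ALLOWED = {
    "ready": {"in_progress"},
    "in_progress": {"done", "ready"},
    "done": set(),
}


def transition(task: dict, new_status: str) -> dict:
    """Return a copy of the task with the new status, or raise on an illegal move."""
    current = task["status"]
    if new_status not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    return {**task, "status": new_status}


def to_json(task: dict) -> str:
    """Serialize with sorted keys so the output is stable for agents to parse."""
    return json.dumps(task, sort_keys=True)


task = {"id": 1, "title": "Draft morning brief", "status": "ready"}
task = transition(task, "in_progress")
print(to_json(task))
```

The point of the sketch is that an agent never has to guess what state a task is in or scrape it out of prose: every move is either allowed or rejected, and the result is plain JSON.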

Lease support. When an agent claims a task, it can take a time-limited lease. If the agent dies, crashes, or gets stuck, the lease expires and another agent can pick up the work. Tasks don’t get permanently stuck because an agent wandered off. We may add traditional assignments in the future, but time-based leases solve a real problem that static assignments don’t.
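
A time-limited lease can be sketched with nothing more than SQLite. This is an illustration of the idea, not HZL's code: the table, the column names, and the `claim` function are all hypothetical, and the current time is passed in explicitly so the example is deterministic.

```python
import sqlite3


# Hypothetical schema for illustration; not HZL's actual tables or columns.
def make_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, "
        "owner TEXT, lease_expires REAL)"
    )
    return conn


def claim(conn, task_id: int, agent: str, now: float, lease_seconds: float) -> bool:
    """Claim a task if it is unowned or its lease has expired.

    The check and the write happen in one UPDATE, so two agents
    cannot both win the same claim.
    """
    cur = conn.execute(
        "UPDATE tasks SET owner = ?, lease_expires = ? "
        "WHERE id = ? AND (owner IS NULL OR lease_expires <= ?)",
        (agent, now + lease_seconds, task_id, now),
    )
    conn.commit()
    return cur.rowcount == 1


conn = make_ledger()
conn.execute("INSERT INTO tasks (id, title) VALUES (1, 'Parse inbox')")

print(claim(conn, 1, "claude", now=0.0, lease_seconds=600))    # True: task was free
print(claim(conn, 1, "gemini", now=60.0, lease_seconds=600))   # False: lease still held
print(claim(conn, 1, "gemini", now=601.0, lease_seconds=600))  # True: lease expired
```

Nothing has to notice that the first agent died. The lease simply runs out, and the next claim succeeds.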

Checkpoint-oriented design. Agents can save progress snapshots as structured state that another agent instance can parse and continue from. When context compacts or a session restarts, the checkpoint gives the next agent what it needs to resume. There are comments too, but the design centers on machine-readable resumption rather than human-readable updates.
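
As a rough sketch of what "structured state that another agent instance can parse" might look like, here is a checkpoint round trip in Python. The `Checkpoint` fields are assumptions made up for this example, not HZL's real checkpoint format.

```python
import json
from dataclasses import dataclass, field, asdict


# Hypothetical checkpoint shape; the field names are illustrative only.
@dataclass
class Checkpoint:
    task_id: int
    completed_steps: list[str] = field(default_factory=list)
    next_step: str = ""
    notes: dict[str, str] = field(default_factory=dict)


def save(cp: Checkpoint) -> str:
    """Serialize progress as JSON that any agent instance can parse."""
    return json.dumps(asdict(cp), sort_keys=True)


def resume(raw: str) -> Checkpoint:
    """Rebuild the checkpoint after a compaction or session restart."""
    return Checkpoint(**json.loads(raw))


before = Checkpoint(
    task_id=7,
    completed_steps=["fetch calendar", "fetch email"],
    next_step="summarize",
    notes={"timezone": "America/Los_Angeles"},
)
stored = save(before)   # written to the ledger by the first agent
after = resume(stored)  # read back by whichever agent continues
print(after.next_step)  # summarize
```

The agent that resumes does not need the original conversation. It needs to know what is done, what comes next, and the few facts that were expensive to establish.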

One design choice worth noting: HZL assumes a single ledger for all your projects. We’re not doing per-folder tracking with isolated ledgers in different directories. Instead, a single HZL instance supports multiple projects through project IDs within the same ledger, which keeps things simple and queryable across everything you’re working on.

The technical implementation prioritizes speed. It’s local-first SQLite with a CLI interface because fast matters when agents are querying and updating task state. Turso cloud sync is completely optional if you want to sync to the cloud, but the primary interface is local.
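
Here is a small sketch of why a single local ledger with project IDs stays queryable across everything. It uses Python's built-in `sqlite3` with an in-memory database; the schema and the project names are hypothetical and do not reflect HZL's actual layout.

```python
import sqlite3

# Hypothetical schema: one ledger, with a project column instead of
# one database per folder. Not HZL's actual layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, project TEXT, "
    "title TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO tasks (project, title, status) VALUES (?, ?, ?)",
    [
        ("morning-brief", "Wire up calendar", "done"),
        ("morning-brief", "Summarize email", "ready"),
        ("hzl", "Add lease expiry", "ready"),
    ],
)


def open_tasks_by_project(conn: sqlite3.Connection) -> dict[str, int]:
    """Count unfinished tasks per project with a single query."""
    rows = conn.execute(
        "SELECT project, COUNT(*) FROM tasks "
        "WHERE status != 'done' GROUP BY project ORDER BY project"
    )
    return dict(rows.fetchall())


print(open_tasks_by_project(conn))  # {'hzl': 1, 'morning-brief': 1}
```

With per-folder ledgers, the same question means finding every ledger first. With one ledger, it is one query.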

There’s also a web dashboard with a read-only Kanban view, plus skills for OpenClaw, Claude Code, and policy snippets you can drop into your AGENTS.md or CLAUDE.md files.

When it fits and when it doesn’t

If you’re only using Claude Code and nothing else, the built-in task support they’re starting to build may be sufficient for your needs. But the moment you start thinking about multi-session work, even within just Claude Code, you’ll likely hit the same frustrations we did: either a sprawl of markdown files or awkward integration with human-focused project management tools.

HZL works well for multi-step work that spans sessions, workflows mixing multiple agents or models on the same project, non-code projects that still need task tracking, and “kick off a task, check back later” scenarios where you need visibility into progress.

It’s not the right tool for time-based reminders, due dates, recurring tasks, or calendar integration, and it’s not trying to replace org-wide backlogs. If you need rich human workflow features, GitHub Issues, Linear or Jira are better choices.

The sweet spot is personal and very small-team AI workflows where the tracking problem is real but the ceremony of team/enterprise tools is overkill.

From side project to public repo

Kalid and I built HZL because we kept hitting the same frustration and wanted to solve it for ourselves. We’ve known each other for twenty years but had never actually built something together, so this was a good excuse! (and tacos!)

When I’ve shown it to people running into the same walls, the reaction has been immediately positive. My friend Darren Apfel is building Limeriq, a multi-agent workflow tool for VS Code that orchestrates different models from different providers. So by design it runs directly into the tracking challenges HZL addresses. When I showed him what we’d built, his first response was:

“Well, my first feedback is that this definitely needs to exist”.

Hot damn.

We’re not claiming HZL will “change everything”, but so far it’s been hugely fun to build and a huge boost for our productivity and visibility into what’s going on. It’s made some larger work items less stressful, so hopefully it’ll work for you?

Try it

If you’re working with AI agents and running into tracking friction, give HZL a look. The documentation is at hzl-tasks.com, and you can star the repo if it’s interesting or file issues or start discussions. You can also hit us up on X at @trevin and @betterexplained. Build on!

Stop Thinking in Human Time

Trevin Chow — Fri, 16 Jan 2026 15:09:07 GMT

Last week, Claude Code was assessing the complexity of a new feature and it came back with a work breakdown and time estimate of “10-15 hours.”

Claude wasn’t wrong, exactly. It’s been trained on our documentation, our sprint retrospectives, our Linear tickets. It likely gave me back the same estimate a human would have given in a pre-AI era. But here’s the thing: the agent itself could do the work in about 15 minutes. It was estimating in human time because that’s what we’ve taught it to do.

We’ve spent decades calibrating our decisions around human labor constraints. Story points, t-shirt sizing, two-week sprints, “is the juice worth the squeeze?” All of it assumes that developer time is the scarce resource. With agents working in parallel while we sleep, that assumption is starting to look like an artifact of a world that no longer exists.

Example: Claude Code Opus estimates

The hidden cost of “do it later”

When we defer work, we tell ourselves we’re saving time, but we’re actually transferring cost to our future selves (and others) in ways that compound.

There’s the coordination overhead of tracking the work item, writing up context for whoever picks it up later, the prioritization meetings to decide if “later” ever comes. There’s the context switching tax when someone else inherits the problem and has to reconstruct what the original author was thinking. And there’s the compounding risk as more code builds on top of the deferred issue, making it progressively harder to unwind.

The 2018 Stripe Developer Report found developers spend 33-42% of their time on rework, bug fixes, and maintenance. Southwest Airlines learned this the expensive way in 2022 when deferred system updates contributed to an $825 million loss and triggered a $1.3 billion “modernization” commitment.

We historically accepted these transfer costs because immediate human time felt more expensive than future human time. But when an AI agent can fix something in minutes that a human would need hours/days/weeks to context-switch into, the calculus flips. The cost of deferring often exceeds the cost of just doing it now.

The vocabulary of human time thinking

Once you start looking for it, human time thinking still shows up everywhere in how we talk about work.

PR feedback gets marked “non blocking”, “nit pick” or “nice to have later” because asking someone to address it now feels expensive. Refactoring discussions start with “is it worth it?” as if the answer depends entirely on labor cost. Scale and performance work gets pushed off because “we don’t have that problem yet,” which is another way of saying “we don’t want to spend human time on it yet.”

The classic advice “don’t prematurely optimize” made perfect sense when optimization meant days of profiling and careful restructuring. It makes less sense when an agent can run through common optimization patterns in an afternoon while you’re in meetings.

Even agents themselves perpetuate this framing. They give estimates in hours and days because they learned from documentation written by humans for humans. They’ll tell you a migration “should take about a week” when what they mean is “this would take a human about a week, but I could do it tonight.”

Parallelization changes everything

The constraint isn’t capacity anymore, it’s coordination.

An agent can work while you sleep, and multiple agents can run simultaneously on different parts of a problem. Wall clock time and human time have decoupled in ways our planning processes haven’t absorbed.

When I run four Claude Code instances in parallel on related parts of a codebase, the current bottleneck is merge conflicts and integration decisions. The work itself happens fast, but making sure the parallel streams cohere into something that actually functions together is where the time goes. This particular friction point will get solved as tooling catches up, but the broader pattern holds: every time agents get faster, the constraint shifts to whatever humans are still doing manually.

This is a fundamentally different dynamic than “we don’t have enough developer hours.” It requires different processes, different tooling, and different intuitions about what’s expensive and what’s cheap.

The API cost distraction

Teams obsess over and scrutinize API spend, optimize prompts to reduce token usage, and track inference costs down to the penny. This makes sense as line-item accounting (which is common in large organizations), but it misses the larger picture.

Inference costs are collapsing:

Epoch AI’s analysis shows prices falling 9x to 900x per year depending on performance level
The a16z LLMflation analysis found a 10x cost decrease annually for equivalent performance, faster than the PC revolution or dotcom bandwidth expansion
Stanford’s 2025 AI Index documented a 280-fold cost reduction for GPT-3.5-level performance between November 2022 and October 2024
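
To get a feel for how fast that compounds, here is a back-of-the-envelope sketch in Python. It takes the 10x-per-year figure from the a16z analysis at face value and assumes the decline is smooth, which is a simplification; `relative_cost` is a helper made up for this example.

```python
def relative_cost(annual_factor: float, years: float) -> float:
    """Fraction of the starting price left after compounding annual declines."""
    return 1.0 / (annual_factor ** years)


# Assumes a steady 10x drop every year, per the a16z figure cited above.
for years in (1, 2, 3):
    print(years, relative_cost(10.0, years))
```

At that rate, a workload that costs a dollar today costs about a tenth of a cent three years out, which is why penny-level tracking of today's spend ages badly.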

Meanwhile, the costs we’re not measuring keep compounding: opportunity cost when deferred work blocks future features, coordination cost when context gets lost between deferrals, and risk cost when technical debt makes systems brittle in ways that only surface during incidents.

Measuring API spend while ignoring deferred work costs is like optimizing for gas mileage while ignoring that you’re driving in circles.

What actually changes

The tactical shifts are relatively straightforward once you accept the underlying premise.

For PR feedback, stop automatically ignoring anything marked “non-critical.” Low priority and nitpick comments become valid to fix immediately when agent time is cheap. You’re not asking a human to context-switch; you’re asking an agent to make a quick pass before the PR closes.

For backlog management, batch the “someday” items to an agent overnight. That pile of minor refactors and cleanup tasks that never quite makes it into a sprint? Let an agent work through it while you’re asleep. You might be surprised what’s feasible when the limiting factor isn’t human attention.

For estimation, challenge the implicit cost assumption in every “is it worth it?” question. The answer might have been “no” when it implied pulling someone off higher-priority work. It might be “yes” when it implies queuing something for an agent.

The cultural shifts run deeper. When someone asks "how long will this take?" we instinctively answer in human-hours, human-days, human-sprints. The entire grammar of estimation assumes human labor as the unit of cost. When agents work 24/7 in parallel, the complexity-to-cost relationship changes in ways that grammar wasn't designed to capture. I don't have a clean replacement yet, but I'm increasingly confident the current vocabulary will look as dated as waterfall diagrams within a few years.

The tooling gap

Our work tracking systems were designed for human workflows, and it shows.

Jira, Linear, GitHub Issues—all of them assume a human will pick up a ticket, work on it, and mark it done. None of them have good primitives for queuing work to agents, distinguishing what’s safe for autonomous handling versus what needs human judgment, or coordinating between multiple agents and humans asynchronously.

This is starting to change. Steve Yegge’s Beads project is exploring agent-optimized workflows. Tools like CrewAI, LangGraph, and AutoGen are building multi-agent orchestration patterns. Visual workflow builders like n8n and Dify are making it easier to design agent pipelines without writing custom code.

The patterns emerging include sequential handoffs, concurrent execution, maker-checker loops, and dynamic task building. These aren’t academic exercises but instead are responses to practical coordination problems that show up the moment you try to run lots of agents (and then nested sub-agents) on real work.

The villain is us

We are our own bottleneck here. We’ve internalized mental models based on constraints that no longer apply, and we keep reinforcing them in our documentation, our processes, and the training data we feed to agents.

The shift isn’t philosophical, it’s operational. Every time we ask “do we have bandwidth for this?” or “can we fit this in the sprint?” or “is this worth the engineering time?”, we’re applying a heuristic that made sense when human time was the binding constraint. When agent time is cheap and human time is better spent on judgment, taste and (some) orchestration, those questions need different answers.

The cost of inference is collapsing while the cost of deferred work compounds. The sooner our mental models catch up to that reality, the sooner we stop punting work that should just get done.

2026: The Year of Agent Orchestration

Trevin Chow — Fri, 09 Jan 2026 20:49:30 GMT

A year ago, people were still questioning whether AI could do basic math. Now we’re questioning whether we can keep up with systems that work faster, longer, and in parallel. The bottleneck has flipped, and it flipped faster than most people want to admit.

The capability is there in the foundational models, but what’s missing is the orchestration layer that makes it usable at scale. In 2026, a lot of money and attention is going to pour into orchestration and observability, first in developer tooling and then throughout roles in the rest of the organization. We will watch IDEs evolve into agent control centers, and then watch those patterns spread into product, design, marketing, operations and anywhere else work is complex and interdependent.

The teams who figure out orchestration will unlock productivity gains that make “10x engineer” sound quaint. Not because engineers suddenly become superheroes, but because the unit of work stops being a single person’s output. It becomes a coordinated system.

The bottleneck has flipped

Recently, I had 4 Claude Code instances working in parallel on related parts of a side project I’m working on. Each agent’s work was scoped carefully to minimize overlap while still being cohesive enough to merge. They finished over a 35-minute period.

Then I spent 25 minutes managing merge conflicts, which is a fun reminder that the agents did the hard part and I did the part nobody wants to.

I could have picked three unrelated areas and avoided most of the integration pain, but that is not how real work flows. Related problems need related solutions. Once you are operating in a single codebase, work entangles. When agents become capable enough to move quickly inside that entanglement, integration becomes the constraint. This is no different from humans working in the same codebase; the agents are just able to work 24x7.

The crazy part is how quickly this has evolved from being a model problem into one of coordination.

In a remarkably short time, we have gotten MCP servers, agent systems, skills frameworks, and exponentially improving foundational models. For most situations, individual task capability is not the limiting factor anymore. The challenge has shifted toward multiplying yourself across many work streams simultaneously while keeping the work coherent.

Skeptics will look at orchestration friction and say it proves AI isn’t as capable as proponents claim. The opposite is true. Merge conflicts do not exist because the agents are failing. They exist because the agents are succeeding fast enough that human coordination becomes the bottleneck. A year ago, we would not have had multiple meaningful changes worth merging in the first place.

The struggle has moved up the value chain.

Where orchestration will emerge first

We’ve been conditioned to think of IDEs as places to write code, debug code, and browse repositories. That mental model is already outdated. Tools like Cursor expanded what an IDE does by putting agents at the center, but we are still in the messy part.

The IDE of the near future will not primarily be about writing code. It will be about spinning up and controlling agents, giving them direction and context, auditing what they’ve done, and maintaining observability into parallel workstreams. Today, most agent workflows still collapse down into terminal transcripts. That interface works when there is one thread of work. It breaks down when work becomes parallel and interdependent.

Josh Puckett captured the interface problem perfectly:

This hits because not only did I grow up playing real-time strategy games, but they also inherently tackled the “coordinate many parallel workers” interface problem decades ago. They make ownership, progress, dependencies, and collisions visible. As much as I adore Claude Code and the terminal, it simply isn’t a control plane. We suffer with an inability to effectively know what agents are working on, how they intersect or how tasks relate. Once you are running even a handful of agents, you need to see the whole board.

This is why orchestration becomes inseparable from observability. Observability means you can answer, quickly and confidently, what each agent is doing, what it touched, what it assumed, what it produced, and what it is about to break. Without that, you are not orchestrating a system, you’re along for the ride (and likely in denial).

Signals that the market is already forming

The best evidence that orchestration matters is that developers are already building it for themselves. Tools are showing up because the pain is immediate and the constraint is obvious.

Steve Yegge’s “Gas Town” essay is one of the clearest articulations of the new frontier:

“I went to senior folks at companies like Temporal and Anthropic, telling them they should build an agent orchestrator, that Claude Code is just a building block, and it’s going to be all about AI workflows and ‘Kubernetes for agents’. …
‘It will have a Merge Queue,’ I said.
‘It will orchestrate workflows,’ I said.
‘It will have plugins and quality gates,’ I said.
…
So in August I started building my own orchestrator, since nobody else seemed to care.”

That quote matters because it is not a vague prediction. It names the primitives you inevitably need once work becomes parallel: merge queues, supervision layers, workflows, plugins, and quality gates. It also captures the adoption gap. Most developers are still using a single agent as an enhanced copilot. The frontier is already thinking in terms of swarms, supervision, and reliable pipelines.

Other tools and projects are emerging for the same reason. Conductor is a great example and one of my new favorite tools. On the surface it appears to be a thin shim on top of Claude Code instances and Git worktrees. However, once you dig deeper, you appreciate all the clever things their team has done to make this a much better orchestrator than you first realize.

Other projects like Claude Squad, Claude Flow, and Auto Claude have also shown up because engineers have both the incentive and the ability to scratch their own itch.

This is not limited to developers. Workflow automation platforms are becoming the bridge that brings orchestration to technical teams outside engineering. n8n is a good example of what happens when orchestration gets packaged into something usable for people who understand their business problem but do not want to operate a developer stack.

Companies do not want “AI.” They want work to get done, reliably, without the whole thing turning into a science project where you’re lighting money on fire.

“The AI race isn’t only about smarter models. It’s about who can actually put that intelligence to work reliably, inside actual businesses.”

n8n series C announcement

The velocity of change should recalibrate expectations again. The intelligence is compounding quickly, which is forcing the bottleneck to shift to the systems that coordinate it.

The new math

We used to talk about 10x engineers. But when you can spin up multiple agents working in parallel, the math changes. Leverage is no longer limited by how fast you can type or how many hours you can work. It is limited by how many agents you can direct effectively across interconnected workstreams while keeping output coherent and integration cost low.

Over time, this also changes who can do technical work. As orchestration improves, more work that used to require deep engineering expertise will be done by product managers, designers, operations teams, and domain experts who understand the problem best, with agents handling execution. The boundaries between roles soften because the limiting factor becomes coordination and judgment, not implementation.

The individual agent capabilities are largely solved. What is missing is the layer that makes those capabilities composable, parallel, and reliable.

That is why 2026 looks like a real inflection point to me. Orchestration becomes the focus, and investment pours into the tools and interfaces that let humans direct multiple agents without losing coherence. As those patterns mature, they will spread beyond engineering into product, operations, and design through interfaces that look nothing like command lines.

2026 is the year the orchestration control plane becomes real.

From Prompts to Pipelines: The OS Layer AI Agents Are Missing

Trevin Chow — Wed, 17 Dec 2025 06:06:41 GMT

In 1974, Unix introduced a radical abstraction: everything is a file. Sockets, devices, processes, all accessible through the same interface. It sounds almost obvious now. It wasn’t. Before Unix, every piece of hardware needed its own special integration. Programmers loved it. (They did not love it.)

Fifty years later, we’re building AI agents the way people built software before Unix: every integration is bespoke, state is scattered, and there’s no unified way to navigate what the system knows. It’s as if we’ve forgotten lessons of the past.

A new paper argues we need the same shift for AI. And if you look at where Claude Code, Cursor, and MCP infrastructure are headed, you’ll see it’s already happening. Slowly. Painfully. But this is the way all massive change and transformation happens.

Model intelligence isn’t the problem

You’ve probably experienced this: an AI agent that works brilliantly for 5 minutes, then loses the thread. Same prompt, same model, same tools. But it forgot what you told it three turns ago. Or it hallucinates information that doesn’t exist. Or it burns through your token budget loading context it didn’t need and then tells you it needs to compact the conversation.

You point this out. “You’re absolutely right, I apologize for the confusion!” it replies cheerfully. Then it hallucinates the same information, this time in different wording that sounds slightly more credible.

You try again, more firmly this time. You use capital letters. You add “IMPORTANT:” to the prompt. The model thanks you (again) for the clarification and does something entirely different but equally wrong.

The bottleneck isn’t “how smart is the model?” It’s “how well do we manage its context?”

That’s the argument in Everything is Context: Agentic File System Abstraction for Context Engineering. The authors propose that context engineering, not prompt engineering, is the discipline that matters now. Prompt engineering is asking “how do I phrase this so the model understands?” Context engineering is asking “why does this thing keep forgetting everything I tell it, and what infrastructure would fix that?”

What if everything the agent sees were a file?

The paper’s core move is borrowed from Unix: create a unified namespace where context sources (memory, tools, knowledge bases, human input) all appear as files and directories.

In their Agentic File System (AFS):

MCP servers, vector stores, APIs, logs, and user profiles mount into a single addressable space
Agents use a tiny tool surface: afs_list, afs_read, afs_write, afs_search, afs_exec
Backends vary underneath: relational DBs, vector stores, knowledge graphs. The agent doesn’t care. That’s the point.

Why does this matter? Because right now, every new capability means new tool definitions crammed into the context window. “Here are 47 functions you can call, please read all of them carefully before doing anything.” AFS flips that: instead of eagerly loading everything, the model discovers and loads what it needs on demand. Like a file system. Because it is one.
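To make the idea concrete, here is a minimal sketch of a unified namespace in Python. It is not the paper’s implementation: the AgenticFS class, the in-memory dict backends, and the example paths are all my own assumptions. Only the operation names afs_list, afs_read, and afs_search come from the paper’s tool surface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgenticFS:
    """Toy unified namespace: every context source is mounted under a path prefix."""

    # mount prefix -> backend, modeled here as {relative path: content}.
    # Real backends (databases, vector stores, MCP servers) are assumed away.
    mounts: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def mount(self, prefix: str, backend: Dict[str, str]) -> None:
        self.mounts[prefix.rstrip("/")] = backend

    def afs_list(self, prefix: str = "/") -> List[str]:
        """Return paths only, so the agent can discover without loading content."""
        paths = [f"{m}/{rel}" for m, backend in self.mounts.items() for rel in backend]
        return sorted(p for p in paths if p.startswith(prefix))

    def afs_read(self, path: str) -> str:
        """Load the content of one path on demand."""
        for m, backend in self.mounts.items():
            if path.startswith(m + "/"):
                return backend[path[len(m) + 1:]]
        raise FileNotFoundError(path)

    def afs_search(self, query: str) -> List[str]:
        """Naive substring search across every mounted source."""
        q = query.lower()
        return [p for p in self.afs_list() if q in self.afs_read(p).lower()]


fs = AgenticFS()
fs.mount("/tools/github", {"create_issue.md": "Create an issue in a repository."})
fs.mount("/memory/user", {"prefs.md": "Prefers TypeScript and short answers."})
print(fs.afs_list("/tools"))
print(fs.afs_search("typescript"))
```

The agent sees paths first and content second. Listing costs a few tokens, and only the files it actually reads end up in the window.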

If you read my previous post on Docker’s Dynamic MCP, this is the same instinct elevated to a more general, and powerful, layer. AFS is what Dynamic MCP plus a good memory system might look like as a first-class OS abstraction. Or: what we’d build if we admitted that “just make the context window bigger” isn’t a strategy.

3 memory layers with actual semantics

Most “memory” in current agent systems is either RAG (retrieve relevant chunks and hope for the best) or naive caching (remember the last N messages until you don’t). The paper proposes something more structured:

History. Immutable log of everything. Every interaction, every tool call, every model output. Used for provenance, debugging, compliance. Your source of truth. (Path: /context/history/)

Memory. Indexed, structured views optimized for retrieval. Episodic memory for session summaries. Fact memory for atomic statements. User memory for preferences. Procedural memory for tool definitions. (Paths like /context/memory/agentID)

Scratchpad. Temporary workspace. Agents draft plans, test hypotheses, do intermediate reasoning here. Can be selectively promoted to Memory or archived to History. (Path: /context/pad/taskID)

All transitions between layers get logged with timestamps and lineage metadata. You can trace how context evolved, which is essential once you care about governance. Or once something goes wrong and you need to figure out why.
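A rough sketch of how the three layers and the logged transitions could fit together. The ContextStore class, its method names, and the key segment appended to the memory path are assumptions of mine for illustration. The /context/pad/ and /context/memory/ path roots are the ones the paper uses.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContextStore:
    history: List[dict] = field(default_factory=list)        # append-only log
    memory: Dict[str, str] = field(default_factory=dict)     # path -> promoted fact
    scratchpad: Dict[str, str] = field(default_factory=dict)  # task id -> draft note

    def _log(self, event: str, **details: str) -> None:
        # Every transition is recorded with a timestamp and lineage details.
        self.history.append({"ts": time.time(), "event": event, **details})

    def draft(self, task_id: str, note: str) -> None:
        """Write temporary working notes for one task."""
        self.scratchpad[task_id] = note
        self._log("scratchpad_write", source=f"/context/pad/{task_id}")

    def promote(self, task_id: str, agent_id: str, key: str) -> str:
        """Move a scratchpad note into Memory and log where it came from."""
        note = self.scratchpad.pop(task_id)
        target = f"/context/memory/{agent_id}/{key}"  # the key segment is assumed
        self.memory[target] = note
        self._log("promote", source=f"/context/pad/{task_id}", target=target)
        return target


store = ContextStore()
store.draft("task-1", "User wants weekly summaries, not daily.")
path = store.promote("task-1", "agent-7", "summary_cadence")
for entry in store.history:
    print(entry["event"], entry.get("source"), entry.get("target"))
```

Because nothing moves between layers without a history entry, you can answer "where did this fact come from?" by reading the log backwards.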

This is exactly what all the “memory” tools we are seeing crop up are reaching for: structured memory graphs that let coding agents follow linked tasks over time instead of reinventing context every session. The alternative is your agent starting fresh every time like a goldfish with a very expensive API bill.

The pipeline that makes context manageable

The paper formalizes three components:

Context Constructor. Selects and prioritizes what enters the token window. Uses metadata to rank relevance. Compresses to fit budget. Outputs a manifest documenting what was included, what was excluded, and why. Finally, a system that can explain why it ignored your carefully written instructions. (I’d kill for just this part right now.)

Context Updater. Controls when and how context flows in. Static snapshots for single-turn tasks. Progressive streaming for extended reasoning. Adaptive refresh for dynamic sessions. Keeps the window coherent as the conversation evolves. This is the part most current systems skip entirely, which is why your agent gets progressively more confused over time.

Context Evaluator. Closes the loop. Validates outputs against source context. Flags hallucinations and contradictions. Writes verified results back to Memory with versioning. When confidence is low, triggers human review and stores those corrections as first-class context elements. In other words: catches the model lying and takes notes for later.
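Here is a minimal sketch of the Context Constructor step, the one I said I’d kill for. It assumes each candidate already carries a relevance score from metadata, and it counts tokens by splitting on whitespace, which is a crude stand-in for a real tokenizer. The names Item, estimate_tokens, and build_context are hypothetical, and there is no compression step here, only selection plus a manifest.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    path: str
    text: str
    relevance: float  # assumed to come from metadata; higher is better


def estimate_tokens(text: str) -> int:
    # Rough approximation: one token per whitespace-separated word.
    return len(text.split())


def build_context(items: List[Item], budget: int) -> Tuple[List[Item], List[dict]]:
    """Greedily pack the most relevant items into the budget and explain each decision."""
    included: List[Item] = []
    manifest: List[dict] = []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            included.append(item)
            used += cost
            manifest.append({"path": item.path, "included": True, "reason": "fits budget"})
        else:
            over = used + cost - budget
            manifest.append(
                {"path": item.path, "included": False, "reason": f"over budget by {over} tokens"}
            )
    return included, manifest


candidates = [
    Item("/context/memory/agent-7/style", "Use short answers", 0.9),
    Item("/context/history/turn-3", "Long tangent about deployment pipelines", 0.5),
    Item("/context/human/override-1", "Ship", 0.2),
]
chosen, manifest = build_context(candidates, budget=4)
for row in manifest:
    print(row)
```

The manifest is the interesting part. Even this toy version can tell you which item got dropped and by how many tokens it missed.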

The pipeline exists because GenAI has three architectural constraints that cascade upward:

Token windows are finite and expensive
Models are stateless between sessions
Outputs are probabilistic (same input can yield different outputs, none of them necessarily correct)

Once you internalize these constraints, “prompt engineering” definitely starts to feel like the wrong frame. Despite what every vibe coder and YouTuber says, your success won’t strictly come from crafting clever prompts. You’re managing an information lifecycle. The clever prompts are a coping mechanism and you just realized you’ve been in denial this whole time.

Humans as co-processes, not just supervisors

Here’s a choice I appreciate: human annotations, overrides, and corrections get stored as explicit context artifacts under /context/human/.

They’re versioned, queryable, reusable. Humans become co-processes in the system. Their judgment enters the same context fabric as everything else. Not as an afterthought. Not as a Slack thread someone screenshots and pastes into a prompt.
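As a sketch of what “versioned, queryable, reusable” might look like in practice, here is a tiny store for human artifacts. The HumanArtifact fields, the HumanContext class, and the naming scheme under /context/human/ are my assumptions. The paper only specifies that these artifacts live under that path.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class HumanArtifact:
    path: str     # /context/human/<name>; the <name> scheme is assumed
    version: int
    author: str
    kind: str     # "annotation", "override", or "correction"
    body: str


@dataclass
class HumanContext:
    _versions: Dict[str, List[HumanArtifact]] = field(default_factory=dict)

    def record(self, name: str, author: str, kind: str, body: str) -> HumanArtifact:
        """Append a new version instead of overwriting the previous one."""
        path = f"/context/human/{name}"
        versions = self._versions.setdefault(path, [])
        artifact = HumanArtifact(path, len(versions) + 1, author, kind, body)
        versions.append(artifact)
        return artifact

    def latest(self, name: str) -> HumanArtifact:
        return self._versions[f"/context/human/{name}"][-1]

    def query(self, kind: str) -> List[HumanArtifact]:
        """Return every version of every artifact of the given kind."""
        return [a for versions in self._versions.values() for a in versions if a.kind == kind]


human = HumanContext()
human.record("refund-policy", "dana", "override", "Refunds over $500 need manager sign-off.")
human.record("refund-policy", "dana", "override", "Refunds over $250 need manager sign-off.")
print(human.latest("refund-policy"))
```

Old versions stay around, so an auditor can see both what the rule is now and what it was when a given decision was made.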

If you’re building for regulated domains (healthcare, finance, legal), this matters enormously. You want human decisions and AI behavior recorded together, not scattered across logs, tickets, and random docs that nobody will find when the auditor asks.

The tools are already converging here

Look at what’s shipping:

Claude Code now has project memory and portions of a longer-term memory. Third-party tools like claude-mem extend it further, turning agents into persistent teammates instead of stateless interns who forget your name every morning.

Cursor and VS Code have a growing ecosystem of memory banks and MCP-based backends (Graphiti, Cline Memory Bank) keeping project context alive across sessions.

Beads provides structured memory and issue graphs for coding agents executing long task chains.

These are all pragmatic answers to the same pain points: stateless models, limited windows, the need for long-lived governed context. The industry is converging on this whether or not anyone’s read the paper.

What’s still missing

The paper is more opinionated on traceability than most current tools:

Every context transition logged as a transaction
Evaluation results, confidence scores, and human overrides stored as auditable metadata

Most agents today still operate as “black box but helpful.” You ask it to do something. It does something. Maybe the right thing. You find out eventually?

Observability is improving, but the paper’s implicit argument is: if you want AI in mission-critical workflows, governed context is as important as the best possible prompts. Probably more important. Your clever prompt isn’t going to save you when the model confidently fabricates information and the result is a disaster in one of those workflows.

Where this leads

Frontier models will keep improving. Dare I say that’s table stakes?

The paper gives you a frame for where non-model innovation happens:

Context infrastructure. Unified namespaces, progressive tool access, agentic file systems.
Memory systems. History, memory, scratchpad with lifecycle semantics, pruning, dedup.
Governance. Traceable pipelines you can replay, audit, correct.
Humans as data. Annotations and overrides as first-class context, not afterthoughts.

Claude Code, Cursor, Beads, Dynamic MCP. We’re seeing early expressions of these ideas. The paper is the architectural blueprint that explains why they’re all converging. Not because anyone coordinated. Because the problems are real and the alternatives are worse.

Did Docker Just Solve MCP’s Two Biggest Problems?

Trevin Chow — Sat, 06 Dec 2025 16:48:36 GMT

Run this command and you’ve given a stranger access to your machine:

npx -y @some-random/mcp-server

That’s how most MCP servers get installed today. No sandboxing. No verification. Just arbitrary code execution with whatever permissions your terminal has.

Docker’s blog calls it what it is: developers “making a dangerous trade-off: convenience over security.”

For teams expecting to run MCP servers in trusted environments, this is usually a non-starter. You can’t have engineers pulling random servers the way they might install npm packages for a side project. The blast radius is pretty much unlimited.

But trust isn’t the only blocker.

The Context Bloat Problem

Anthropic recently acknowledged what power users have been experiencing:

“Today developers routinely build agents with access to hundreds or thousands of tools across dozens of MCP servers. However, as the number of connected tools grows, loading all tool definitions upfront and passing intermediate results through the context window slows down agents and increases costs.”

Connect 25 MCP servers and your agent loads hundreds (or even thousands?) of tool definitions before doing anything useful. That’s slower inference, higher costs, and worse performance: the “lost in the middle” problem means models struggle to use tools buried deep in massive context windows.

“Just make the context window bigger” doesn’t solve this. You’re still paying for tokens you don’t need and degrading the model’s ability to find what matters.

Docker Built What Anthropic Proposed

Anthropic describes solutions like “progressive disclosure”: loading tool definitions on-demand rather than upfront. They suggest a search_tools capability so agents can find relevant tools without loading everything.

Docker’s Dynamic MCP implements exactly this:

mcp-find: Search the catalog for MCP servers by name or description
mcp-add: Add a server to the current session on-demand
mcp-remove: Remove servers you no longer need

Instead of pre-configuring every MCP server before starting a session, agents discover and add servers during the conversation. The context window contains only what’s actually being used ✨.
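To show the shape of that flow, here is a toy simulation of find, add, and remove against an in-memory catalog. It does not call Docker’s tooling. The CATALOG contents, the Session class, and its method names are hypothetical stand-ins for mcp-find, mcp-add, and mcp-remove.

```python
from typing import Dict, List

# Hypothetical catalog: server name -> description and the tools it exposes.
CATALOG: Dict[str, dict] = {
    "github": {"description": "Issues and pull requests", "tools": ["create_issue", "list_prs"]},
    "postgres": {"description": "Query a SQL database", "tools": ["run_query"]},
    "slack": {"description": "Send chat messages", "tools": ["post_message"]},
}


class Session:
    def __init__(self) -> None:
        # Nothing is loaded up front; the context starts empty.
        self.active: Dict[str, List[str]] = {}

    def find(self, query: str) -> List[str]:
        """Search the catalog by name or description (stand-in for mcp-find)."""
        q = query.lower()
        return sorted(
            name for name, server in CATALOG.items()
            if q in name or q in server["description"].lower()
        )

    def add(self, name: str) -> None:
        """Load one server's tools into the session (stand-in for mcp-add)."""
        self.active[name] = CATALOG[name]["tools"]

    def remove(self, name: str) -> None:
        """Drop a server that is no longer needed (stand-in for mcp-remove)."""
        self.active.pop(name, None)

    def tools_in_context(self) -> List[str]:
        return [tool for tools in self.active.values() for tool in tools]


session = Session()
print(session.tools_in_context())   # empty before any server is added
match = session.find("sql")
session.add(match[0])
print(session.tools_in_context())
session.remove(match[0])
```

The number that matters is the length of tools_in_context() at any moment. With eager loading it would be every tool in the catalog from the first turn.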

Trust Through Containerization

For the security problem, Docker’s MCP Catalog layers trust through containerization plus curation.

Docker-built servers get the full treatment:

Cryptographically signed images
Complete provenance and SBOM (Software Bill of Materials) metadata
Continuous security maintenance

Community-built servers still run containerized:

Isolated with restricted resources (1 CPU, 2GB memory)
No host filesystem access by default
Clear labeling distinguishing them from Docker-maintained options

This isn’t blind trust. It’s trust with verification and blast radius limits. For teams, the Docker-built servers provide production-level confidence. For individuals, the stakes may be lower, but containerization still limits the damage even a malicious server could do.
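For a feel of what those limits mean at the container level, here is a sketch that builds a docker run command with the same caps. I’m assuming a plain docker run invocation for illustration, which is not necessarily how Docker’s MCP tooling launches servers, and the image name is a placeholder. The --cpus and --memory flags are standard docker run options.

```python
import shlex
from typing import List


def sandboxed_run_command(image: str, cpus: float = 1, memory: str = "2g") -> List[str]:
    """Build an argv list for running a server image with resource caps."""
    return [
        "docker", "run",
        "--rm",               # remove the container when it exits
        "-i",                 # keep stdin open for a stdio-based server
        "--cpus", str(cpus),  # CPU cap
        "--memory", memory,   # memory cap
        image,                # no -v or --mount flags, so no host directories are shared
    ]


# "example/mcp-server" is a hypothetical image name.
print(shlex.join(sandboxed_run_command("example/mcp-server")))
```

Compare that to the npx one-liner at the top of this post: same convenience, but the process gets a capped sandbox instead of your whole terminal’s permissions.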

The Pace Is the Point

Here’s the timeline:

November 2022: ChatGPT launches
November 2024: Anthropic launches MCP
March 2025: OpenAI adopts MCP
November 2025: Docker ships Dynamic MCP with containerization and curation

MCP is 13 months old. We already have what feels like unlimited MCP servers, major competitors on the same open standard, and now at least one option for production infrastructure addressing the protocol’s biggest limitations.

We’re still rubbing sticks together to make fire, but we went from sticks to matches to lighters in a year.

The AI we use today is the worst it will ever be, but that’s not criticism. It’s the most optimistic thing you can say about where we’re headed.

The Takeaway?

I don’t think Docker’s Dynamic MCP is the whole story. What stands out more to me is just how quickly real problems are finding solutions in this new age of building with AI.

If you’ve been waiting because MCP felt too risky or too unwieldy for serious use, the blockers are starting to fall. Worth paying attention.

DeepSeek V3.2 Is a Reminder: Build Flexibility Into Your AI Strategy Now

Trevin Chow — Wed, 03 Dec 2025 06:32:07 GMT

DeepSeek just dropped V3.2 and its high-compute variant “Speciale,” both fully open source under an MIT license. The benchmarks are striking, but the bigger story isn’t about who’s “winning.” It’s about what this means for how you should be thinking about AI strategy.

The Numbers That Matter

Let’s start with costs, because this is where it gets real:

[Chart: Input vs Output Costs per 1M Tokens by Model]

For a workload generating 1 million output tokens per month, that’s a 20-30x cost differential.

The performance? DeepSeek V3.2-Speciale scored 96.0% on AIME 2025 (advanced math) versus GPT-5 High’s 94.6%. On Terminal Bench 2.0 (coding workflows), DeepSeek hit 46.4% compared to GPT-5’s 35.2%. The model achieved gold-medal performance in the 2025 International Mathematical Olympiad and International Olympiad in Informatics.

They achieved this partly through DeepSeek Sparse Attention (DSA), an efficient attention mechanism that cuts compute on long-context tasks while preserving output quality.

This is an open-source model matching or beating frontier closed models on key benchmarks at a fraction of the cost.

We’re Still Early. Act Like It.

Here’s what gets lost in the “who’s ahead” debates: we are still in the early, volatile phase of AI development. Today’s pricing won’t be tomorrow’s pricing. Assumptions about market structure haven’t settled.

According to Epoch AI research from March 2025, LLM inference prices are dropping between 9x and 900x per year depending on the benchmark, with a median of 50x per year. Since January 2024, that median rate has accelerated to 200x per year.
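Annual factors that large are hard to picture, so here is a quick conversion to a monthly rate. It assumes the decline is smooth and geometric across the year, which is my simplification and not something the Epoch AI figures claim.

```python
def monthly_factor(annual_factor: float) -> float:
    """Convert a yearly price-drop factor into the equivalent constant monthly factor."""
    return annual_factor ** (1 / 12)


for annual in (9, 50, 900):
    m = monthly_factor(annual)
    cheaper = 1 - 1 / m  # fraction by which the price falls each month
    print(f"{annual}x per year is about {m:.2f}x per month ({cheaper:.0%} cheaper each month)")
```

Even the slow end of that range means the price you budgeted at the start of a quarter is stale by the end of it.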

We’ve seen this pattern before. In early 2021, open-source databases overtook commercially licensed databases in popularity for the first time, according to DB-Engines. Oracle, which once dominated enterprise databases, now sits at about 12% developer usage. The premium option didn’t disappear, but it went from default choice to niche option as open alternatives matured.

AI is moving faster than databases ever did. DeepSeek caused market chaos in January with the release of R1, then went quiet for the better part of a year while OpenAI shipped GPT-5 (then 5.1) and Anthropic released the Claude 4.5 family of models. Now DeepSeek has come back hard with yesterday’s release.

The Cost Nuance

There’s an important caveat. Right now, many teams chase the frontier, upgrading to GPT-5 or Claude 4.5 as soon as they ship. This means they’re not capturing the cost reductions at the capability tier they actually need. You keep paying premium prices because you keep upgrading to premium models.

But this won’t last forever. At some point, “good enough” really is good enough for most business use cases. When teams stop reflexively chasing the latest model and instead match capability to need, the dramatic cost reductions we’re seeing will finally flow through to the bottom line.

The question is whether your architecture will be ready to take advantage of it.

The Case for Model Agnosticism

I’ve talked with various product and engineering leaders over the last 6 months, and building model flexibility has become a consistent priority. In my own side projects, I always use OpenRouter, a routing layer that lets you switch between models without changing your integration.

Tools like OpenRouter, Chutes.ai, and self-hosted options like LiteLLM aren’t fringe anymore. They’re becoming standard infrastructure for teams that smartly want optionality.

This isn’t about abandoning Claude or GPT. It’s about recognizing that when the landscape moves this fast, flexibility has real value:

Vendor leverage: Viable open-source alternatives give you negotiating power, even if you never switch.

Cost optionality: Having the ability to route workloads to cheaper models as capability gaps narrow is valuable as pricing continues to shift.

Use-case matching: Not every task needs a frontier model. Route complex reasoning to GPT-5, handle simpler tasks with DeepSeek (or another open source model).
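Use-case matching can start as something this small. The sketch below keeps the tier-to-model mapping in one table so a swap is a config change, not a code change. The tier names and model identifiers are placeholders I made up, not real catalog names, and the actual call to a routing layer is left out.

```python
from dataclasses import dataclass
from typing import Dict

# Placeholder identifiers: swap in whatever your provider or routing layer uses.
ROUTES: Dict[str, str] = {
    "complex_reasoning": "frontier-model",
    "simple": "open-weights-model",
}
DEFAULT_TIER = "simple"


@dataclass
class Task:
    prompt: str
    tier: str = DEFAULT_TIER


def pick_model(task: Task, routes: Dict[str, str] = ROUTES) -> str:
    """Choose a model by task tier, falling back to the cheap default for unknown tiers."""
    return routes.get(task.tier, routes[DEFAULT_TIER])


print(pick_model(Task("Summarize this support ticket")))
print(pick_model(Task("Plan a multi-step data migration", tier="complex_reasoning")))
```

The point of the indirection is that application code asks for a tier, never a vendor. When the capability gap narrows, you edit ROUTES and nothing else.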

…so what?

The AI landscape is more dynamic than the past few months suggested. Just because closed-source models have been making impressive strides doesn’t mean open source stopped innovating.

The practical takeaway is straightforward: if you haven’t already, focus on building flexibility into your AI stack now. Not because you need to switch today, but because the cost of that flexibility is low and the optionality is valuable in a market that’s still finding its shape.

We’re early. The model that’s best today may not be best in six months. The pricing that seems fixed today will look very different soon. Build accordingly.

Your team ships faster with AI. So why are results flat?

Trevin Chow — Sat, 22 Nov 2025 21:04:34 GMT

You open your fancy AI voice transcription app. Talk through a feature with your AI of choice. It helps you iterate, bangs out the PRD, and then an AI agent helps your engineers build it, and it ships. Done before lunch.

Everyone’s talking about teams vibe-coding their way to 10x velocity. You’re hitting those numbers too. You feel superhuman. Fifty deploys in six weeks.

Revenue is flat. Customer satisfaction moved sideways.

What’s going on?

The constraint shifted while you were celebrating velocity

For years, product teams pointed at engineering capacity as the obvious bottleneck. Designers waited weeks for engineers to implement simple flows. PMs watched great ideas pile up in backlogs with no path to production. You brainstormed “innovation” and “no meeting Wednesdays” to try to boost productivity.

AI seemed to solve this. Code generation turned specs into features overnight. No-code tools let designers ship without engineering. Natural language interfaces converted Slack messages into ~~JIRA~~ Linear tickets.

Teams declared victory over the bottleneck.

Oops. They were wrong.

The bottleneck moved to another place almost overnight: deciding what’s worth building at all.

When everything becomes easy to ship, nothing feels important to ship

Walk through a typical product team’s week after AI enters the picture:

Monday: Designer generates three homepage variants, ships the one that tested highest in the moment.

Tuesday: PM uses AI to write specs for five small features customers mentioned in passing.

Wednesday: Engineer pairs with AI to build all five features by Thursday.

Friday: Team reviews analytics. Nothing moved. The product feels less coherent than it did last month.

Here’s what went wrong: AI eliminated the friction that used to force prioritization. When shipping carried high costs, teams naturally filtered ideas through “is the juice worth the squeeze?” Now shipping feels free, so everything ships.

The backlog became an execution queue, not a strategy document.

What AI actually automates: the shallow work

If your version of product management means:

Turning messy stakeholder requests into clean requirements
Routing work between teams
Sitting in status meetings documenting decisions
Tweaking copy and UX elements in isolation
Writing specs that restate obvious points

AI handles this today. Language models convert context into artifacts. Automation tools coordinate handoffs. GenAI tools produce variants on demand.

This is the shallow layer of PM: coordination disguised as strategy, documentation disguised as decision-making.

When people say “PM is dead,” they’re describing this layer. And they’re right. It should die.

The new bottleneck: judgment under abundance

When execution gets cheap, scarcity moves to judgment.

Real product management now means:

Conviction. Your team can ship anything. Which problems actually matter? Which customer segment deserves focus? Which ideas are technically easy but strategically wrong? Conviction means saying no when saying yes feels effortless.

Taste. AI generates 20 landing page variants in 2 mins and 12 seconds. You iterate, eliminate the mediocre ones. 3 are strong and worth A/B testing. Taste means knowing the difference without running tests unnecessarily.

Coherence. Your team ships UX updates, adds features, and runs experiments. Nothing contradicts the product vision because nobody is holding one. Coherence means ensuring the product feels intentionally designed, not randomly assembled.

These skills don’t scale with AI. They get harder as AI makes everything else easier.

How the role evolves

The PM that survives AI looks nothing like the PM that preceded it.

Old PM skills:

Writing detailed specs (ask me about my Microsoft days in the early 2000s, when 30+ page specs were commonplace)
Coordinating across teams
Managing backlogs
Triaging bugs
Running status meetings and sharing notes afterwards

New PM skills:

Spending more time with customers while AI handles note synthesis
Using AI to explore ten product directions, then applying judgment to pick one
Maintaining product coherence when everyone can ship independently
Deciding which problems deserve solving when all problems feel solvable
Holding conviction when AI makes every tactic feel achievable

PMs used to ask: “How do we build this?”

The best PMs should now be asking: “Should we build this at all?”

Where you actually spend your time

Look at what consumed your time this week.

Did you spend most of it on:

Status updates and coordination
Ticket writing and spec drafting
Routing requests between teams
Making sure work happened on schedule

Or did you spend it on:

Direct conversations with customers
Turning insights into clearer strategic bets (or tests)
Aligning your team around a coherent direction
Deciding which opportunities to deliberately ignore

The first list means you’re competing with AI. The second list means AI amplifies you.

What comes next

Product management isn’t dying. It’s mutating into something different and more valuable.

The shallow version (coordination, documentation, ticket management) gets automated. The deep version (conviction, taste, coherence) becomes essential.

AI didn’t remove your bottleneck. It moved it to the one place automation struggles: judgment about what matters.

Teams that realize this early stop celebrating velocity and start investing in conviction. They use AI to handle execution while building their judgment muscle.

The PM role that survives looks less like a project coordinator and more like a curator: someone who maintains intentionality when the cost of shipping drops to zero.

That’s not dead. That’s just getting started.

Did we break product documentation? Is AI forcing us to fix it?

Trevin Chow — Sat, 22 Nov 2025 01:06:45 GMT

The rise of Model Context Protocol (MCP) servers tells an interesting story about toolchain fragmentation.

Anthropic released Claude with desktop integration. In what felt like overnight, the ecosystem exploded with MCP servers: Notion, Linear, Jira, GitHub, file systems, etc. Each one is a bridge between our AI robot friends and their necessary sources of truth and input.

We’ve distributed product context across disconnected systems, and now we’re scrambling to reconnect them for AI consumption.

Product teams adopted collaboration platforms for good reasons. Google Docs, Notion and Coda delivered real-time collab, richer formatting, cross-functional visibility. These tools improved how we all worked together.

But they also introduced distance. The PRD explaining a feature lives in Notion. The issues tracking its implementation live in Linear. The code implementing it lives in GitHub. The AI agent trying to help your engineer (or write code on its own!) needs all three.

So we build integration layers. MCPs for context retrieval, webhooks for notifications, blah blah. Each adds latency, failure modes, maintenance, etc.

Meanwhile, files (esp markdown) sitting in Git repos have always offered something different: adjacency. When documentation lives in the same repository as code, version control is unified. AI systems access context through the same interface they use for code. The integration tax drops to zero.
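Here is roughly what “the same interface they use for code” means in practice: a few lines of file I/O, no connector. This is a sketch under my own assumptions. The collect_docs function and the max_chars cap are hypothetical, and a real agent would be pickier about which files it loads.

```python
from pathlib import Path
from typing import Dict


def collect_docs(repo_root: str, max_chars: int = 20_000) -> Dict[str, str]:
    """Read every markdown file in a checked-out repo, keyed by its repo-relative path."""
    root = Path(repo_root)
    docs: Dict[str, str] = {}
    for path in sorted(root.rglob("*.md")):
        rel = path.relative_to(root)
        if ".git" in rel.parts:
            continue  # skip Git's internal directory
        # Truncate very large files so one document cannot swamp the context.
        docs[rel.as_posix()] = path.read_text(encoding="utf-8")[:max_chars]
    return docs
```

Calling collect_docs(".") from a repo root returns the PRDs, READMEs, and design notes sitting next to the code, at whatever commit is checked out. The docs and the code they describe are versioned together for free.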

The tradeoff is real. Git workflows remain challenging for non-technical team members. Markdown lacks the collaborative features—inline comments, @mentions, rich tables—that product teams depend on. This isn’t a simple swap.

But it suggests an architectural question: what if the next generation of product tools built on Git as infrastructure rather than treating it as one integration target among many?

You’d preserve proximity while rebuilding the collaboration layer on top. Product context becomes a first-class artifact in the repository. AI systems get native access. PMs get the UX they need.

The interoperability boom isn’t a mistake, it’s a rational response to toolchain sprawl. But the fact that we need it signals a deeper fragmentation we’ve normalized.

Product documentation’s center of gravity is shifting back toward code. The question is whether we should keep riding that shift or steer it in another direction.

What’s your team’s documentation strategy in the AI era?