Effective harnesses for long-running agents

122 points by diwank 2 days ago

roughly 2 days ago

One of the things that makes it very difficult to have reasonable conversations about what you can do with LLMs is that the effort-to-outcome curve is basically exponential - with almost no effort, you can get 70% of the way there. This looks amazing, and so people (mostly executives) look at this and think, “this changes everything!”

The problem is the remaining 30% - the next 10-20% starts to require things like multi-agent judge setups, external memory, context management, and that gets you to something that’s probably working but you sure shouldn’t ship to production. As to the last 10% - I’ve seen agentic workflows with hundreds of different agents, multiple models, and fantastically complex evaluation frameworks to try to reduce the error rates past the ~10% mark. By a certain point, the infrastructure and LLM calls are running into several hundred dollars per run, and you’re still not getting guaranteed reliable output.

If you know what you’re doing and you know where to fit the LLMs (they’re genuinely the best system we’ve ever devised for interpreting and categorizing unstructured human input), they can be immensely useful, but they sing a siren song of simplicity that will lure you to your doom if you believe it.

zephyrthenoble 2 days ago

Yes, it's essentially the Pareto principle [0]. The LLM community has conflated the 80% as difficult complicated work, when it was essentially boilerplate. Allegedly LLMs have saved us from that drudgery, but I personally have found that (without the complicated setups you mention) the 80% done project that gets one shot is in reality more like 50% done because it is built on an unstable foundation, and that final 20% involves a lot of complicated reworking of the code. There's still plenty of value but I think it is less than proponents would want you to believe.
Anecdotally, I have found that even if you type out paragraph after paragraph describing everything you need the agent to take care of, it eventually feels like you could have written a lot of the code yourself with the help of a good IDE by the time you can finally send your prompt off.
- [0] https://en.wikipedia.org/wiki/Pareto_principle
- roughly 2 days ago
  
  Yeah, my mental model at this point is there’s two components to building a system: writing the code and understanding the system. When you’re the one writing the code, you get the understanding at the same time. When you’re not, you still need to put in that work to deeply grok the system. You can do it ahead of time while writing the prompts, you can do it while reviewing the code, you can do it while writing the test suite, or you can do it when the system is on fire during an outage, but the work to understand the system can’t be outsourced to the LLM.
- beefnugs 2 days ago
  
  This can't really be the full story, or else people would have already come up with the "first line developer" like first line support. There is a dumbass or executive who creates that first 70 or 80%, then hands off the entire thing to a professional developer to keep working on it.
  The AI people sure don't want that; that's too telling about its limitations and value.
- peab 2 days ago
  
  Except in the past, I'd perhaps have to hire a junior engineer to do that 80%. Now I don't need to do that.
theLiminator 2 days ago

> If you know what you’re doing and you know where to fit the LLMs (they’re genuinely the best system we’ve ever devised for interpreting and categorizing unstructured human input), they can be immensely useful, but they sing a siren song of simplicity that will lure you to your doom if you believe it.
I imagine using their embeddings and training a classifier on top of that is probably a lot more effective?
I've personally found agentic LLM workflows the most effective as extremely sophisticated autocomplete. Instead of autocompleting the current next few tokens, I tell it precisely how to edit my code at a high level. You can't tell it stuff at a feature level, but telling it how to implement the feature saves me a ton of time.
- roughly a day ago
  
  > I imagine using their embeddings and training a classifier on top of that is probably a lot more effective?
  I’d be interested in seeing this in action - I think the vector embeddings are underused generally - but my understanding is that’d be for something closer to sentiment analysis? In this case I’m talking about a setup closer to where you’ve got an LLM agent with a set of tools that’s interpreting a user’s request to identify which of those tools are the right ones to use. The requests can be complex, and involve multiple tool runs or chaining. If that’s doable by more deterministic mechanisms, I’d (genuinely) love to hear about it.
- peab 2 days ago
  
  If you're only working on one problem that's very valuable to solve, then taking the time to train a classifier is great.
  The beauty of LLMs is that you can run a ton of experiments, notebooks, demos etc because you can write classifiers and structure unstructured data so fast, in a reasonably accurate way (at the moment it seems roughly in line with say hiring an intern to label things)
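
For the embeddings-plus-classifier idea raised in this sub-thread, a minimal sketch might look like the following. Everything here is an assumption for illustration: `embed` is a toy hashed bag-of-words stand-in for a real embedding model, nearest-centroid is just one simple classifier choice, and the labels and example requests are made up.

```python
import math
import zlib
from collections import defaultdict

DIM = 256  # size of the toy embedding vector (assumption)


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed word counts, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def train_centroids(examples: list[tuple[str, str]]) -> dict[str, list[float]]:
    """Average the embeddings of each label's examples into one centroid."""
    sums: dict[str, list[float]] = defaultdict(lambda: [0.0] * DIM)
    counts: dict[str, int] = defaultdict(int)
    for text, label in examples:
        for i, v in enumerate(embed(text)):
            sums[label][i] += v
        counts[label] += 1
    return {label: [v / counts[label] for v in vec] for label, vec in sums.items()}


def classify(text: str, centroids: dict[str, list[float]]) -> str:
    """Return the label whose centroid has the largest dot product with the text."""
    vec = embed(text)
    return max(centroids, key=lambda lb: sum(a * b for a, b in zip(vec, centroids[lb])))


# Hypothetical labeled requests.
EXAMPLES = [
    ("refund my invoice", "billing"),
    ("invoice charge wrong", "billing"),
    ("find documents about taxes", "search"),
    ("search documents for report", "search"),
]
centroids = train_centroids(EXAMPLES)
predicted = classify("refund invoice", centroids)
```

This covers single-label routing only; the multi-tool chaining case described in the reply above is not something a classifier like this addresses.
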
mips_avatar 2 days ago

I now think the key is you avoid long running conversations. If the piece didn’t work out by the time you hit 200k context on Claude you are going to start over. Take whatever wins you learned from the first stab and give those insights to the model on round two, but throw the code out.
- mips_avatar 2 days ago
  
  Maybe Claude’s long running agent should just be hunting for any wins during the first 200k, chucking it away, and seeing if those wins change the initial goal.
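
The restart-and-carry-over approach described above could be sketched roughly as follows. This is not how any particular agent works; `run_attempt`, `AttemptResult`, and `fake_attempt` are hypothetical names, and the 200k figure is simply the one from the comment.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

CONTEXT_BUDGET_TOKENS = 200_000  # the threshold mentioned in the comment


@dataclass
class AttemptResult:
    succeeded: bool
    tokens_used: int
    lessons: list[str] = field(default_factory=list)  # the "wins" worth keeping


def build_prompt(goal: str, lessons: list[str]) -> str:
    """Start each round from the goal plus lessons only, never the old code."""
    if not lessons:
        return goal
    bullets = "\n".join(f"- {lesson}" for lesson in lessons)
    return f"{goal}\n\nLessons from earlier attempts:\n{bullets}"


def run_with_restarts(
    goal: str,
    run_attempt: Callable[[str, int], AttemptResult],
    max_attempts: int = 3,
) -> tuple[Optional[int], list[str]]:
    """Return (attempt number that succeeded or None, lessons gathered)."""
    lessons: list[str] = []
    for attempt in range(1, max_attempts + 1):
        result = run_attempt(build_prompt(goal, lessons), CONTEXT_BUDGET_TOKENS)
        if result.succeeded:
            return attempt, lessons
        # Attempt failed or ran out of budget: throw the code away, keep the lessons.
        lessons.extend(result.lessons)
    return None, lessons


def fake_attempt(prompt: str, budget: int) -> AttemptResult:
    """Stand-in for a real agent run; succeeds once it has been told the lesson."""
    if "use the streaming parser" in prompt:
        return AttemptResult(succeeded=True, tokens_used=80_000)
    return AttemptResult(False, budget, ["use the streaming parser"])


winning_attempt, carried_lessons = run_with_restarts("Build the importer", fake_attempt)
```
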
morkalork 2 days ago

Just for getting a frame of reference, how many people were involved over how much time building a workflow with hundreds of agents?
- roughly 2 days ago
  
  I’ve seen a couple solo efforts and a couple teams, but usually a few months. It tends to evolve as a kind of whack-a-mole situation - “we solved that failure case/hallucination, now we’re getting this one.”
  - ineedasername 2 days ago
    
    My sense is that these are organizations where they probably recreated, with some minor details changed, the same problems they already had. Under-planned and over-engineered in the attempt to fix, and if things ever work it's more from some awful metastable chaos than anything else.

_boffin_ 2 days ago

…it really feels like they’re attempting to reinvent a project tracker and starting off from scratch in thinking about it.

It feels like they’re a few versions behind what I’m doing, which is… odd.

Self-hosting a plane.io instance. Added a plane MCP tool to my codex. Added workflow instructions into Agents.md which cover standards, documentation, related work, labels, branch names, adding of comments before plan, after plan, at varying steps of implementation, summary before moving ticket to done. Creating new tickets and being able to relate to current or others, etc…

It ain’t that hard. Just do inception (high to mid level details) create epics and tasks. Add personas, details, notes, acceptance criteria and more. Can add comments yourself to update. Whatever.

Slice tickets thin and then go wild. Add tickets as you're working through things. Make modifications.

Why so difficult?

threecheese 2 days ago

This is actually very interesting I think, as Anthropic pushes against The Bitter Lesson a bit! The model is a great reasoner, but we still need a concrete way to manage tasks - like we needed for tool calling. Claude Code has an opinionated loop, something like ReAct/CoT etc with prompting tricks for tasks/skills/etc, but here they add a Hierarchical Controller/Worker thing leveraging the Claude SDK. Mixing agency with actual control using program logic - not just alignment using prompts screaming in all caps and emoji.
We are going to break out of the coding agent’s loop in this way - it’s sorta curving back around to Workflows, after leaving them behind for agency, but right now we need to orchestrate this with deterministic code written mostly by humans - like the git repo anthropic shared. This won’t last long.
imron 2 days ago

This was my take.
They’ve made an issue tracker out of json files and a text file.
Why not hook an mcp to an actual issue tracker?
- _boffin_ 2 days ago
  Used an LLM to help write the following up as I’m still pretty scattered about the idea and on mobile.
  ——
  Something I’ve been going over in my head:
  I used to work in a pretty strict Pivotal XP shop. PM ran the team like a conductor. We had analysts, QA, leads, seniors. Inceptions for new features were long, sometimes heated sessions with PM + Analyst + QA + Lead + a couple of seniors. Out of that you’d get:
  - Thinly sliced epics and tasks
  - Clear ownership
  - Everyone aligned on data flows and boundaries
  - Specs, requirements, and acceptance criteria nailed at both high- and mid-level
  At the end, everyone knew what was talking to what, what “done” meant, and where the edges were.
  What I’m thinking about now is basically that process, but agentized and wired into the tooling:
  - Any ticket is an entry point into a graph, not just a blob of text.
    - Epics ↔ tasks ↔ subtasks
    - Linked specs / decisions / notes
    - Files and PRs that touched the same areas
  - Standards live as versioned docs, not just a random Agents.md:
    - Markdown (with diagrams) that declares where it applies: tags, ticket types, modules.
    - Tickets can pin those docs via labels/tags/links.
  - From the agent’s perspective, the UI is just a viewer/editor.
    - The real surface is an API: “given this ticket, type, module, and tags, give me all applicable standards, related work, and code history.”
  - The agent then plays something like the analyst + senior engineer role:
    - Pulls in the right standards automatically
    - Proposes acceptance criteria and subtasks
    - Explains why a file looks the way it does by walking past tickets / PRs / decisions
  So it’s less “LLM stapled to an issue tracker” and more “that old XP inception + thin-slice discipline, encoded as a graph the agent can actually reason over.”
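
The "given this ticket, type, module, and tags, give me all applicable standards and related work" surface described above might be sketched like this. It is only an in-memory illustration under stated assumptions: `StandardDoc`, `Ticket`, `applicable_standards`, and `related_work` are hypothetical names, not part of any tracker's API, and code history is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class StandardDoc:
    """A versioned standards doc that declares where it applies."""
    path: str
    version: str
    tags: frozenset = frozenset()
    ticket_types: frozenset = frozenset()
    modules: frozenset = frozenset()


@dataclass
class Ticket:
    key: str
    ticket_type: str
    module: str
    tags: set = field(default_factory=set)
    epic: Optional[str] = None


def applicable_standards(ticket: Ticket, docs: list[StandardDoc]) -> list[StandardDoc]:
    """A doc applies if it matches the ticket's type, module, or any tag."""
    return [
        doc for doc in docs
        if ticket.ticket_type in doc.ticket_types
        or ticket.module in doc.modules
        or ticket.tags & doc.tags
    ]


def related_work(ticket: Ticket, all_tickets: list[Ticket]) -> list[Ticket]:
    """Other tickets in the same epic or touching the same module."""
    return [
        other for other in all_tickets
        if other.key != ticket.key
        and (other.module == ticket.module
             or (ticket.epic is not None and other.epic == ticket.epic))
    ]


# Hypothetical data.
DOCS = [
    StandardDoc("standards/logging.md", "v3", modules=frozenset({"payments"})),
    StandardDoc("standards/api-style.md", "v1", tags=frozenset({"api"})),
    StandardDoc("standards/ui.md", "v2", modules=frozenset({"frontend"})),
]
TICKETS = [
    Ticket("PAY-1", "task", "payments", {"api"}, epic="EPIC-7"),
    Ticket("PAY-2", "bug", "payments", set(), epic="EPIC-7"),
    Ticket("WEB-9", "task", "frontend", {"api"}, epic="EPIC-8"),
]
standards_for_pay1 = [d.path for d in applicable_standards(TICKETS[0], DOCS)]
related_to_pay1 = [t.key for t in related_work(TICKETS[0], TICKETS)]
```
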
- beefnugs 2 days ago
  
  Has any project tried forcing a planning layer as //TODO all throughout the code before making any changes? small loops like one //TODO at a time? What about limiting changes to a function at a time to remain focused? Or is everyone a slave to however the model was designed and currently they are designed for giant one-shot generations only?
  Is it possible that all local models need to be better is more context used to make simpler smaller changes at a time? I haven't seen enough specific comparisons of how local models fail vs the expensive cloud models.
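
One way to picture the "one //TODO at a time" loop asked about above is a harness that hands the agent only the next unresolved marker. The sketch below is an assumption about how that could work, not a description of an existing project; `next_todo` and the file contents are hypothetical.

```python
import re
from typing import Optional

# Matches "// TODO ..." and "# TODO ..." comments, with or without a colon.
TODO_PATTERN = re.compile(r"(?://|#)\s*TODO\b:?\s*(.*)")


def next_todo(files: dict[str, str]) -> Optional[tuple[str, int, str]]:
    """Return (path, line number, text) of the first TODO, or None when all are done."""
    for path in sorted(files):
        for line_no, line in enumerate(files[path].splitlines(), start=1):
            match = TODO_PATTERN.search(line)
            if match:
                return path, line_no, match.group(1).strip()
    return None


# Hypothetical planning layer written into the code before any changes.
FILES = {
    "b.py": "x = 1\n# TODO: validate input\n",
    "a.js": "//TODO parse header\nlet y = 2;\n",
}
first_task = next_todo(FILES)
```

The harness would pass only `first_task` to the model, apply the change, and call `next_todo` again, which keeps each edit small and focused.
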
adamgordonbell 2 days ago

I did find beads helpful for some of these multi-context-window tasks. It sounds a little like there is some convergence between what they are suggesting and how it gives you lightweight subtasks that survive a /clear.
- _boffin_ a day ago
  
  Love the show!
  > It sounds a little like there is some convergence between what they are suggesting and how it gives you lightweight subtasks that survive a /clear.
  I do see the convergence there. Beads gives you that "state that survives `/clear`," and Anthropic’s harness tries to do something similar at a higher level.
  I've been thinking about this with a pretty simple, old-school analogy:
  You're at a shop with solid engineering and ticketing practices. You just hired a great junior developer. They know the stack, maybe even the domain basics, but they don't yet know:
  - Your business processes
  - The quirks of your microservices
  - Local naming conventions, standards, etc.
  - Team norms around testing, logging, and observability
  You trust them with important tasks, but expect their context will frequently get blown away by interruptions, meetings, task-switching, and long weekends. To handle this, you need to make sure each ticket or note contains enough structured info so that when they inevitably lose context, they can pick right back up.
  For each ticket, you'd likely include:
  - Personas and user goals
  - Acceptance criteria, Given/When/Then scenarios
  - Links to specs, documentation, related tickets, or prior art
  - A short summary of their current understanding
  - Rough plan (steps, what's done/not done)
  - Decisions made and their rationale ("I chose X because Y")
  - Open questions or known gotchas
  End of day Friday, that junior would ideally leave notes that answer: "If I have total amnesia next Tuesday, what's the minimum needed to quickly reload my context?"
  To me, agent harnesses like Anthropic's or Beads are just formalizing exactly this pattern:
  - `/clear` or `/new` is like a "long weekend brain wipe."
  - Persistent subtasks or controllers become structured scaffolding.
  - The crucial piece isn't remembering everything, just clearly capturing intent, decisions, rationale, and immediate next steps.
  My confusion about Anthropic’s approach is why they're doing this over plain text files or JSON, instead of leveraging decades of existing tracker and project-management tooling—which already encode this exact workflow and best practice.
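
The "total amnesia next Tuesday" note described above can be sketched as a small structured record that survives a context reset. The field names below follow the list in the comment, but `HandoffNote` and its helpers are hypothetical, and JSON is used only because it is easy to round-trip, not because any harness requires it.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffNote:
    ticket: str
    summary: str                                              # current understanding
    done: list[str] = field(default_factory=list)             # completed plan steps
    remaining: list[str] = field(default_factory=list)        # next steps, in order
    decisions: dict[str, str] = field(default_factory=dict)   # decision -> rationale
    open_questions: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)            # specs, related tickets


def save_note(note: HandoffNote) -> str:
    """Serialize the note so it outlives the session."""
    return json.dumps(asdict(note), indent=2)


def load_note(text: str) -> HandoffNote:
    return HandoffNote(**json.loads(text))


def reload_prompt(note: HandoffNote) -> str:
    """Minimum context to hand a fresh session after a /clear."""
    lines = [f"Ticket {note.ticket}: {note.summary}"]
    lines += [f"Decided: {d} (because {why})" for d, why in note.decisions.items()]
    lines += [f"Open question: {q}" for q in note.open_questions]
    if note.remaining:
        lines.append(f"Next step: {note.remaining[0]}")
    return "\n".join(lines)


# Hypothetical example.
note = HandoffNote(
    ticket="PAY-1",
    summary="Retry logic for failed card charges",
    done=["wrote failing test"],
    remaining=["add backoff", "update docs"],
    decisions={"exponential backoff": "provider rate limits"},
    open_questions=["max retries?"],
)
restored = load_note(save_note(note))
prompt = reload_prompt(restored)
```
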
tomwojcik 2 days ago

Did you mean plane.so instead of plane.io?
- imron 2 days ago
  
  I assumed they meant https://github.com/makeplane/plane
  - _boffin_ 2 days ago
    
    Correct. My bad

rancar2 2 days ago

Having done this for Gemini CLI several months ago, to get it to behave well as a non-coding LLM CLI without costs, I can attest that these tips work well across CLIs.

adidoit 21 hours ago

Fascinating that the state-of-the-art in building agentic harnesses for long running agent workflows is to ... "use strong-worded instructions"

Anthropomorphism of LLMs is obviously flawed but remains the best way to actually build good Agents.

I do think this is one thing that will hold enterprise adoption back: can you really trust systems like these in production where the best control you can offer is that you're pleading with it to not do something?

Of course good engineering will build deterministic verification and scaffolds in to prevent issues, but it is a fundamental limitation of LLMs.
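
As one illustration of deterministic verification sitting alongside the prompt, a harness could reject any agent edit to a task file that touches more than a status flag. The file shape, the `passes` field, and the function name below are assumptions made for this sketch.

```python
import json

MUTABLE_FIELDS = {"passes"}  # the only field the agent may change (assumption)


def illegal_changes(before_json: str, after_json: str) -> list[str]:
    """List every change that goes beyond flipping an allowed field."""
    before = {item["id"]: item for item in json.loads(before_json)}
    after = {item["id"]: item for item in json.loads(after_json)}
    problems = [f"{fid}: removed" for fid in before.keys() - after.keys()]
    problems += [f"{fid}: added" for fid in after.keys() - before.keys()]
    for fid in before.keys() & after.keys():
        old, new = before[fid], after[fid]
        for key in old.keys() | new.keys():
            if key not in MUTABLE_FIELDS and old.get(key) != new.get(key):
                problems.append(f"{fid}: field '{key}' changed")
    return sorted(problems)


# Hypothetical task file before and after an agent session.
BEFORE = json.dumps([
    {"id": "f1", "description": "login works", "passes": False},
    {"id": "f2", "description": "logout works", "passes": False},
])
GOOD_EDIT = json.dumps([
    {"id": "f1", "description": "login works", "passes": True},
    {"id": "f2", "description": "logout works", "passes": False},
])
BAD_EDIT = json.dumps([
    {"id": "f1", "description": "login mostly works", "passes": True},
])
good_report = illegal_changes(BEFORE, GOOD_EDIT)
bad_report = illegal_changes(BEFORE, BAD_EDIT)
```

A check like this runs outside the model, so it holds whether or not the model follows its instructions.
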

ford 2 days ago

I think we take git for granted as software engineers. Software engineering has decades of experience with proposing changes, merging them, staging them, deploying them, and rolling them back, and collaborating with other code-writers (engineers and agents).

I'm very interested in what this will look like for outputs from other job functions. And if we'll end up with a similar framework that makes non-deterministic, often-wrong LLMs easier to work with.

daxfohl 2 days ago

IME a dedicated testing / QA agent sounds nice but doesn't work, for the same reasons as AI / human interaction. The more you try to diverge from the original dev agent's approach, the less and less chance there is that the dev agent will get to where you want it to be. Far more frequently it'll get stuck in a loop between two options that are both not what you want.

So adding a QA agent, while it sounds logical, just ends up being even more of this. Rather than converging on a solution, they just get all out of whack. Until that is solved, far better just to have your dev agent be smart about doing its own QA.

The only way I could see the QA agent idea working now is if it had the power to roll back the entire change, reset the dev agent, update the task with some hints of things not to overlook, and trigger the dev process from scratch. But that seems pretty inefficient, and IDK if it would work any better.
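
The rollback-and-retry variant described above could be sketched as a plain control loop. The callables `run_dev`, `run_qa`, and `rollback` are hypothetical placeholders for whatever actually drives the agents and the repository; whether this converges better in practice is exactly the open question raised in the comment.

```python
from typing import Callable, Optional


def dev_qa_loop(
    task: str,
    run_dev: Callable[[str], str],        # fresh dev agent: prompt -> change
    run_qa: Callable[[str], list[str]],   # QA agent: change -> list of findings
    rollback: Callable[[], None],         # discard the whole change
    max_rounds: int = 3,
) -> Optional[str]:
    """Return an accepted change, or None if QA never passes."""
    hints: list[str] = []
    for _ in range(max_rounds):
        prompt = task
        if hints:
            prompt += "\n\nDo not overlook:\n" + "\n".join(f"- {h}" for h in hints)
        change = run_dev(prompt)  # dev agent starts from scratch every round
        findings = run_qa(change)
        if not findings:
            return change
        rollback()  # reset instead of negotiating fixes with the dev agent
        hints.extend(f for f in findings if f not in hints)
    return None


# Fake agents, for illustration only.
rollbacks: list[int] = []
fake_dev = lambda prompt: "with null check" if "null input" in prompt else "naive"
fake_qa = lambda change: [] if "null check" in change else ["handle null input"]
accepted = dev_qa_loop("Implement parser", fake_dev, fake_qa, lambda: rollbacks.append(1))
```
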

awayto 2 days ago

> Run pwd to see the directory you’re working in. You’ll only be able to edit files in this directory.

If you're using the agent to produce any kind of code that has access to manipulate the filesystem, may as well have it understand its own abilities as having the entirety of CRUD, not just updates. I could easily see the agent talking itself into working around "only be able to edit" with its other knowledge that it can just write a script to do whatever it wants. This also reinforces to devs that they basically shouldn't trust the agent when it comes to the filesystem.

As for pwd for existing projects, I start each session running tree local to the part of the project filesystem I want to have worked on.
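
Since a prompt line like "you'll only be able to edit files in this directory" is not an enforced boundary, a harness can check paths itself before executing any file operation. A minimal sketch, assuming Python 3.9+ for `is_relative_to`; the function name and example paths are hypothetical, and this only guards operations the harness mediates, not scripts the agent writes and runs.

```python
from pathlib import Path


def is_inside(root: Path, candidate: str) -> bool:
    """True if candidate, interpreted relative to root, stays within root."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks; an absolute
    # candidate replaces root entirely when joined, so it is also caught.
    target = (root / candidate).resolve()
    return target.is_relative_to(root)


PROJECT_ROOT = Path("/tmp/agent_project")  # hypothetical working directory
allowed = is_inside(PROJECT_ROOT, "src/main.py")
escaped = is_inside(PROJECT_ROOT, "../other/secret.txt")
absolute = is_inside(PROJECT_ROOT, "/etc/passwd")
```

The same check applies to create, read, update, and delete alike, which matches the point above that the agent effectively has all of CRUD.
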

CurleighBraces 2 days ago

I wonder how good these agents would be using something like cucumber and behaviour driven development tools?

wild_egg 2 days ago

Basically experts at it, works great.

dangoodmanUT 2 days ago

> … the model is less likely to inappropriately change or overwrite JSON files compared to Markdown files.

Very interesting.

johndhi 2 days ago

Why do we need long running agents? Most of my experienced value with LLMs has been like 1 to 10 turn chats. Should they just ban longer chats to solve these issues?

vidarh 2 days ago

Because you get the biggest time-savings when you can let it run longer between each time it needs a human in the loop.
I have multi-week runs of Claude Code going to work on a compiler project. I have a week-long run of Claude Code where it is writing a real-time strategy game.
In both cases I occasionally review code, and complain a bit about things it has gotten wrong until it's back on track. In both cases it is working to specs that have produced plans that have produced TODO lists. In the latter it wrote the specs itself. In the former, the specs are externally imposed (rubyspecs test suite).
In both cases it means I get involved anywhere from every few tens of minutes to every few hours, but mostly just to confirm it can continue, with more detailed reviews every day or so.
Having to review output and give instructions every turn would drastically diminish the value.
- maxlamb 2 days ago
  
  Roughly how much per day do these multi-week runs end up costing?
  - vidarh 2 days ago
    
    I'm on the top Claude Max tier. Just upgraded. I could probably make do with the lower one, but I hit the limit (for the first time) on the lower Claude Max this week, and I get enough value from it that I was not about to wait for a session reset.
ford 2 days ago

"Why are we trying to make Yahoo Search faster? I already am fine with my 2-3s wait time"

slurrpurr 2 days ago

BDSM for LLMs