At ArcFusion, we believe in learning out loud. We share what we build, how we build it, and what we learn along the way. This post is adapted from one of our internal engineering knowledge-sharing sessions.
Most AI agents plateau. You deploy them, they perform reasonably well, and then, quietly, they stop improving. When they fail on an edge case, they fail the same way next time. There's no mechanism for them to learn from mistakes. The team patches prompts by hand, re-deploys, and hopes the next batch of real-world inputs doesn't surface something new.
This is the default state of almost every production AI deployment today. The agents aren't the problem. The missing feedback loop is.
We built one. This post walks through the automated prompt engineering workflow we use in production: an orchestrator agent that coordinates evaluation, optimization, and version control, taking our document parsing accuracy from 86% to 99.8% without a single manual prompt tweak after the loop was running.
The Problem Worth Solving First
Before we get into the architecture, let's be concrete about what we were trying to solve, because the technique generalizes far beyond our specific use case.
We build AI systems that parse construction documents. When someone uploads a Bill of Quantities (BOQ) spreadsheet, our platform needs to understand its structure: which column is the item number, which holds quantities, where one section ends and another begins. Every contractor formats these documents differently. Merged cells, inconsistent headers, metadata rows stuffed above the actual data. The failure mode isn't dramatic; it's quiet. The system parses most files correctly and silently misclassifies the edge cases.
The business impact is direct: a misclassified column means wrong data flows downstream, and a human has to catch and correct it. At scale, that's not a quality problem. It's a cost problem.
We needed an LLM to handle the classification reliably, and we needed a way to improve that classification systematically, without a team of engineers manually tweaking prompts in an endless loop.
Because the system runs on every uploaded file in production and per-call cost matters, we chose Gemini Flash Lite, a small, fast, cost-efficient model. The catch: small models are significantly more sensitive to prompt wording. A prompt that works perfectly on ten test files can silently fail on the eleventh because a contractor used an unexpected column label. This is where most teams give up and accept "good enough." We didn't.
The Architecture: An Orchestrator Agent With a Self-Improvement Loop
The core idea is straightforward: instead of humans reading failure reports and hand-editing prompts, an orchestrator agent does it, guided by a deterministic evaluation harness and structured optimization techniques.
Here's how the agents are arranged:
- Orchestrator (Claude Opus): Owns the overall optimization strategy. Decides when to stop iterating, manages ground truth updates, and reviews prompt changes for over-specificity. This is the "senior engineer" in the loop.
- Prompt Engineering Agent (Claude Sonnet): Does the actual prompt modification work. Reads the evaluation report, applies structured techniques (few-shot example selection, instruction rewriting, constraint tightening), and commits the updated prompt.
- Evaluation Harness: A deterministic script that runs the production prompt against a labeled test suite and produces structured failure reports. This is not an agent. It's a script, and that's intentional. Determinism matters when you're tracking optimization progress.
The orchestrator coordinates these components. It doesn't hand-craft prompts. It directs the prompt engineering agent toward specific failure patterns, interprets the evaluation results, and decides whether to continue, stop, or update the ground truth. Think of it as a master agent directing specialist sub-agents toward a clearly defined objective.
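To make the division of labor concrete, here's a minimal sketch of the orchestrator's decision step. The function, field names, and thresholds are illustrative stand-ins, not our production code.

```python
from dataclasses import dataclass, field

@dataclass
class EvalReport:
    accuracy: float
    iteration: int
    suspected_label_errors: list = field(default_factory=list)

def decide_next_action(report: EvalReport, target: float = 0.998,
                       max_iterations: int = 15) -> str:
    """Pick the orchestrator's next move from the latest evaluation report."""
    if report.accuracy >= target:
        return "stop"                    # objective reached
    if report.iteration >= max_iterations:
        return "stop"                    # budget exhausted; escalate to a human
    if report.suspected_label_errors:
        return "review_ground_truth"     # separate, human-reviewed step
    return "direct_prompt_agent"         # hand failure patterns to the prompt agent
```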
The Evaluation Harness: Measuring the Right Thing
Before optimizing anything, we needed a way to measure. We built a deterministic evaluation script that:
- Reads a set of BOQ test files with manually annotated ground truth (expected column classifications and row boundaries per file).
- Runs each file through the production prompt using the Gemini model.
- Compares LLM output against ground truth, cell by cell.
- Produces a structured report showing overall accuracy, per-file pass/fail, and critically, which specific cells failed and why.
```python
# Simplified evaluation loop
failures = []
total_cells = 0

for test_file in test_suite:
    ground_truth = load_ground_truth(test_file)
    llm_result = classify_with_prompt(test_file, current_prompt)
    for cell in ground_truth:
        total_cells += 1
        if llm_result[cell.ref] != cell.expected:
            failures.append({
                "file": test_file.name,
                "cell": cell.ref,
                "expected": cell.expected,
                "got": llm_result[cell.ref],
                "context": cell.surrounding_rows,
            })

accuracy = 1 - (len(failures) / total_cells)
report.add_run(prompt_version, accuracy, failures)
```
The verbosity matters. When a prompt change causes a regression, we need to see not just "accuracy dropped 2%" but the exact cell, its content, what was expected, and what the model returned. That level of detail is what makes the next optimization step actionable, whether for a human or an agent.
Every evaluation run is saved as a report, creating an audit trail of what was tried and what happened. This prevents the team from re-attempting approaches that already failed — a surprisingly common time sink in prompt engineering.
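For illustration, a timestamped JSON file per run is enough to build that audit trail. The schema and helper below are a hypothetical sketch, not our exact report format.

```python
import json
import time
from pathlib import Path

def save_run_report(prompt_version: str, accuracy: float, failures: list,
                    out_dir: str = "eval_reports") -> Path:
    """Persist one evaluation run as a timestamped JSON file in the audit trail."""
    Path(out_dir).mkdir(exist_ok=True)
    path = Path(out_dir) / f"{time.strftime('%Y%m%dT%H%M%S')}_{prompt_version}.json"
    path.write_text(json.dumps({
        "prompt_version": prompt_version,
        "accuracy": accuracy,
        "failures": failures,   # keep the cell-level detail, not just the score
    }, indent=2))
    return path
```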
Measuring What Production Actually Depends On
The first version of our eval script scored the LLM output row-by-row. This felt thorough, but it was measuring the wrong thing.
When we traced how the application actually consumes the LLM output, we found it only ever reads three numbers: header_start_row, header_end_row, and data_start_row. These three boundary values are what get written to the database and used to slice every spreadsheet. All the intermediate row-type labels collapse into just those three integers.
This matters because our first eval was giving us false confidence. In some cases, the prompt was misclassifying a row between header types, but both types produced identical boundary values. The eval reported a failure; the app behavior was actually correct. Meanwhile, a genuine regression (where data_start_row was off by one) was going undetected.
We rewrote the eval script to score what production actually reads. Overall accuracy dropped from 97.9% to 94.3% on the same prompt. The prompt hadn't changed. We'd finally started measuring the right thing.
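Concretely, the rewritten scoring compares only the three boundary values the application reads. A simplified sketch (the field names come from the production output; the helper itself is illustrative):

```python
# Score only what production consumes: the three boundary integers.
BOUNDARY_KEYS = ("header_start_row", "header_end_row", "data_start_row")

def score_boundaries(llm_result: dict, ground_truth: dict) -> list:
    """Return a failure record for each boundary value that differs from ground truth."""
    failures = []
    for key in BOUNDARY_KEYS:
        if llm_result.get(key) != ground_truth[key]:
            failures.append({
                "field": key,
                "expected": ground_truth[key],
                "got": llm_result.get(key),
            })
    return failures
```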
The lesson that generalizes: before you optimize, make sure your metric measures what production actually depends on, not what seems like the natural thing to score. This is as true for recommendation engines as it is for document parsers.
The Optimization Loop
With the evaluation harness in place, the orchestrator runs a structured optimization cycle:
- Load context: the prompt engineering agent reads the latest evaluation report, including accuracy scores and specific failure patterns.
- Load optimization techniques: the agent has access to a structured skill covering few-shot example selection, instruction rewriting, and constraint tightening.
- Iterate: the agent modifies the production prompt, runs the evaluation script, reads the new report, and decides whether to continue optimizing or stop.
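Stitched together, a single cycle looks roughly like the sketch below. The two injected callables are placeholders for the prompt engineering agent and the deterministic eval script; the commit message format is illustrative.

```python
import subprocess

def optimization_iteration(modify_prompt, evaluate, iteration: int) -> float:
    """One cycle: the prompt agent edits the prompt, the harness scores it,
    and the result is committed so the trajectory stays in git history."""
    modify_prompt()                  # agent applies few-shot / rule changes
    accuracy, failures = evaluate()  # deterministic harness returns score + detail
    subprocess.run(
        ["git", "commit", "-am", f"prompt v{iteration}: accuracy {accuracy:.1%}"],
        check=True,
    )
    return accuracy
```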
Each iteration produces a commit with the modified prompt and the evaluation results, creating a full git history of the optimization trajectory:
- Prompt v0 (baseline): overall 86.0% | no KEY RULES, minimal few-shot reasoning
- Prompt v1: overall ~95% | renamed output key, added KEY RULES, extended few-shot
- Prompt v2: overall 94.3% | single-vendor fallback rule (boundary scoring)
- Prompt v3: overall 100.0% | hybrid row pattern, section-break rule, empty column rule
- Prompt v4: overall 100.0% | BOQ reference disambiguation, vendor label normalization
- Prompt v7: overall 99.4% | flash-lite optimizations, 2 new examples
- Prompt v8: overall 99.8% | no-vendor pricing group rule (5-run avg, flash model)
- Prompt v10: overall 97.7% | two-row header pattern, anti-mistake reminders (lite, 14 cases)
We use Claude Sonnet for the prompt engineering work itself — capable enough for modification and evaluation interpretation, and the higher quota on Claude Code makes it practical for many iterations. The orchestration layer (deciding when to stop, managing ground truth updates) runs on Claude Opus.
Not All Optimization Techniques Are Equal
Here's a finding that surprised us, and has direct implications for anyone running similar loops.
During one optimization cycle, we hit a persistent failure pattern on test cases involving a two-row header format the model wasn't recognizing. Rather than immediately writing a new few-shot example, we ran a controlled experiment — testing five different prompt interventions against the same baseline failure:
| Approach | Avg Overall (3+ runs) | Notes |
|---|---|---|
| Full few-shot example | 100% (3/3 perfect) | Most reliable; adds ~70 lines to prompt |
| Anti-mistake reminders v2 + inline pattern | 94.3% (5 runs, 2/5 perfect) | Concrete example helped, but not reliable |
| Anti-mistake reminders v1 | 95.2% (3 runs, 1/3 perfect) | Good but stochastic on edge cases |
| Classification procedure (structured CoT) | 90.5% (3 runs, 0/3 perfect) | Step-by-step confused model on related sub-rules |
| Pre-classification checklist | 90.5% (3 runs, 0/3 perfect) | Checklist after examples was weakly followed |
| No prompt changes (baseline) | 82.5% | Systematic failure; pattern not recognized |
The full few-shot example won every time. "Anti-mistake reminders" are targeted negative examples listing common errors. They reached ~95% average without adding a new example, but couldn't get the small model to 100% reliability. The structured checklist approach, which we expected to help, actually made things worse.
The practical implication: for small models, demonstrating a pattern through a concrete few-shot example is the only reliable way to reach 100% on novel failure modes. Rules-only techniques are useful for fast iteration, but they hit a ceiling.
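Because the small model is stochastic, each intervention was scored as an average over repeated runs rather than a single pass. A minimal sketch of that comparison, with the evaluate callable standing in for the harness:

```python
from statistics import mean

def compare_variants(variants: dict, evaluate, runs: int = 3) -> dict:
    """Average accuracy over several runs per prompt variant.

    `variants` maps a label to prompt text; `evaluate` returns a single-run accuracy.
    """
    return {
        label: mean(evaluate(prompt_text) for _ in range(runs))
        for label, prompt_text in variants.items()
    }
```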
What Small Models Actually Need from Prompts
Partway through optimization, we tried to reduce prompt size. The production prompt had grown to ~619 lines (~32,500 chars, ~8,100 tokens), and we assumed some of it was redundant.
We were wrong.
We tested four compression levels against the unmodified prompt: removing footer notes, compressing KEY RULES into telegraphic bullets, and shortening per-example reasoning fields. At every level, accuracy dropped:
| Approach | Size reduction | 3-run avg accuracy |
|---|---|---|
| Original (baseline) | 0% | 100.0% |
| Footer-only removal | 5.7% | 98.5% |
| Notes + preambles removed | 16% | 94.1% |
| Conservative compression | 28% | 90.2% |
| Aggressive compression | 47% | 94.3% |
Even removing just the "Notes: none" footer lines, a 5.7% reduction, introduced stochastic failures that the original prompt avoided. What the failure modes revealed:
- Small models require verbose chain-of-thought. Compressing example reasoning fields from ~153 chars average to ~62 chars average removes the step-by-step reasoning the model depends on. It cannot infer reasoning chains from terse notation.
- KEY RULES must stay in prose format. Converting multi-sentence explanations to abbreviated bullet syntax caused catastrophic failures, including 0% column type accuracy in one run.
- Note preambles are teaching content, not decoration. The explanatory blocks between examples provide advance framing that helps the model correctly interpret what follows.
- Redundancy is intentional reinforcement. The final prompt communicates each critical rule in three places: KEY RULES (definition), Note preamble (contextualization), and Reason field (demonstration). This triple-reinforcement is necessary for smaller models. The apparent verbosity is the optimization.
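In outline, that means each critical rule appears three times in the prompt. The scaffold below is illustrative only (the rule and example are invented for the sketch); the production prompt runs ~619 lines.

```python
# Illustrative scaffold of the triple-reinforcement structure, not the real prompt.
PROMPT_SCAFFOLD = """
KEY RULES:
- Rule 7: A row whose only populated cell is a section title is a section break,
  not a data row. (Full prose definition lives here.)

Note (before Example 3): This file contains a section-break row ("Earthworks")
between two data blocks; watch how it is classified below.

Example 3:
  Input rows: ...
  Output: {"row_types": [...], "data_start_row": 12}
  Reason: Row 11 contains only the text "Earthworks", so per Rule 7 it is a
  section break; the data block therefore starts at row 12.
"""
```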
The corollary for teams choosing model tiers: token savings for small models should come from model upgrades, not prompt compression. If you're on a cost-constrained model, your prompt will need to be longer, more verbose, and tested more carefully. Factor that engineering cost into the model selection decision.
Temperature Is Not Free
We discovered this the hard way. Setting temperature to 0 (deterministic) caused one test case to enter an infinite repetition loop. The model repeated the same sequence of values until hitting the token limit. At T=0.2 the loop persisted. At T=1 it resolved.
The complication: frequencyPenalty isn't supported on Gemini Flash Lite models, so there's no parameter to tune your way out of it. We now run production at T=1 and accept ~2% stochastic variation (occasional formatting glitches, not classification errors) as the cost of avoiding repetition-loop failures.
| Model | T=0 | T=1 | Key tradeoff |
|---|---|---|---|
| gemini-2.5-flash-lite | 100% (repetition loop on 1 case) | 95–100% (stochastic) | Best accuracy; use T=1 |
| gemini-flash-lite-latest | 93.4% (no repetition loop) | 93.4% | Fixes T=0 but loses classification accuracy |
| gemini-2.5-flash | N/A | 99.8% avg (5-run) | Most reliable; higher cost |
We also discovered that model aliases matter. Switching from gemini-2.5-flash-lite to gemini-flash-lite-latest — which resolves the T=0 repetition issue — introduced new systematic failures with different rows misclassified. Same prompt, different checkpoint, different failure modes.
The takeaway: model aliases in any API may represent different checkpoints with meaningfully different behavior. Pin to a specific model version for production, and re-run your eval harness whenever you change it. This applies to any provider, not just Google.
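For reference, the production call is roughly the sketch below, using the google-generativeai SDK; treat the wrapper function and config details as illustrative rather than a copy of our code.

```python
import google.generativeai as genai

genai.configure(api_key="...")  # loaded from an environment variable in practice

def classify_sheet(prompt: str) -> str:
    # Pin the exact checkpoint; "latest"-style aliases can change behavior underneath you.
    model = genai.GenerativeModel("gemini-2.5-flash-lite")
    response = model.generate_content(
        prompt,
        generation_config={"temperature": 1.0},  # T=0 triggered repetition loops for us
    )
    return response.text
```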
The Overfitting Question
The most important challenge in any optimization loop is overfitting. If you optimize a prompt against a fixed test set, you inevitably fit to the idiosyncrasies of that data. The next real-world input from a new contractor breaks things.
We don't have a perfect answer, but here's our current approach:
Manual ground truth evolution. As new document formats arrive, we add them to the test suite and re-annotate. The evaluation set grows over time, representing an increasingly diverse slice of real inputs. It's labor-intensive, but it keeps the test set honest.
Ground truth itself needs maintenance. During our last optimization session, we corrected five ground truth files. These weren't careless annotation mistakes; they were cases where our initial understanding of "correct" behavior was wrong. Running the eval against stale ground truth optimizes the prompt toward incorrect behavior. This creates an important discipline: treat ground truth corrections as a careful, separate step with its own review, not something that happens automatically during optimization runs.
Prompt review for over-specificity. After each optimization run, we manually read the prompt and look for patterns that handle one contractor's exact phrasing rather than the general category. The evaluation script helps catch regressions from this generalization.
Production monitoring as a feedback signal. When a user manually changes a classification the system assigned, that's an implicit signal the model got it wrong. We're working toward piping these corrections back into the test suite automatically, creating a feedback loop where production usage continuously expands the evaluation set.
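A minimal version of that pipeline might look like the sketch below: a user's manual correction becomes a candidate test case that waits for human review before joining the ground truth. The record schema and paths are hypothetical.

```python
import json
from pathlib import Path

def queue_correction(file_id: str, cell_ref: str, model_value: str, user_value: str,
                     queue_dir: str = "candidate_ground_truth") -> Path:
    """Record a production correction as a candidate eval case pending human review."""
    Path(queue_dir).mkdir(exist_ok=True)
    record = {
        "file": file_id,
        "cell": cell_ref,
        "model_said": model_value,
        "user_corrected_to": user_value,
        "status": "needs_review",   # a human confirms before it joins the test suite
    }
    path = Path(queue_dir) / f"{file_id}_{cell_ref}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```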
Our team drew a parallel to recommendation engine monitoring: you track a metric, detect drift, and trigger re-optimization when the metric drops. The difference is that our "ground truth" requires domain judgment. There's still a human in the loop for that part. We don't expect to eliminate that. We expect to make it less frequent.
What This Costs
One practical note on agent economics.
Using a less capable model as the optimization worker cost roughly $20 over ~20 hours of continuous operation. Switching to Claude Sonnet completed the same work in significantly less time. Higher per-token cost, but fewer tokens wasted on retries and misunderstandings. For our team of ~15 engineers, total Claude Code spend currently sits under $100 per person per month.
The real cost question isn't "which model is cheapest per token" but "which model completes the task in the fewest iterations." A model that takes 20 hours of fumbling costs more in wall-clock time and developer attention than one that finishes in 3 hours at a higher token rate.
The same logic applies to the production model choice. Flash Lite costs less per call than full Flash, but prompt engineering for Flash Lite takes more effort. The prompt must be longer, more verbose, and tested more carefully. The total optimization cost (engineering time plus eval compute) often exceeds the per-call savings unless you're running at very high volume.
What's Still Open
This workflow is effective but not fully autonomous. The gaps we're actively working on:
Automated test suite expansion. Adding new document files to the evaluation set still requires manual annotation. We're exploring whether a second LLM (with a different prompt) can generate candidate ground truth for human review, reducing annotation burden without removing human judgment.
Cross-client generalization testing. We don't yet have a formal holdout set — meaning a set of files the optimizer never sees during tuning. One approach we're exploring: using an LLM to generate synthetic out-of-sample variations as a proxy holdout.
Prompt drift detection. After many rounds of optimization, prompts accumulate cruft: examples and instructions that addressed early test cases but no longer contribute. We need a pruning step, but every component currently appears load-bearing, so pruning requires re-running the full eval suite against each candidate removal.
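The pruning step we have in mind is essentially an ablation loop: drop one block, re-run the full eval (averaged, because the small model is stochastic), and keep the removal only if accuracy holds. A rough sketch, with the evaluate callable standing in for the harness:

```python
from statistics import mean

def prune_prompt(blocks: list, evaluate, runs: int = 3) -> list:
    """Greedy ablation: try removing each prompt block; keep removals that don't hurt accuracy.

    `blocks` is the prompt split into removable sections; `evaluate` scores one prompt string.
    """
    baseline = mean(evaluate("\n".join(blocks)) for _ in range(runs))
    kept = list(blocks)
    for block in blocks:
        if block not in kept:
            continue
        candidate = [b for b in kept if b != block]
        score = mean(evaluate("\n".join(candidate)) for _ in range(runs))
        if score >= baseline:
            kept = candidate   # removal is safe; lock it in
    return kept
```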
The Takeaway
If you're deploying AI agents against any structured output task (document parsing, data extraction, classification), the combination of a deterministic evaluation harness and an LLM-powered optimization loop is remarkably effective. The key ingredients are: verbose failure reports (not just accuracy numbers), a metric that measures what production actually depends on, a version-controlled prompt history, and a human who periodically reviews both the prompt and the ground truth.
A few things we didn't expect going in:
- The right evaluation metric changed our apparent accuracy by nearly 4 percentage points without changing the prompt at all.
- Small model prompts hit a minimum viable size. Below it, compression always causes regressions.
- Ground truth is not static. It evolves as your understanding of the task matures, and stale ground truth is as dangerous as a bad prompt.
- Temperature T=0 can cause degenerate outputs on small models that don't support frequency penalties.
We went from 86% to 99.8% accuracy (full Flash model) and 97.7% average (Lite model) using this approach, with Claude Code as the orchestrator and Gemini Flash Lite in production. The total cost of the optimization runs was negligible compared to the engineering time it would have taken to hand-tune the same prompt.
The broader point for any organization deploying AI agents: the difference between an agent that plateaus at "good enough" and one that continuously improves is not model capability. It's the feedback loop. Build the harness first. The optimization follows.
We're hiring engineers who want to solve problems at the intersection of domain knowledge and applied AI. If building self-improving pipelines sounds interesting, check out our open roles at arcfusion.ai/careers.
Get in touch at arcfusion.ai.



