How GC.AI Closes the Eval Loop with Raindrop Workshop

"The old loop was: run evals, wait, dig through traces, patch the issue, rerun, repeat. With Workshop, my coding agent is in the loop while the run is still happening. It can read the trace, see what broke, make the fix, and check the next result against GC.AI's legal-quality evals."
Brian Rhindress
AI Engineer, GC.AI

GC.AI is building the next generation of in-house legal tooling, working with customers like Wayfair, Vercel, and Time. To ship a system attorneys can trust, the team employs a rigorous suite of evals.

GC.AI built a benchmark from a set of realistic legal tasks. LLM judges score every response across key in-house legal competencies like accuracy, issue identification, and risk assessment. GC.AI's auto-research pipeline clusters the scores across all traces to surface regressions so the team can make changes to prompts and tools.
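As a rough illustration of how per-competency judge scores can surface a regression between two runs, here is a minimal Python sketch. It is not GC.AI's pipeline: the record shape (`trace_id`, `competency`, `score`), the 0-to-1 score scale, and the `tolerance` threshold are all assumptions, and it uses simple per-competency averaging rather than the clustering the team describes.

```python
from collections import defaultdict
from statistics import mean


def mean_by_competency(records):
    """Average judge scores per competency (hypothetical record shape)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["competency"]].append(rec["score"])
    return {name: mean(scores) for name, scores in buckets.items()}


def find_regressions(baseline, candidate, tolerance=0.05):
    """Return competencies whose mean score dropped by more than `tolerance`."""
    base = mean_by_competency(baseline)
    cand = mean_by_competency(candidate)
    return {
        name: (base[name], cand[name])
        for name in base
        if name in cand and base[name] - cand[name] > tolerance
    }


# Illustrative data only: two traces, two competencies, two runs.
baseline_run = [
    {"trace_id": "t1", "competency": "accuracy", "score": 0.9},
    {"trace_id": "t2", "competency": "accuracy", "score": 0.8},
    {"trace_id": "t1", "competency": "risk_assessment", "score": 0.7},
    {"trace_id": "t2", "competency": "risk_assessment", "score": 0.7},
]
candidate_run = [
    {"trace_id": "t1", "competency": "accuracy", "score": 0.6},
    {"trace_id": "t2", "competency": "accuracy", "score": 0.7},
    {"trace_id": "t1", "competency": "risk_assessment", "score": 0.7},
    {"trace_id": "t2", "competency": "risk_assessment", "score": 0.75},
]

regressions = find_regressions(baseline_run, candidate_run)
print(sorted(regressions))  # competencies that regressed in the candidate run
```

In this toy data only `accuracy` drops beyond the tolerance, so it is the only competency reported; a real pipeline would likely also account for sample size and judge noise.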

The problem: iterating was slow and manual

Trace data lived in the cloud. The team wrote scripts to pull traces into an in-house viewer. Issues only surfaced after a full batch had finished, which often took hours to run.

Auditing the runs and the LLM judges was also manual. Every judge run had to be examined to confirm correct inputs and scoring, across thousands of chats.

"The process was slow. When a new model came out or we built new features, we had to inspect and dig through tons of traces."
Brian Rhindress
AI Engineer, GC.AI

Enter Raindrop Workshop

Raindrop Workshop is a local-first trace debugger for AI agents. Every LLM call, tool invocation, and judge result streams to the developer's machine in real time. A bundled MCP server exposes the same trace store directly to coding agents.

Now coding agents can read traces, find issues, test fixes, and push changes.

"Workshop is giving me and the agent the ability to look at evals without having to actually open the UI. The traces are right there. It actually closes the loop during development."
Brian Rhindress
AI Engineer, GC.AI

What changed: a new development loop

Workshop helped GC.AI tighten its benchmark eval loop in four concrete ways:

1. Caught real bugs

GC.AI caught config issues across model runs, along with a model-provider error pattern at the SDK integration layer. Both surfaced in the live trace stream, with no need to wait for a batch to finish to see that something was off.

"I usually wait until the end to do judging. But since traces are coming in one at a time, I can start to look at things as they come in and detect issues and plan next steps while it's happening."
Brian Rhindress
AI Engineer, GC.AI
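The idea of spotting a recurring error while traces are still arriving, rather than after the batch, can be sketched in Python as follows. This does not use Workshop's API: the trace dictionaries, the `model` and `error` fields, and the `threshold` value are hypothetical stand-ins for whatever a live trace stream actually delivers.

```python
from collections import Counter


class ErrorPatternWatcher:
    """Counts error signatures as traces arrive and flags repeats early."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = Counter()
        self.flagged = set()

    def observe(self, trace):
        """Feed one trace; return a signature the first time it hits the threshold."""
        error = trace.get("error")
        if error is None:
            return None
        signature = (trace.get("model"), error)
        self.counts[signature] += 1
        if self.counts[signature] >= self.threshold and signature not in self.flagged:
            self.flagged.add(signature)
            return signature
        return None


# Illustrative stream: the same provider error repeats for one model.
incoming = [
    {"id": 1, "model": "model-a", "error": None},
    {"id": 2, "model": "model-b", "error": "rate_limited"},
    {"id": 3, "model": "model-b", "error": "rate_limited"},
    {"id": 4, "model": "model-a", "error": None},
    {"id": 5, "model": "model-b", "error": "rate_limited"},
    {"id": 6, "model": "model-b", "error": "rate_limited"},
]

watcher = ErrorPatternWatcher(threshold=3)
alerts = []
for trace in incoming:  # in practice this loop would consume a live stream
    hit = watcher.observe(trace)
    if hit is not None:
        alerts.append((trace["id"], hit))

print(alerts)
```

Here the pattern is flagged once, on the third matching trace, so a long batch could be stopped or reconfigured well before it completes.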

2. Hill-climbing using benchmark data

GC.AI also used Workshop to iterate on prompt and tool changes against its benchmark. The team pushes a change, runs a small batch of holdout evals, and inspects each trace as it returns, all from inside a coding agent.

3. Validating the eval pipeline itself

Workshop doubles as a validation layer for GC.AI's eval pipeline. The team used the MCP integration to pull judge runs from different traces side by side, confirm that scoring is consistent, and compare full eval runs against each other.

"It's a validation mechanism. I've used it to pull judges from different traces next to each other and make sure the judges are working. I've used it to compare evals from different runs side by side."
Brian Rhindress
AI Engineer, GC.AI
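One way to picture the side-by-side check is a small Python sketch that lines up two judge runs and lists the items where their scores diverge. The record shape, the 0-to-1 scale, and the `max_gap` threshold are assumptions for illustration, not Workshop's or GC.AI's actual data model.

```python
def judge_disagreements(run_a, run_b, max_gap=0.2):
    """Return (trace_id, competency, score_a, score_b) rows where scores diverge."""
    scores_b = {(r["trace_id"], r["competency"]): r["score"] for r in run_b}
    rows = []
    for r in run_a:
        key = (r["trace_id"], r["competency"])
        if key not in scores_b:
            continue  # only compare items that were judged in both runs
        if abs(r["score"] - scores_b[key]) > max_gap:
            rows.append((*key, r["score"], scores_b[key]))
    return rows


# Illustrative data: the same traces scored by two separate judge runs.
run_a = [
    {"trace_id": "t1", "competency": "accuracy", "score": 0.9},
    {"trace_id": "t2", "competency": "accuracy", "score": 0.8},
    {"trace_id": "t3", "competency": "issue_identification", "score": 0.5},
]
run_b = [
    {"trace_id": "t1", "competency": "accuracy", "score": 0.85},
    {"trace_id": "t2", "competency": "accuracy", "score": 0.4},
]

disagreements = judge_disagreements(run_a, run_b)
print(disagreements)
```

Rows that come back are candidates for a manual look at the judge's inputs and rationale; an empty list suggests the two runs scored the shared traces consistently within the chosen gap.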

4. Debugging without ever leaving Claude Code

Workshop's MCP server means AI engineers never have to leave Claude Code. The GC.AI team could simply ask Claude to root-cause issues in traces, inspect parameters, and suggest fixes.

"It's nice not having to manage traces myself. I can go deep into a trace, review system prompt composition, tool and skill invocations, and make sure the inputs and outputs are what I expected."
Brian Rhindress
AI Engineer, GC.AI

The outcome: a self-healing development loop

With Workshop, the coding agent runs the entire eval loop from a single session: inspect a trace, find the regression, write a new eval to reproduce it, propose a fix, verify against the next run, and push the change.

About Raindrop Workshop

Raindrop Workshop is a local-first, open-source debugger for AI agents. Every LLM call, tool invocation, and reasoning step streams live to a browser UI on the developer's machine, and an MCP server exposes the same traces to coding agents. The coding agent can inspect traces, write evals, run them, propose fixes, and push the change back, all from a single agent session.

Install with bash:

curl -fsSL https://raindrop.sh/install | bash