Teaching My AI to Think Twice: What I Learned Building a Self-Correcting Summarization System


I recently built a self-correcting summarization workflow using LangGraph, and it turned out to be one of those projects that teaches you as much about system design as it does about AI.

The application implements an iterative refinement loop with three agents:

  • A Summarizer that generates the initial draft.
  • A Judge that evaluates quality and provides feedback.
  • A Refiner that incorporates the critique.

The loop continues until the Judge approves the output. I deployed it with Flask on Render, so users can submit text and watch the refinement happen in real time.
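Stripped of the LangGraph specifics, the control flow is a simple loop. Here is a minimal sketch with stubbed-out LLM calls; the function bodies and the MAX_ITERATIONS safety cap are my own illustration, not the actual implementation:

```python
# Sketch of the Summarizer -> Judge -> Refiner loop with stubbed LLM calls.
# MAX_ITERATIONS is a hypothetical guard against the Judge never approving.
MAX_ITERATIONS = 3

def summarize(text: str) -> str:
    return text[:100]  # stub: the Summarizer's initial draft

def judge(summary: str) -> tuple[bool, str]:
    # stub: the Judge returns (should_refine, feedback)
    return (len(summary) > 50, "Tighten the opening sentence.")

def refine(summary: str, feedback: str) -> str:
    return summary[:50]  # stub: the Refiner applies the critique

def run(text: str) -> str:
    summary = summarize(text)
    for _ in range(MAX_ITERATIONS):
        should_refine, feedback = judge(summary)
        if not should_refine:
            break  # the Judge approved the draft
        summary = refine(summary, feedback)
    return summary
```

In the real system each node is a LangGraph node and the `if not should_refine` branch becomes a conditional edge, but the shape of the loop is the same.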

What Worked Well

The most satisfying part was seeing the state machine come together. LangGraph's node-based architecture forces you to think explicitly about state transitions. Each node produces output and determines the next step.

Using Pydantic for structured outputs made a huge difference. By defining a schema with should_refine: bool and feedback: str, I turned the LLM from something unpredictable into a reliable component. Type safety makes the whole system easier to reason about.

I also appreciated how LangGraph handles the orchestration layer. You're not manually managing callbacks; you're just defining nodes and edges. It keeps the code clean and lets you focus on the business logic.

The Hard Parts

State management was trickier than I expected. Each node needs to explicitly return the keys you want to preserve, and I initially had a node that was overwriting the entire state instead of merging updates. The symptom was subtle: everything would work until the final node, which would fail with "summary not found". The lesson was clear: implicit state updates don't work in these architectures.
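The bug is easy to reproduce in plain Python. This is a sketch of the pattern, not LangGraph's actual reducer machinery: the fix is treating a node's return value as a partial update to merge, never as a replacement for the whole state.

```python
def judge_node(state: dict) -> dict:
    # A node should return only the keys it changes.
    return {"feedback": "Too long."}

state = {"text": "original input", "summary": "draft v1"}

# Buggy wiring: the node's return value replaces the whole state,
# so "summary" vanishes and a later node fails with "summary not found".
broken = judge_node(state)

# Correct wiring: merge the partial update into the existing state.
state = {**state, **judge_node(state)}
```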

The other challenge was enforcing structured output from the LLM. Even with clear prompts, the Judge would occasionally return extra commentary outside the JSON schema, breaking the parser. I switched to calling with_structured_output() on the Gemini model, which enforces the schema at the API level and solved the problem completely.
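The failure mode is easy to demonstrate with a naive parser: a single token of commentary outside the JSON is enough to break json.loads. The strings below are illustrative, not actual Judge output.

```python
import json

# What the Judge was supposed to return:
clean = '{"should_refine": false, "feedback": "Looks good."}'

# What it would occasionally return instead:
chatty = 'Sure! Here is my verdict: {"should_refine": false, "feedback": "Looks good."}'

json.loads(clean)  # parses fine

try:
    json.loads(chatty)
except json.JSONDecodeError:
    pass  # the extra commentary breaks a naive parser
```

API-level structured output sidesteps this entirely: the schema is enforced before any text reaches your parser, so the chatty variant can never occur.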

What I Learned

  • Structured validation scales. Pydantic schemas might feel like overhead initially, but they pay dividends when you're debugging or extending the system.
  • State machines clarify control flow. LangGraph's explicit state management made debugging much easier. When something broke, I could trace exactly where data was being lost or transformed incorrectly.
  • Production reveals assumptions. Deploying to Render exposed timing issues, request timeouts, and serialization quirks that never showed up locally.
  • AI workflows are still software. The LLM is a component, but the real work is in validation, error handling, and system design.

Next Steps

Now that it's stable, I'm adding instrumentation to track performance:

  • Logging disagreement rates between the Judge and Summarizer.
  • Recording refinement loop counts and feedback patterns.
  • Adding async support for batch processing.

The broader takeaway is that iteration matters more than getting everything right the first time.