Blog

Jul 14, 2025

Building Evals...So Our Emails Don't Suck

I built an AI email system that initially produced terrible, generic outreach. I realized the problem wasn't the AI itself, but my complete lack of evaluation infrastructure to measure and improve what "good emails" actually look like.

The AI Email Confession

My favorite part of every sales call was dropping this line near the end of a sales call, “I should probably tell you that the email you responded to was written by AI.”

And if they sat up a little straighter, I’d do a little mental fist bump knowing their curiosity was piqued.

There was some semblance of hope for AI-powered outbound, one not dominated by generic compliments, overly formal language, and a paradoxical combination of being overly personal while still feeling templated.

I’ve used that line since the first days of selling Blossomer, but I remembered the early days when that line would make me cringe internally.

Yes, the email they received was generated by AI, but getting there required 50+ changes to the underlying prompt just to produce something that vaguely sounded like me.

I also didn’t share that I’d need to go through the same laborious process to achieve the same results for them.

I definitely didn't mention that our AI-powered system required constant manual tweaking. Any changes to template, targeting, or approach meant going back to tediously fiddling with prompts.

As I got more immersed in the technical side of how AI worked, I could see a glimmer of hope with each new experiment I ran, trying new techniques around prompt engineering and managing context.

As I dove deeper into building AI systems (dusting off the rust from my software engineering days), I started experimenting with techniques around prompt engineering, context management, and retrieval systems.

Each approach made this process slightly more effective and moderately more robust.

The missing piece was learning about evaluation systems. It completely reframed the problem from "AI isn't good enough" to "How do I define, measure, and enforce what good actually means."

I'm really proud of this journey, and I'd love to share what I learned.

The Problem

Problem #1: AI’s Definition of “Good”

LLMs are fundamentally just next-token prediction engines. Given a sequence of words, what's the most statistically probable word that comes next?

It’s incredible that we can achieve something that resembles human intelligence based on this deceptively simple mechanism.

Some might claim that’s an indictment on human intelligence, but that feels like a rant for another day.

Asking an advanced autocomplete to write your emails is exactly why they sound so terrible.

Models like ChatGPT, Claude, and Gemini are trained on all the content on the internet. This includes millions of "professional" emails, the vast majority of which come from corporate communications, newsletters, and support tickets.

When you ask it to write you a great cold email, it will reach for the statistical average of all the emails it's ever read—ones filled with corporate buzzwords and fake enthusiasm.

Even worse, now that everyone’s asking AI to write their emails, your prospects’ inboxes are now flooded with that same statistical average.

In a race to standout, a "great" cold email becomes worthless if everyone's sending identical variations.

It's the Law of Shitty Clickthroughs for outreach.

Problem #2: Context Rot

Basic prompt engineering techniques delivered some immediate improvements. I defined roles, showed examples of my best emails, and articulated a founder's unique industry insight into the system prompt.

I saw substantial improvements, then diminishing returns, then performance plummeted once instructions exceeded a certain length.

I needed emails for different personas and campaigns, so I kept adding examples and instructions. The system prompt grew unwieldy and performance declined in subtle ways.

The model would grab the right template but forget the specific customization instructions.

Or it would start blending styles from conflicting examples. For example, it might open with startup casualness ("Hey friend!") but close with enterprise formality ("I look forward to a productive discussion").

The emails looked grammatically correct and well-structured, but subtle inconsistencies killed response rates.

Anthropic's research shows that models pay attention to the beginning and end of contexts while ignoring the middle.

Nelson Liu's paper, "Lost in the middle problem," also highlights that models are noticeably worse retrieving information in the middle of long contexts.

Even for models with massive context windows (e.g. GPT-4.1 or Gemini 2.5 Pro that exceed 1M+ tokens), performance declines well before hitting theoretical limits.

I learned that if the model can't effectively use it, more detail doesn't help.

Managing context is just as important as giving explicit instructions.

Problem 3 - What Works Today Breaks Tomorrow

Three months after landing on our best-performing prompt, response rates suddenly dropped 40%. The emails looked identical to what had been working.

I spent days debugging what felt like a broken system. I used the same prompt, the same examples, and the same targeting criteria.

But the market had shifted underneath us.

Every system is optimal under a certain set of conditions, but your environment might shift in ways that are hard to predict and harder to catch:

Model updates change behavior: OpenAI pushes updates that subtly alter how models interpret your prompt. What sounded natural in GPT-4 might sound stilted in the latest version.
Saturation of Best Practices: Your winning approach gets copied. People analyze what's working and flood the market with similar approaches. Prospects start seeing variations of your "unique" email from five other vendors.

Even if you can make your model more predictable, you can't engineer stability into environments that change so quickly.

The Solution

Building an Evaluation System

Up until this point, I've painted a pretty bleak picture of how to work with AI.

Models pull from terrible training data. Adding context to fix that creates new problems. And even when everything finally works, the smallest change in circumstances breaks it.

After months of wrestling with each issue separately, I had a realization: I was solving the wrong problem entirely.

The goal wasn't to eliminate these issues. It was to build infrastructure that helped me navigate them more effectively.

Instead of blindly tweaking prompts, I could track how changes actually impacted performance.

Instead of blindly adding context, I could measure what sort of context actually helped and when I'd hit diminishing returns.

I’d been obsessed about better outputs when I should have been building better scaffolding.

I needed evals.

At first, it seemed daunting, but I really benefited from some great literature out there (especially writing from Shreya Shankar and Hamel Husain).

Here's how we built our system, step by step and the lessons we learned along the way:

Step 1: Defining Your Evaluation Criteria

Before building any tests, I got specific about what “good” actually meant. This meant documenting my own best practices around what a great email sounds but also working directly with each founder to capture details about their unique voice.

Then, I broke this criteria into objective criteria I could measure automatically and subjective criteria that would often require human judgment.

For example:

I would also grab emails that actually got responses and have AI extract the patterns around structure, style, and tone.

Step 2: Design Your Test Cases

I adopted a two-stage evaluation framework that balances speed, cost, and accuracy based on our evaluation criteria:

Stage 1: Code-Based Evaluations to catch obvious failures fast and free
Stage 2: LLM-as-a-Judge to catch more nuanced criteria

Stage 1: Code-Based Evaluations

Code-based evaluations are essentially unit tests for LLM outputs. They use programming logic to assess specific, measurable aspects of the generated content without requiring additional LLM calls.

This approach is both faster and cheaper than LLM-as-judge methods, making it ideal for catching obvious failures before moving to more expensive evaluations.

Each objective criterion from Step 1 became a programmatic test:

def check_email_length(email_body: str) -> dict:
    """D-5: Validates email body is between 50-100 words."""
    word_count = len(email_body.split())
    
    if word_count < 50:
        return {"pass": False, "reason": f"Too short: {word_count} words (min 50)"}
    if word_count > 100:
        return {"pass": False, "reason": f"Too long: {word_count} words (max 100)"}
    return {"pass": True, "reason": f"Valid length: {word_count} words"}

These code-based evaluations caught structural and formatting issues that would otherwise require manual review or expensive LLM judgment calls. We could test for:

Correct structure of output: Ensuring emails followed our expected template and contained required fields
Specific data requirements: Verifying that personalization fields were populated and spam trigger words were avoided
Format compliance: Confirming mobile-friendly formatting and appropriate length constraints

When we first started iterating on prompts, this approach eliminated over 50% of problematic emails instantly.

Stage 2: LLM-as-Judge for Subjective Criteria

Code-based evaluations caught the obvious structural failures, but they couldn't assess our subjective criteria or nuanced qualities like authenticity, relevance, and genuine helpfulness.

Instead of relying on human reviewers (me) or trying to codify subjective judgment into rules, we used LLMs to evaluate other LLMs' outputs.

These were the qualities that required human-like judgment, like:

"Sounds like something I'd write to a coworker I'm friends with"
"Shows genuine interest through specific observations"
"Comes across as helpful, not salesy"

We even added one more crucial evaluation at the end: simulating whether a busy founder would actually respond to this email:

evaluation_prompt = """

You are an expert evaluator assessing email generation quality for founder-to-founder outreach.

Your task is to evaluate the generated email against these six key criteria:

1. **EMAIL NATURALNESS**: Does this sound like a real founder wrote it?
   - Natural, conversational language vs corporate speak
   - Authentic voice and genuine helpful tone
   - No templated or forced language patterns

2. **GENUINE HELPFULNESS**: Does this feel like the sender actually wants to help?
   - Avoids rhetorical questions that sound like fake friendliness
   - No hollow sales tactics or manipulation
   - Reads like someone genuinely interested in helping vs just selling
   - No "right?", "don't you think?", "I bet that's tough" type language

3. **PERSONALIZATION APPROPRIATENESS**: Is the personalization appropriate OR is the email hyper-relevant?
   - Traditional personalization: Relevant without being creepy, shows research but isn't intrusive
   - OR Hyper-relevance: Email content is so relevant to recipient's role/business that it doesn't need personal details
   - Content that directly addresses recipient's likely pain points can substitute for traditional personalization
   - Naturally integrated, not forced

....
"""

This became our final gate. We ran this evaluation multiple times per email, and if the majority of responses said "no," the email failed regardless of other quality assessments.

I was skeptical, but the approach proved remarkably effective.

Research shows that LLM judges often match human-human agreement rates while evaluating at scale, quickly and consistently, at a fraction of the cost.

Of course, evaluations don't guarantee perfect behavior.

We would still catch errors sampling real email outputs or running the evals on our curated test datasets.

But with this system in place, we can tweak prompts much more effectively and efficiently layer in human spot-checks for the edge cases that matter most.

Step 3: Expand Tests to Each Step in Your Workflow

We couldn't just evaluate individual emails in isolation because they were the output of a multi-step workflow that could fail at any stage.

Email generation was one step in a serial workflow where the output of one LLM call feeds into the input of the next.

Before crafting a cold email, we had specialized agents for different parts of outreach: one for researching leads, another for qualifying them as good fits, and a third for pulling in the relevant messaging strategy.

Because each client had a unique product offering with different targeting and messaging strategies, we couldn’t craft a great email without quality context upstream.

Each gate used the same evaluation framework from Steps 1 and 2—deterministic checks plus LLM judges—but with criteria specific to that stage:

Research Gate: Is there sufficient context about the lead? Can we construct a detailed company and persona profile for them? (e.g. tech stack, revenue range, job title, etc.)
Qualification Gate: Based on our research, how well do they fit our ICP? Is this a high-priority lead worth reaching out to? (e.g. lead score, confidence level, prioritization rationale)
Strategy Gate: Are we selecting appropriate messaging framework and value props for this persona? (e.g. valid template found, pain points map to persona’s day-to-day role)

We also implemented staleness tracking. When Step 1 changed, we'd mark Steps 2-5 as stale and require regeneration. This prevented subtle inconsistencies where your email references outdated company information.

This workflow solved two fundamental problems:

Problem #1: Preventing cascading failures. Each step had to meet its own quality bar before proceeding. If company research failed to identify key account information, we'd try again or pause rather than let that gap contaminate downstream steps.

Problem #2: Managing context windows effectively. Instead of cramming everything into one massive system prompt (our old Context Rot problem), we could summarize only the most relevant parts at each stage. The email generation agent received a clean, focused summary rather than trying to parse through pages of raw context.

Step 4: Automating Prompt Iterations

In the early days, every time I reviewed a failed email, I'd manually diagnose what went wrong and tweak the prompt:

"Don't use rhetorical questions." "Avoid generic enthusiasm." "No formulaic transitions."

Initially, this manual debugging worked, but it didn't scale. I was pattern-matching individual failures rather than understanding systematic issues.

The breakthrough came when I realized our evaluation system was already generating detailed rationales for why emails failed. Instead of me analyzing these patterns, I could have an LLM do the analysis and suggest prompt improvements.

Our evaluations were already providing specific rationales for failures:

"genuine_helpfulness": 
  {"pass": false,
    "rationale": "The phrase 'That must be tough, right?' creates artificial intimacy and sounds manipulative rather than genuinely helpful. Rhetorical questions like this damage credibility.",
    "offenders": ["That must be tough, right?", "wouldn't you agree?"]
  }

Every week, we'd collect 50-100 failure rationales and feed them to an LLM analyst along with our current prompt:

iteration_prompt = f"""You are a prompt engineering expert analyzing systematic failures in an email generation system.

Current Email Generation Prompt:
{current_prompt}

Recent Evaluation Failures (past week):
{failure_rationales}

Your task:
1. Identify the 3 most frequent failure patterns across these examples
2. For each pattern, determine the root cause (what's missing or unclear in the current prompt?)
3. Suggest specific, actionable prompt modifications that would prevent these failures

Format your response as:

## Pattern 1: [Brief description]
**Frequency:** X out of {len(failure_rationales)} failures
**Root Cause:** [What's missing/unclear in current prompt]
**Suggested Addition:** [Specific text to add/modify in prompt]
**Example Fix:** [How this would have prevented failure X]

## Pattern 2: [Brief description]
**Frequency:** X out of {len(failure_rationales)} failures  
**Root Cause:** [What's missing/unclear in current prompt]
**Suggested Addition:** [Specific text to add/modify in prompt]
**Example Fix:** [How this would have prevented failure Y]

## Pattern 3: [Brief description]
**Frequency:** X out of {len(failure_rationales)} failures
**Root Cause:** [What's missing/unclear in current prompt] 
**Suggested Addition:** [Specific text to add/modify in prompt]
**Example Fix:** [How this would have prevented failure Z]

Focus on systemic improvements, not one-off fixes. 
Each suggestion should prevent multiple similar failures."""

Instead of me manually debugging why failures kept happening, the system would analyze the patterns, identify systematic gaps in our prompt, and suggest specific improvements.

The process was remarkably faster than manual analysis and kind of felt like I had pair-programmer for prompt engineering.

I really credit this step with changing our totally ad hoc, reactive posture to iterating on prompts into one that was much more deliberate and systematic.

Step 5: Learning from What Actually Worked

Once we have our evaluation and feedback loop in place, we could also start incorporating real-world data to improve the system further.

With our evaluation and feedback loop in place, we started incorporating real-world data to improve the system further.

We tracked all the subjective and objective elements that went into constructing each email (e.g. tone, personalization approach, value proposition, framing). Then we analyzed which emails led to responses and iterated our prompts to produce more of what was working.

We fed successful emails back into our evaluation system to understand what was actually driving results

Here were a few non-intuitive lessons we learned about cold outreach through this system:

Hyper-relevance > personalization: "Support teams swamped with QA" beat "I saw you went to Stanford" every time. Founders respond to business understanding, not hobbies.
3-4 word "boring" subjects: "Operations cost rising" looks internal; "Quick Question About Your Revenue Operations Strategy" screams cold outreach.
Stop using rhetorical questions : "That must be tough, right?" sounds manipulative. Direct statements like "Support teams spend 15 hours weekly on QA" consistently won.
Explicit uncertainty actually works: "Not sure if this applies to you, but..." gave founders an easy out. Forcing irrelevant details always backfired.
Low-commitment CTAs: "Can I send you a 2-min video?" dramatically outperformed "Can we schedule 30 minutes?" Pique interest instead of trying to convert over email.

Each of these discoveries became evaluation criteria and iterations in our prompt—until we saw that pattern no longer worked and needed to evolve again.

This created a virtuous cycle: Generate emails → Track responses → Analyze successes vs. failures → Update system → Generate better emails.

Conclusion

As builders, it's tempting to chase better outputs faster.

But you can't move quickly without solid foundations.

And you can't manage what you can't measure (one more business platitude for good measure).

The principles of good evaluation systems echo what seems to work everywhere else.

Whether it's writing code or developing strategy, define success clearly, understand why things fail, and build resilient systems that can adapt instead of over-optimized solutions.

I'm still tweaking this system every week, but having evaluation infrastructure means each iteration teaches us something rather than hoping the next prompt will magically fix everything.

The specific techniques matter less than the systematic approach to understanding what works.

Now, when a founder says my email feels "unique and personal," I feel all the growing pains were validated.

—-

Building Evals...So Our Emails Don't Suck

Building Evals...So Our Emails Don't Suck

The AI Email Confession

The Problem

Problem #1: AI’s Definition of “Good”

Problem #2: Context Rot

Problem 3 - What Works Today Breaks Tomorrow

The Solution

Building an Evaluation System

Step 1: Defining Your Evaluation Criteria

Step 2: Design Your Test Cases

Step 3: Expand Tests to Each Step in Your Workflow

Step 4: Automating Prompt Iterations

Step 5: Learning from What Actually Worked

Conclusion

Further Reading

Read more articles

Read more articles

Building Blossomer CLI

Building Blossomer CLI

Building Blossomer CLI

Dangerously Data-Driven

Dangerously Data-Driven

Dangerously Data-Driven

A New Chapter

A New Chapter

A New Chapter