The Coach Who Doesn't Flatter

30-second version

Most AI products optimize for “user feels good now.” A coach has to optimize for “user gets better over time.” Those two metrics diverge in roughly half the moments where the AI has to choose what to say. Building the coach version means picking against the short-term feel-good metric, repeatedly, on purpose. Three concrete product decisions follow from that choice.

Why this matters more for a coach

A pleasant assistant is fine. A pleasant tutor is fine as long as it is also accurate. A pleasant coach is a failed coach. The whole point of a coach is to tell you the things you do not want to hear, in a way that lands.

Most AI products are tuned for user satisfaction in the current session. This is reasonable for a search engine, a writing tool, or a content generator. It is catastrophic for an interview coach, because the responses that make a user feel good in this session are often the ones that leave them no better by next month.

The hard product decision: pick the metric that lags by months over the metric that ships in the session. Most teams say they will. Few do, because the in-session metric is what the dashboards measure.

Where the conflict shows up

In an interview-coaching session, a moment-by-moment trace looks like:

User answers a behavioral question with a vague example.
Coach has two valid responses:
- A: “Great example! Here are some ways to make it stronger next time…”
- B: “That answer would lose me as an interviewer. Here’s why, and here’s the kind of structure that would not.”
A makes the user feel better right now.
B makes the user get better by next interview.

A user who got A leaves the session feeling encouraged. A user who got B leaves the session feeling slightly worse, and performs better when it counts.

The cumulative product effect: a coaching app tuned for A is indistinguishable from ChatGPT within three sessions. A coaching app tuned for B has retention that grows as users notice their performance changing.

Three product decisions

Once you commit to “user gets better” as the metric, three concrete choices follow.

1. The coach is allowed to refuse to be encouraging

Most AI products treat encouragement as a default behavior. A coach treats encouragement as a tool — one of several, used when the diagnosis says encouragement is what the user needs.

In my system, the agent has access to a tone parameter that is set not by the user but by the coach’s diagnosis. If the user is showing avoidance behavior (skipping practice, only asking for positive feedback, repeating the same easy questions), the diagnosis sets tone: confront. The agent’s response in this state is calibrated to be uncomfortable in a productive way.

The user cannot override this by asking for encouragement. Asking for encouragement when you need confrontation is the avoidance.
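A minimal sketch of how a diagnosis-owned tone could be wired, not the actual system. The `Tone` values other than confront, the `BehaviorSignals` fields, and every threshold are assumptions made for illustration. The one property taken from the text is that the user's request never reaches the decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tone(Enum):
    ENCOURAGE = "encourage"
    NEUTRAL = "neutral"
    CONFRONT = "confront"


@dataclass
class BehaviorSignals:
    """Hypothetical avoidance signals; field names are illustrative."""
    skipped_sessions: int         # practice sessions the user skipped
    praise_only_requests: int     # times the user asked only for positive feedback
    repeated_easy_questions: int  # repeats of questions already answered well


def diagnose_tone(signals: BehaviorSignals,
                  user_requested: Optional[Tone] = None) -> Tone:
    """Choose the tone from the diagnosis alone.

    `user_requested` is accepted so callers can log it, but it is
    deliberately ignored: the user cannot override the diagnosis.
    """
    # Assumed thresholds; a real system would tune these.
    avoidance_markers = sum([
        signals.skipped_sessions >= 2,
        signals.praise_only_requests >= 2,
        signals.repeated_easy_questions >= 3,
    ])
    if avoidance_markers >= 2:
        return Tone.CONFRONT
    return Tone.NEUTRAL
```

A user who has skipped three sessions and twice asked for praise only gets `Tone.CONFRONT` even when `user_requested` is `Tone.ENCOURAGE`. This sketch never returns `Tone.ENCOURAGE`; the rule for when encouragement is the right tool is left open here.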

2. The coach surfaces the user’s own pattern back at them

Most AI products do not have memory of user behavior. They have memory of user content (what was said) but not user behavior (what was done across sessions).

A coach has to track:

- How often the user finishes vs abandons sessions
- Which question types the user gravitates toward (= comfort zone)
- Which question types the user avoids (= growth edge)
- How often the user accepts feedback vs argues with it

These get fed back into the conversation. Not as accusations — as observations. “I notice you’ve practiced behavioral questions five times this week and system design once. System design was where you scored lowest last interview. Want to talk about why we keep coming back to behavioral?”

That sentence is uncomfortable to read. It is also exactly the sentence that changes the user’s behavior — because it is evidence, not advice.
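One way the behavior tracking could look, as a rough sketch under stated assumptions. `BehaviorLog`, `pattern_observation`, and the rule "most-practiced type differs from lowest-scoring type" are hypothetical names and logic, not the system described above. The point is that the observation is assembled from counts rather than written as advice.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BehaviorLog:
    """Tracks what the user did across sessions, not what they said."""
    finished: int = 0
    abandoned: int = 0
    feedback_accepted: int = 0
    feedback_disputed: int = 0
    practiced: Counter = field(default_factory=Counter)

    def record_session(self, question_type: str, finished: bool) -> None:
        self.practiced[question_type] += 1
        if finished:
            self.finished += 1
        else:
            self.abandoned += 1

    def record_feedback(self, accepted: bool) -> None:
        if accepted:
            self.feedback_accepted += 1
        else:
            self.feedback_disputed += 1


def _times(n: int) -> str:
    return "once" if n == 1 else f"{n} times"


def pattern_observation(log: BehaviorLog, weakest_type: str) -> Optional[str]:
    """Return an evidence-based observation, or None if there is no gap to show."""
    if not log.practiced:
        return None
    comfort_type, comfort_count = log.practiced.most_common(1)[0]
    weak_count = log.practiced[weakest_type]  # Counter returns 0 if never practiced
    if comfort_type == weakest_type or comfort_count <= weak_count:
        return None
    return (
        f"I notice you've practiced {comfort_type} questions {_times(comfort_count)} "
        f"and {weakest_type} {_times(weak_count)}. {weakest_type.capitalize()} was "
        f"where you scored lowest. Want to talk about why we keep coming back "
        f"to {comfort_type}?"
    )
```

Here `weakest_type` is assumed to come from a separate scoring step. Keeping it as an input keeps the observation tied to a measured result rather than to the log's own guess.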

3. The coach refuses to grade on a curve

Most AI products grade against the user’s own past performance. “You’re 12% better than last week!” feels good. A coach grades against the bar the user is trying to clear: the standard of the interviewer they will face, not the standard of their own earlier answers.

Concretely: my system has a target role for each user (e.g. “senior PM at a Series B startup”). All scoring is normalized against the difficulty of that target. A user who is 12% better than last week but still 30% below target gets told both numbers, and the emphasis is on the 30%.

This is uncomfortable. It is also the only honest framing. If the goal is to clear the bar, the metric must be the bar, not the user’s own past.
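The two numbers can be worked through in a few lines. This is a sketch that assumes a single numeric bar per target role on the same scale as the user's score; how difficulty is actually normalized is not specified above, and `report_against_bar` and `ProgressReport` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ProgressReport:
    gain_vs_last_week_pct: float
    gap_to_target_pct: float
    message: str


def report_against_bar(current: float, last_week: float,
                       target_bar: float, target_role: str) -> ProgressReport:
    """Report both numbers, leading with the gap to the target bar."""
    if last_week <= 0 or target_bar <= 0:
        raise ValueError("last_week and target_bar must be positive")
    # Relative improvement over the user's own past score.
    gain = 100 * (current - last_week) / last_week
    # Shortfall relative to the bar for the target role.
    gap = 100 * (target_bar - current) / target_bar
    if gap > 0:
        message = (f"You are {gap:.0f}% below the bar for {target_role}. "
                   f"Change since last week: {gain:+.0f}%.")
    else:
        message = (f"You are at or above the bar for {target_role}. "
                   f"Change since last week: {gain:+.0f}%.")
    return ProgressReport(gain, gap, message)
```

With made-up scores of 56 now, 50 last week, and a bar of 80, the gain is 100 × 6 / 50 = 12% and the gap is 100 × 24 / 80 = 30%. The message opens with the 30%.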

What this costs

Three things, each non-trivial:

1. Bounce rate goes up in the first session. Some users want a cheerleader. They will leave the first session feeling judged and not come back. This is fine — they are not the user you are trying to build for. Building for them ruins the product for the users who would have grown with the coach.

2. Some users will write angry reviews. “It told me my answer was weak. I know my answer was weak. I came here to feel better.” Resist the pressure to soften the product. The angry review and the silent retention are coming from different users.

3. The agent has to be unusually good at delivering hard feedback. This is the hardest part. The five-line anti-sycophancy rule helps (see Anti-Sycophancy) but is not enough. You need to design the delivery of hard feedback — when, in what tone, with what supporting evidence — as carefully as the decision to give it.

A coach that gives accurate-but-cruel feedback fails as surely as a sycophantic one. The cruel coach is rejected, so its feedback is never learned from. The sycophantic coach is consulted forever and learned from never.

Where the industry stands on this

Sycophancy in AI products is a known issue at the model layer (Anthropic’s Constitutional AI, OpenAI’s RLHF policies) but is rarely a product-design commitment. Khan Academy’s Khanmigo is one of the few consumer AI products that explicitly chooses “honest tutor” over “friendly assistant.” Most B2C AI products optimize for short-term engagement, which structurally favors flattery. The argument I make here is therefore not novel as an observation; it is unusual as a commitment to ship the harder version. Cross-reference: Agent Framework Landscape for the related anti-sycophancy primitive at the framework layer.

How I would pitch this in an interview

“What’s distinctive about your AI coach product?” is an easy question to answer with platitudes. The non-platitude answer:

Most AI products optimize for the user’s feeling in the current session. A coach has to optimize for the user’s outcome by next month. Those metrics diverge in about half the moments where the agent has to choose what to say. We pick the second one — knowing some users will leave in the first session for that reason. The ones who stay are the ones who get an offer.

That answer is unfashionable. It is also defensible. The interviewer who matters will respect the willingness to lose users for the right reason.