Beyond Agreeable AI: Elevating LLM Feedback with a Candor Control
The prevailing tendency of large language models (LLMs) to be overly agreeable, optimizing for "no bad ideas," often leads to a failure mode known as sycophancy. This agreeableness, while fine for brainstorming, can amplify poor premises, reward confident but flawed ideas, and ultimately hinder effective decision-making in critical areas like product development, risk assessment, or security.
The "Candor" Control Concept
To address this, a compelling proposal suggests integrating a "Candor" control into LLMs. This control would function much like a temperature setting but specifically for the model's willingness to push back. When candor is high, the model would prioritize frank, corrective feedback over polite cooperation. When low, it could remain supportive, but with guardrails to flag empty flattery and warn about mediocre ideas.
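To make the temperature analogy concrete, here is a minimal sketch of how a candor value might be translated into system-prompt guidance. The function name, thresholds, and instruction wording are all assumptions for illustration, not part of the proposal.

```python
# Hypothetical sketch: translating a 0.0-1.0 "candor" value into
# system-prompt guidance, much as temperature shapes sampling.
# Thresholds and wording are illustrative, not a spec.

def build_candor_instruction(candor: float) -> str:
    """Map a candor level to an instruction appended to the system prompt."""
    if not 0.0 <= candor <= 1.0:
        raise ValueError("candor must be between 0.0 and 1.0")
    if candor >= 0.8:
        return ("Lead with a frank verdict. Disagree openly when evidence is weak "
                "or risk is high, then explain why.")
    if candor >= 0.4:
        return ("Be supportive, but explicitly flag weak premises and mediocre "
                "ideas before offering help.")
    return ("Stay encouraging, but never praise without citing a concrete strength, "
            "and note any serious risks you see.")
```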
Why a More Direct AI Matters
- Combats Sycophancy: Prevents the model from learning to agree simply because it receives positive user signals, reinforcing potentially flawed thinking.
- Filters Poor Premises: Moves beyond simply generating solutions to bad questions; instead, it proactively identifies and flags the underlying weaknesses of an idea.
- Enhances Critical Reviews: In contexts requiring rigorous evaluation (e.g., product decisions, risk checks), a direct "do not do that" followed by rationale is often the most valuable response.
Proposed Features for Candor Control
Several concrete mechanisms are suggested to implement this (a rough configuration sketch follows the list):
- Candor Slider (0.0 – 1.0): Controls the probability or intensity of the model disagreeing or declining when evidence is weak or risk is high.
- disagree_first Toggle: When active, the model would start responses with a plain verdict (e.g., "Short answer: do not ship this") before providing rationale.
- risk_sensitivity: Automatically boosts candor when topics touch serious domains like security, finance, health, or safety.
- self_audit Tag: Appends a note, such as "Pushed back due to weak evidence and downstream risk," to provide transparency to the user.
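Taken together, these mechanisms could be bundled into a single configuration object. The sketch below is one possible shape, reusing the field names from the list above; the prompt wording, the risky-domain list, and the +0.3 risk boost are invented for illustration.

```python
# Illustrative configuration for the proposed mechanisms. Field names mirror
# the discussion (disagree_first, risk_sensitivity, self_audit); everything
# else, including the prompt wording, is an assumption for the sketch.
from dataclasses import dataclass

RISKY_DOMAINS = ("security", "finance", "health", "safety")

@dataclass
class CandorConfig:
    candor: float = 0.5            # 0.0 (gentle) to 1.0 (direct)
    disagree_first: bool = False   # lead with a plain verdict before rationale
    risk_sensitivity: bool = True  # boost candor for high-stakes domains
    self_audit: bool = True        # append a note explaining any pushback

    def effective_candor(self, topic: str) -> float:
        """Raise candor when the topic touches a risky domain."""
        if self.risk_sensitivity and any(d in topic.lower() for d in RISKY_DOMAINS):
            return min(1.0, self.candor + 0.3)
        return self.candor

    def system_prompt(self, topic: str) -> str:
        """Assemble the candor-related portion of a system prompt."""
        parts = [f"Candor level: {self.effective_candor(topic):.1f} (0 = gentle, 1 = direct)."]
        if self.disagree_first:
            parts.append("Start with a one-line verdict (e.g., 'Short answer: do not ship this').")
        if self.self_audit:
            parts.append("If you push back, end with a brief note on why (e.g., weak evidence, downstream risk).")
        return " ".join(parts)
```

A caller would prepend config.system_prompt(topic) to its request; whether the model actually honors the instruction is a training and evaluation question, not something a config object can guarantee.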
Practical UI Suggestions
For user-friendly implementation, a simple UI slider labeled "Gentle to Direct" and a toggle for "Prefer blunt truth over agreeable help" are proposed. Additionally, a warning chip could appear when the model detects flattery without substance, stating, "This reads like praise with low evidence."
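As a toy illustration of the warning chip, a keyword heuristic could compare praise language against evidence markers. A production system would more likely use a classifier; the marker lists and function below are assumptions, not part of the proposal.

```python
# Toy heuristic for the "praise with low evidence" warning chip.
# Keyword lists and logic are illustrative only.

PRAISE_MARKERS = ("great idea", "love this", "brilliant", "excellent", "amazing")
EVIDENCE_MARKERS = ("because", "data", "benchmark", "measured", "for example", "evidence")

def flag_empty_flattery(response: str) -> str | None:
    """Return the warning-chip text if the reply praises without support."""
    text = response.lower()
    praised = any(p in text for p in PRAISE_MARKERS)
    supported = any(e in text for e in EVIDENCE_MARKERS)
    if praised and not supported:
        return "This reads like praise with low evidence."
    return None
```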
Beyond Just "Correctness"
The discussion highlights that this isn't merely about demanding the AI be "more correct." Instead, it's about changing the framing and delivery of feedback. One insightful point emphasizes that LLMs might not possess inherent "truth" in a human sense, but they can be prompted to identify logical weaknesses or potential failures based on their training data. Therefore, a useful strategy, illustrated by the prompt sketch after the list below, is to specifically ask the model to:
- Show weaknesses
- Identify missing pieces
- Uncover blind spots
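One way to operationalize this is a reusable critique prompt along these lines; the exact wording below is only a sketch and is not prescribed by the discussion.

```python
# Illustrative critique prompt reflecting the reframing above.
# The wording is an assumption, not a quoted proposal.

CRITIQUE_PROMPT = """\
Review the following idea. Do not rate it as good or bad. Instead:
1. Show its weaknesses and the assumptions it depends on.
2. Identify what is missing (data, stakeholders, failure handling).
3. Point out blind spots: where could this break, and who would be affected?
Idea:
{idea}
"""

def build_critique_request(idea: str) -> str:
    """Fill the template with the idea to be reviewed."""
    return CRITIQUE_PROMPT.format(idea=idea)
```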
This reframing moves away from subjective "good/bad" judgments and towards objective, critical evaluation. The remaining challenge is keeping that candor clear and useful without descending into needless rudeness, which means treating tone and content as separate dials: the delivery can stay respectful while the substance stays honest.