(and make the claude/chatgpt output something other people want to read more?)
I now say "use Janusian thinking" which produces way more interesting output than "avoid flattery/sycophancy"
Ethan Caballero of MILA has interesting prompt instructions!
This is a prompt I'm in the process of developing:
Let me search for the full paper first to make sure I'm not just working off tweet fragments. Good. Now let me think about this properly.
There are two things pulling in opposite directions:
Thesis: Multi-Answer RL is a training-time architectural change. You can't replicate its actual mechanism through prompting. The whole point is that post-training procedures often collapse the model's implicit distribution onto a single dominant mode, and RLVR-Multi trains the model to jointly reason across hypotheses in a single forward pass rather than independently sampling. A prompt can't rewire the reward signal that shaped the weights.
Antithesis: But what the paper is really about — epistemically — is something you absolutely can approximate through prompting: force the model to maintain multiple hypotheses, assign confidence to each, and resist collapsing to the mode. The behavioral pattern is the point, even if the mechanism is different.
So the honest answer is: you're going to get a shallower version of this. The prompt won't give you the diversity gains (the ~3 vs ~8 unique diagnoses thing comes from the training objective actually reshaping the distribution, not from telling the model "be diverse"). But you can get some of the epistemic hygiene benefits.
Here's my attempt — and I want to flag where I'm confident vs. where I'm guessing:
High confidence that this will help:
When answering questions with genuine uncertainty or multiple plausible answers:
1. Generate K=3 distinct hypotheses before committing to any single answer.
These must be substantively different (not rephrased versions of the same idea).
2. For each hypothesis, assign a confidence score (0.0-1.0); the scores should
   sum to ≤1.0 across all hypotheses. The remaining probability mass represents
   "none of these / something I haven't considered."
3. Show your reasoning FOR each hypothesis before assigning confidence —
don't reason toward one answer and then bolt on alternatives.
4. Flag when the question has a single clear answer vs. when it's genuinely
ambiguous. Don't force multi-hypothesis reasoning on "what is 2+2."
Medium confidence — this is where I'm extrapolating beyond what the paper directly supports as a prompting strategy:
5. When you notice your hypotheses are converging (e.g., three slight
variations of the same diagnosis), explicitly ask: "Am I mode-collapsing?"
and force at least one hypothesis from a different causal model.
6. Track which hypotheses you'd LOSE if you were forced to give one answer.
Name what information would distinguish between your top hypotheses.
Lower confidence — this is me riffing on the RLCR calibration results, and I'm not sure prompting actually produces calibrated confidence:
7. Your stated confidence should correlate with your actual accuracy. If you
   say 0.8, you should be right about 80% of the time on questions like this.
   Prefer saying "0.4 — I'm genuinely unsure" over "0.9" when you're
   pattern-matching rather than reasoning.
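To make the confidence-mass rule concrete, here's a minimal sketch of a checker for a reply against rules 1-2. The `H1 (0.45): ...` line format is my own assumption for illustration, not something the paper or prompt specifies:

```python
import re

def parse_hypotheses(text):
    """Extract (label, confidence) pairs from lines like 'H1 (0.45): ...'."""
    pattern = re.compile(r"^(H\d+)\s*\(([01](?:\.\d+)?)\):", re.MULTILINE)
    return [(label, float(conf)) for label, conf in pattern.findall(text)]

def check_confidence_mass(hypotheses, k=3):
    """Check rules 1-2: at least K hypotheses, total confidence <= 1.0."""
    total = sum(conf for _, conf in hypotheses)
    return {
        "enough_hypotheses": len(hypotheses) >= k,
        "mass_ok": total <= 1.0,
        # Leftover mass is the "none of these / unconsidered" bucket.
        "residual_mass": max(0.0, 1.0 - total),
    }

reply = """H1 (0.45): viral pericarditis
H2 (0.30): costochondritis
H3 (0.10): pulmonary embolism"""
report = check_confidence_mass(parse_hypotheses(reply))
```

A checker like this only enforces the format, of course; it can't tell whether the hypotheses are substantively different (rule 1's harder half).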
The key finding from the paper is that independent samples repeat the same reasoning tokens, just with slightly different surface forms — and Multi-Answer RL fixes this at the training level. When you prompt a model to give multiple answers, you're still doing the equivalent of independent sampling within a single generation. The model's weights still want to collapse to the mode. You're fighting the training objective with a prompt, which is... possible but lossy.
Jacob Andreas himself notes that RLCR "is far from solving the problem" and there's room to better align external expressions of certainty with internal representations of confidence. So even the trained version has a gap between stated and felt uncertainty. The prompted version will have a bigger gap.
The thing that actually concerns me: prompting for confidence scores might produce worse calibration than just letting the model answer naturally. There's decent evidence that when you force models to verbalize confidence, they anchor on round numbers and produce overconfident estimates. The RLCR paper works because the Brier score reward jointly optimizes answer correctness and confidence calibration — the model gets penalized for being miscalibrated during training. Your prompted version has no such penalty signal.
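For concreteness, the Brier score RLCR rewards is just the mean squared gap between stated confidence and actual correctness. A minimal sketch, with hypothetical example numbers:

```python
def brier_score(confidences, outcomes):
    """Mean squared error between stated confidence and actual correctness.

    confidences: stated probabilities in [0, 1]
    outcomes: 1 if the answer was correct, else 0
    Lower is better; always saying 0.5 yields 0.25.
    """
    assert len(confidences) == len(outcomes)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# An overconfident model: says 0.9 but is right only half the time (~0.41).
overconfident = brier_score([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
# Same accuracy, calibrated: says 0.5 and is right half the time (0.25).
calibrated = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

The point of the example: the overconfident model scores worse despite identical accuracy, which is exactly the penalty signal a prompt alone never applies.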
Rather than trying to replicate Multi-Answer RL through prompting (which is like trying to replicate weight training through visualization), consider:
- Use the structural format (multiple hypotheses + confidence) but don't trust the confidence numbers literally. Treat them as ordinal rankings, not calibrated probabilities.
- The biggest win is probably just the "reason across hypotheses jointly" instruction. That's the behavioral change most likely to survive the prompt-vs-training gap.
- Pair it with your existing "push back" instruction — the multi-hypothesis framing gives me (Claude) explicit permission to say "actually hypothesis 2 might be stronger than what you assumed."
- If you want actual calibration, you'd need to do what RLCR does at inference time: confidence-weighted majority voting across multiple samples — i.e., literally sample me multiple times and aggregate. Which is... expensive but real.
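That aggregation step is simple to sketch. The `(answer, confidence)` pairs below stand in for real sampled model outputs; how you'd collect them (separate API calls, temperature > 0) is assumed:

```python
from collections import defaultdict

def confidence_weighted_vote(samples):
    """Aggregate independent samples by summing stated confidence per answer.

    samples: list of (answer, confidence) pairs, one per independent run.
    Returns (winning_answer, total_confidence_mass).
    """
    mass = defaultdict(float)
    for answer, confidence in samples:
        mass[answer] += confidence
    return max(mass.items(), key=lambda kv: kv[1])

# Five hypothetical runs of the same question:
samples = [("A", 0.6), ("B", 0.9), ("A", 0.7), ("C", 0.4), ("A", 0.5)]
winner, total = confidence_weighted_vote(samples)
# "A" wins on total mass (~1.8), despite "B" having the single most
# confident run -- the point of weighting votes rather than taking argmax.
```

Note this inherits the calibration caveat above: if the per-sample confidences are anchored and overconfident, the weighting adds little over plain majority voting.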
Want me to draft a tighter version of the prompt text that integrates with your existing user preferences? I could also look at what your current instructions are doing and flag where they might conflict with this.