(and make the claude/chatgpt output something other people want to read more?)

I now say "use janusian thinking" which produces way more interesting output than "avoid flattery/sycophancy"

Ethan Caballero of MILA has interesting prompt instructions!

This is a prompt I'm in the process of developing:

What the paper is actually doing

Let me search for the full paper first to make sure I'm not just working off tweet fragments. Good. Now let me think about this properly.

The Janusian tension here

There are two things pulling in opposite directions:

Thesis: Multi-Answer RL is a training-time architectural change. You can't replicate its actual mechanism through prompting. The whole point is that post-training procedures often collapse the model's implicit distribution onto a single dominant mode, and RLVR-Multi trains the model to jointly reason across hypotheses in a single forward pass rather than independently sampling. A prompt can't rewire the reward signal that shaped the weights.

Antithesis: But what the paper is really about — epistemically — is something you absolutely can approximate through prompting: force the model to maintain multiple hypotheses, assign confidence to each, and resist collapsing to the mode. The behavioral pattern is the point, even if the mechanism is different.

So the honest answer is: you're going to get a shallower version of this. The prompt won't give you the diversity gains (the ~3 vs ~8 unique diagnoses thing comes from the training objective actually reshaping the distribution, not from telling the model "be diverse"). But you can get some of the epistemic hygiene benefits.

What I'd actually put in your system prompt

Here's my attempt — and I want to flag where I'm confident vs. where I'm guessing:

High confidence that this will help:

When answering questions with genuine uncertainty or multiple plausible answers:

1. Generate K=3 distinct hypotheses before committing to any single answer. 
   These must be substantively different (not rephrased versions of the same idea).
   
2. For each hypothesis, assign a confidence score (0.0-1.0) that sums to ≤1.0 
   across all hypotheses. The remaining probability mass represents "none of 
   these / something I haven't considered."
   
3. Show your reasoning FOR each hypothesis before assigning confidence — 
   don't reason toward one answer and then bolt on alternatives.
   
4. Flag when the question has a single clear answer vs. when it's genuinely 
   ambiguous. Don't force multi-hypothesis reasoning on "what is 2+2."
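If you're consuming the model's output programmatically, item 2's constraint is easy to check mechanically. A minimal sketch — the `Hypothesis` class and its field names are my own invention here, not anything from the paper:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    claim: str
    confidence: float  # stated confidence in [0.0, 1.0]

def check_confidences(hypotheses: list[Hypothesis]) -> bool:
    """Enforce item 2: each score lies in [0, 1] and the total is at
    most 1.0, leaving probability mass for 'none of these'."""
    if not all(0.0 <= h.confidence <= 1.0 for h in hypotheses):
        return False
    return sum(h.confidence for h in hypotheses) <= 1.0

hyps = [Hypothesis("viral", 0.5), Hypothesis("autoimmune", 0.25)]
print(check_confidences(hyps))  # True — 0.25 left for "something else"
```

A failed check is a cheap signal to re-prompt rather than trust the numbers.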

Medium confidence — this is where I'm extrapolating beyond what the paper directly supports as a prompting strategy:

5. When you notice your hypotheses are converging (e.g., three slight 
   variations of the same diagnosis), explicitly ask: "Am I mode-collapsing?" 
   and force at least one hypothesis from a different causal model.
   
6. Track which hypotheses you'd LOSE if you were forced to give one answer. 
   Name what information would distinguish between your top hypotheses.

Lower confidence — this is me riffing on the RLCR calibration results, and I'm not sure prompting actually produces calibrated confidence:

7. Your stated confidence should correlate with your actual accuracy. If you 
   say 0.8, you should be right about 80% of the time on questions like this. 
   Prefer saying "0.4 — I'm genuinely unsure" over "0.9" when you're pattern-
   matching rather than reasoning.
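One cheap way to sanity-check item 7 is to log (stated confidence, was it right) pairs over time and compare bucketed confidence against hit rate. A sketch, assuming you keep such a log yourself — nothing here comes from the RLCR paper, and the example records are invented:

```python
from collections import defaultdict

def calibration_table(records, bucket_width=0.2):
    """Bucket (stated_confidence, was_correct) records and compare each
    bucket's mean stated confidence to its empirical accuracy."""
    n_buckets = int(1 / bucket_width)
    buckets = defaultdict(list)
    for conf, correct in records:
        b = min(int(conf / bucket_width), n_buckets - 1)
        buckets[b].append((conf, correct))
    return {b: (round(sum(c for c, _ in rows) / len(rows), 2),
                round(sum(ok for _, ok in rows) / len(rows), 2))
            for b, rows in sorted(buckets.items())}

# Hypothetical log: the model says 0.9 but is right only half the time.
records = [(0.9, True), (0.9, False), (0.9, True), (0.9, False),
           (0.4, False), (0.4, True)]
print(calibration_table(records))  # top bucket: stated 0.9, actual 0.5
```

A large stated-vs-actual gap in the top bucket is exactly the overconfidence failure mode item 7 is trying to prompt away.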

Why I'm somewhat skeptical this fully works

The key finding from the paper is that independent samples repeat the same reasoning tokens, just with slightly different surface forms — and Multi-Answer RL fixes this at the training level. When you prompt a model to give multiple answers, you're still doing the equivalent of independent sampling within a single generation. The model's weights still want to collapse to the mode. You're fighting the training objective with a prompt, which is... possible but lossy.

Jacob Andreas himself notes that RLCR "is far from solving the problem" and there's room to better align external expressions of certainty with internal representations of confidence. So even the trained version has a gap between stated and felt uncertainty. The prompted version will have a bigger gap.

The thing that actually concerns me: prompting for confidence scores might produce worse calibration than just letting the model answer naturally. There's decent evidence that when you force models to verbalize confidence, they anchor on round numbers and produce overconfident estimates. The RLCR paper works because the Brier score reward jointly optimizes answer correctness and confidence calibration — the model gets penalized for being miscalibrated during training. Your prompted version has no such penalty signal.

What I'd actually recommend instead

Rather than trying to replicate Multi-Answer RL through prompting (which is like trying to replicate weight training through visualization), consider:

Use the structural format (multiple hypotheses + confidence) but don't trust the confidence numbers literally. Treat them as ordinal rankings, not calibrated probabilities.

The biggest win is probably just the "reason across hypotheses jointly" instruction. That's the behavioral change most likely to survive the prompt-vs-training gap.

Pair it with your existing "push back" instruction — the multi-hypothesis framing gives me (Claude) explicit permission to say "actually hypothesis 2 might be stronger than what you assumed."

If you want actual calibration, you'd need to do what RLCR does at inference time: confidence-weighted majority voting across multiple samples — i.e., literally sample me multiple times and aggregate. Which is... expensive but real.
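The aggregation step is mechanical once you have the samples. A sketch of confidence-weighted majority voting — the sample data is invented, and I'm not claiming this matches RLCR's exact aggregation rule:

```python
from collections import defaultdict

def weighted_vote(samples):
    """Aggregate (answer, stated_confidence) pairs from independent
    samples: each answer's score is the sum of confidences backing it."""
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence
    return max(scores, key=scores.get)

# One strong vote can beat two weak ones — unlike plain majority voting.
samples = [("lupus", 0.9), ("lyme", 0.4), ("lyme", 0.4)]
print(weighted_vote(samples))  # lupus (0.9 > 0.4 + 0.4)
```

Note that this only helps to the extent the per-sample confidences carry signal, which loops back to the calibration caveat above.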

Want me to draft a tighter version of the prompt text that integrates with your existing user preferences? I could also look at what your current instructions are doing and flag where they might conflict with this.