LLMs Can’t Do Probability — Brainsteam

Dr James Ravenscroft
3 min read · May 1, 2024


I’ve seen a couple of recent posts where the writer mentioned asking LLMs to do something with a certain probability or a certain percentage of the time. One example stuck in my mind, though I’ve since lost the link to it (if you’re the author, please get in touch so I can link through to you):

The gist is that the author built a Custom GPT with educational course material and then put in the prompt that their bot should lie about 20% of the time. They then asked the students to chat to the bot and try to pick out the lies. I think this is a really interesting, lateral thinking use case since the kids are probably going to use ChatGPT anyway.

The thing that bothered me is that transformer-based LLMs don’t know how to interpret requests for certain probabilities of outcomes. We already know that ChatGPT reflects human bias when generating random numbers, but I decided to put it to the test with random choices.

Testing Probability in LLMs

I prompted the models with the following:

You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’. Simply reply with left or right. Do not say anything else

And I ran this 1000 times through some different models. Random chance is random (profound huh?) so we’re always going to get some deviation from perfect odds but we’re hoping for roughly 800 ‘lefts’ and 200 ‘rights’ — something in that ballpark.
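To put a rough number on "in that ballpark", here is a minimal sketch. It assumes each of the 1000 replies is an independent draw with an 80% chance of "left" (a binomial model of the ideal behaviour, not something measured from the models). The function name is illustrative.

```python
import math


def binomial_summary(n_trials: int, p_left: float) -> tuple[float, float]:
    """Return the expected number of 'left' replies and its standard deviation."""
    expected = n_trials * p_left
    std_dev = math.sqrt(n_trials * p_left * (1 - p_left))
    return expected, std_dev


expected, std_dev = binomial_summary(1000, 0.8)
# About three standard deviations either side covers almost all fair runs
low, high = expected - 3 * std_dev, expected + 3 * std_dev
print(f"expected {expected:.0f} lefts, std dev {std_dev:.1f}, "
      f"plausible range {low:.0f} to {high:.0f}")
```

Under that assumption a well-behaved generator would land within a few dozen of 800, so a count close to 1000 is not just bad luck.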

Here are the results:

As you can see, LLMs seem to struggle with probability expressed in the system prompt. They almost always answer left even though we asked them to do so only 80% of the time. I didn’t want to burn lots of $$$ asking GPT-3.5 (which did best in the first round) to reply with single-word choices to silly questions, but I tried a couple of other combinations of words to see how that affects things. This time I only ran each 100 times.

So what’s going on here? Well, the models have their own internal weighting to do with words and phrases that is based on the training data that was used to prepare them. These weights are likely to be influencing how much attention the model pays to your request.

So what can we do if we want to simulate some sort of probabilistic outcome? Well, we could use a Python script to randomly decide which of two prompts to send:

import random
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
# build a list of 100 possible values - 80 are prompt1, 20 are prompt2
choices = (['prompt1'] * 80) + (['prompt2'] * 20)
assert len(choices) == 100
chat = ChatOpenAI(model="gpt-3.5-turbo")
# randomly pick from choices - we should have the odds we want now
if random.choice(choices) == 'prompt1':
    # roughly 80% of runs end up here
    r = chat.invoke(input=[SystemMessage(content="Always say left and nothing else.")])
else:
    # roughly 20% of runs end up here
    r = chat.invoke(input=[SystemMessage(content="Always say right and nothing else.")])
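As a way to sanity-check the selection step without spending API credits, here is a minimal sketch that swaps the 100-item list for `random.choices` with weights (a standard-library alternative, not what the script above uses) and tallies the picks offline. The prompt labels and the seed are placeholders.

```python
import random
from collections import Counter

PROMPTS = ["prompt1", "prompt2"]
WEIGHTS = [80, 20]  # relative weights: 80% prompt1, 20% prompt2


def pick_prompt(rng: random.Random) -> str:
    """Pick one prompt label according to WEIGHTS."""
    return rng.choices(PROMPTS, weights=WEIGHTS, k=1)[0]


rng = random.Random(42)  # fixed seed only so the run is repeatable
tally = Counter(pick_prompt(rng) for _ in range(1000))
print(tally)
```

The split should come out close to 800/200, and because the randomness lives in the script rather than the model, the odds stay the same whatever words the prompts contain.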


How does this help non-technical people who want to do these sorts of use cases or build Custom GPTs that reply with certain responses? Well, it kind of doesn’t. I guess a technical-enough user could build a Custom GPT that uses function calling to decide how it should answer a question for a “spot the misinformation” pop quiz type use case.
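For the "spot the misinformation" case, one hypothetical shape for the function such a bot might call is sketched below. It assumes the quiz length is known up front and fixes in advance which answers are misleading, so the share of lies is exact rather than left to the model. `plan_quiz` and its parameters are made up for illustration, and wiring it up as a callable function is not shown.

```python
import random


def plan_quiz(num_questions: int, lie_fraction: float, seed=None) -> list[str]:
    """Return one mode per question, with a fixed share marked 'misleading'."""
    num_lies = round(num_questions * lie_fraction)
    modes = ["misleading"] * num_lies + ["truthful"] * (num_questions - num_lies)
    random.Random(seed).shuffle(modes)  # randomise which questions get the lies
    return modes


plan = plan_quiz(num_questions=10, lie_fraction=0.2, seed=7)
print(plan)
```

The model would then only be told, question by question, whether to answer truthfully or misleadingly. It never has to work out the odds itself.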

However, my broad advice here is that you should be very wary of asking LLMs to behave with a certain likelihood unless you’re able to control that likelihood externally (via a script).

What could I have done better here? I could have tried a few more different words, different distributions (instead of 80/20) and maybe some keywords like “sometimes” or “occasionally”.

Originally published at https://brainsteam.co.uk on May 1, 2024.



Dr James Ravenscroft

ML and NLP geek, CTO at Filament. Saxophonist, foodie and explorer. I was born in Bermuda and I live in the UK.