Logic vs. LLMs

LLMs and rain dance, OR: a little exercise in logic (and why I doubt LLMs will be commoditized soon)

When I started studying Physics at LMU in Munich, we were the guinea pigs. Our class was the first under the new Bachelor/Master system that was mandated EU-wide, which meant a somewhat radical departure from the German “Diplom”, a degree that gave a lot more freedom and allowed for more specialization.

Anyway, part of this was a delightful maths class that could fill half a book with anecdotes, but I want to jump to one of the more memorable exercises.

One exercise was to logically invert the following sentence:

“If it rains tomorrow, I will go to the vet and visit my grandma by train or bake a cake”

Ok, here we go. A room full of students, pretty much everyone with a strong focus on maths and physics. And boy, did we have no clue how to approach the problem.

“If it does not rain tomorr–” – “NOPE!” – “Unless it rain–” – “NEXT!” – “If there’s no rai–” – “WRONG!”

We didn’t really seem to know what a logical inversion was.

Now you can argue that

  1. “if … then” in natural language is not quite the same as a material implication, so the sentence lacks some precision in this regard
  2. the sentence is arguably ill-defined, because it is unclear what the “or” applies to

Before you dismiss those as the ramblings of an unsuccessful blogger, look at the discussion of 1) in Wikipedia, and the Supreme Court discussing 2), i.e. whether ‘and’ means ‘and’ or ‘or’.

Truth Tables

To approach the problem systematically, you need to recognize that “if… then…” is the natural-language equivalent of a material implication, a => b.

The truth table for an implication a => b is:

| a | b | a => b |
|---|---|--------|
| T | T | T |
| T | F | F |
| F | T | T |
| F | F | T |

The confusing part here is why F => F is true and even F => T is true. Strictly speaking, the sentence doesn’t say anything about what happens if it doesn’t rain, so we need to be careful about that in our negation.
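The table can also be checked mechanically. A minimal Python sketch (the function name is mine, not standard notation):

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """Material implication: a => b is false only when a is true and b is false."""
    return (not a) or b

# Print the truth table for a => b
print(f"{'a':6}{'b':6}a => b")
for a, b in product([True, False], repeat=2):
    print(f"{str(a):6}{str(b):6}{implies(a, b)}")
```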

Note that deriving anything from a false premise is still logically valid. The Latin phrase for this is ex falso quodlibet: “from a falsehood, anything follows.”

So we want to find the negation: an expression whose truth value is flipped in every row of this table. For an implication a => b, that is a ∧ ¬b: the only way to break the promise “if a, then b” is for a to happen while b does not.
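Applied to our full sentence: writing P for “it rains tomorrow”, Q for “go to the vet”, R for “visit grandma by train”, and S for “bake a cake”, the negation of P → (Q ∧ (R ∨ S)) is P ∧ (¬Q ∨ (¬R ∧ ¬S)). A brute-force check over all 16 truth assignments confirms it (a sketch; the variable names are mine):

```python
from itertools import product

def implies(a, b):
    # Material implication: false only when a is true and b is false
    return (not a) or b

def original(p, q, r, s):
    # "If it rains tomorrow (P), I will go to the vet (Q)
    #  and visit my grandma by train (R) or bake a cake (S)"
    return implies(p, q and (r or s))

def negation(p, q, r, s):
    # Candidate: it rains, and (I skip the vet, or I neither
    # take the train to grandma nor bake a cake)
    return p and ((not q) or ((not r) and (not s)))

# A true negation must disagree with the original on every assignment
assert all(original(*v) != negation(*v) for v in product([True, False], repeat=4))
print("negation verified on all 16 assignments")
```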

How do LLMs fare?

TL;DR: Not much better than we did.

As a reminder, the symbolic form of the sentence is:

$$ P \rightarrow \left( Q \land \left( R \lor S \right) \right) $$

Now you could argue that passing the Turing test might not require getting this right; in fact, an answer that is too intelligent might even tip off the human, since it is unlikely that another human would answer questions of this depth, across different fields, so thoughtfully.

[Screenshot: promptshuttle, showing the prompt and results with some metadata]

Here’s the best one, from Claude 3 Opus:

Here is the logical inversion of the given sentence:

If I do not go to the vet or do not visit my grandma by train, and I do not bake a cake, then it will not rain tomorrow.

To break it down:
- The original sentence: If P (it rains tomorrow), then Q (I will go to the vet), and R (visit my grandma by train) or S (bake a cake).
- Symbolically: P → (Q ∧ (R ∨ S))

The logically inverted sentence would be:
- If not Q (I do not go to the vet) or not (R (visit my grandma by train) or S (bake a cake)), then not P (it will not rain tomorrow).
- Symbolically: (¬Q ∨ ¬(R ∨ S)) → ¬P

Therefore, the logically inverted sentence is:
If I do not go to the vet or do not visit my grandma by train, and I do not bake a cake, then it will not rain tomorrow.

Claude correctly identifies that the inversion is also an implication, and that the antecedent and the consequent have to be swapped and negated.

Unfortunately, the inverted sentence is still slightly off: it shuffles the sentence parts rather stiffly, without considering the grammar, and ends up implying that my behaviour today can somehow change tomorrow’s weather…

[Screenshot: promptshuttle, showing the prompt and results with some metadata]

The Role Prompting Effect

All the results above were generated without any system prompt — just the raw question, the way an average person might ask it. And the results are strikingly human: most models stumble in much the same way our physics class did.

But here’s where it gets interesting. Prefix the same question with “You are a logician” or “You are a mathematician”, and the behaviour changes dramatically. Models that previously just inverted the antecedent suddenly produce structured breakdowns, identify the implication correctly, and apply De Morgan’s laws to arrive at the right answer.

This tells us something important about what LLMs actually are. Without role prompting, they’re drawing on the full distribution of how people in general respond to a question like this — and most people get it wrong, just like we did in that lecture hall. The model isn’t failing at logic; it’s accurately reflecting the average human response.

When you prime the model with a role, you’re effectively narrowing the distribution to a specific subset of training data: text written by logicians, for logicians. The underlying capability was there all along — it just needed the right context to surface.

This has practical implications beyond parlour tricks. If you’re using LLMs for anything that requires rigour — code review, data validation, contract analysis — the difference between a generic prompt and a well-chosen role can be the difference between a wrong answer and a correct one. The model doesn’t become smarter; you’re just pointing it at the right part of its knowledge.

Summary

Given a somewhat tricky and perhaps ill-defined logical task, we see greatly varying output from current LLMs. GPT-4 Turbo, GPT-4o, Gemini 1.5 Pro and Mixtral 8x7b all respond with the output sentence and nothing else. The GPTs incorrectly just invert the antecedent. They all responded within 2 seconds except Gemini, which took 4s.

Llama and Opus both respond with a lot more text, explaining their thought process by default. Llama (on Groq) is also in the 2s range, whereas Opus takes roughly as long as all the other models combined, but it is the only LLM to generate a thoughtful answer that actually inverts the expression.

This is without any prompting techniques or tricks, and I believe it gives a hint at how different the default behaviour of models currently is. Simply routing a prompt to a different LLM can dramatically alter the outcomes, both in terms of the actual semantic result as well as the syntactic presentation of it.

More importantly, role prompting reveals that the gap between “gets it wrong” and “gets it right” often isn’t about model capability — it’s about context. LLMs without a role prompt replicate the mistakes an average human would make. With the right role, they tap into specialist-level reasoning. The models aren’t broken; they’re mirrors. And mirrors show you what you put in front of them.
