paper review: “Large Language Models are Zero-Shot Reasoners”


key idea

  • add “Let’s think step by step” as a zero-shot prompt to the LLM input to get better results!

Background and Proposed Method

These days the paradigm for using large language models (LLMs) is not transfer learning but prompting. Because of their huge size, it is impractical to fine-tune an LLM for each downstream task, and LLMs are already trained well enough to handle downstream tasks through prompts alone. Previously, prompts were given in either a few-shot or a zero-shot manner.

But this work proposes using “chain-of-thought (CoT)” prompting in a zero-shot manner for multi-step, logical reasoning tasks. The difference is shown well in the following figure.

The trick is simply to add “Let’s think step by step” at the end of the prompt, which makes the LLM output a step-by-step reasoning chain that eventually leads to the correct answer. While the LLM struggles to produce the answer without any such prompt, adding this magical string seems to make the LLM do the task properly. This approach is named “zero-shot-CoT”.
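The trick can be sketched in a few lines. This is a minimal illustration, not the authors’ code: the only change from plain zero-shot prompting is the appended trigger sentence, and the `Q:`/`A:` template follows the paper’s examples.

```python
# The zero-shot-CoT trigger sentence from the paper.
TRIGGER = "Let's think step by step."

def zero_shot_prompt(question: str) -> str:
    # Plain zero-shot prompt: just ask for the answer.
    return f"Q: {question}\nA:"

def zero_shot_cot_prompt(question: str) -> str:
    # Zero-shot-CoT prompt: identical, plus the appended trigger.
    return f"Q: {question}\nA: {TRIGGER}"
```

Feeding the second prompt to an LLM elicits a reasoning chain instead of an immediate (often wrong) answer.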

How it works

“Zero-shot-CoT” works in two stages.

reasoning extraction

The first prompt is built by appending “Let’s think step by step” to the question. The LLM then generates a reasoning chain as its output.

answer extraction

The output text from the first stage contains many reasoning steps, and we want to extract only the answer from it. Therefore we take the full text generated so far, append an answer-format triggering string at the end, and feed this back to the LLM. The resulting output is expected to be the answer.
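The answer-extraction stage can be sketched as follows. This is a hedged illustration: `call_llm` is a hypothetical stand-in for any text-completion API, and the trigger string shown (“Therefore, the answer is”) follows the paper’s template for arithmetic tasks; other task types use task-specific variants.

```python
# Answer-extraction trigger (paper-style template for arithmetic tasks).
ANSWER_TRIGGER = "Therefore, the answer is"

def extract_answer(first_prompt: str, reasoning: str, call_llm) -> str:
    # Concatenate the first-stage prompt and its generated reasoning,
    # append the answer trigger, and prompt the model once more.
    second_prompt = f"{first_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    # The completion after the trigger is expected to be the bare answer.
    return call_llm(second_prompt).strip()
```

With a real model, `reasoning` would be the stage-one output; here `call_llm` can be any callable that maps a prompt string to a completion string.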


experiments

  • arithmetic reasoning
  • commonsense reasoning
  • symbolic reasoning
  • other logical reasoning tasks

Many LLMs of varying sizes have been tested.


  • comparing “zero-shot” and “zero-shot-CoT”: zero-shot-CoT performs better in most experiments, but not in commonsense reasoning tasks. There zero-shot-CoT fails, but it produces logically understandable mistakes.
  • zero-shot-CoT mostly underperforms few-shot CoT, but it performs better than plain few-shot prompting.
  • CoT is effective when the model size is large. When the model is small, CoT may actually decrease performance.
  • The authors experimented with different CoT-triggering sentences, and “Let’s think step by step” worked best.


