Improving LLM’s Reasoning In Production - The Structured Approach

This guide on improving the reasoning performance of LLMs complements our guide on prompt engineering, which explains how to improve LLM setups by systematic, programmatic means that allow flexibility while keeping operational variability to a minimum. The language component of a prompt can also be inspected and improved, a practice known as in-context learning (ICL). A growing body of papers shows how to get better reasoning out of LLMs through this technique, which is all about the choice and ordering of words in a prompt. For anyone interested, the well-known NLP technique Retrieval-Augmented Generation (RAG) can be seen as a sub-technique of ICL. Let’s dive into an overview of some of the interesting word games that have been unearthed so far.

Chain of Thought (CoT) Prompting (Zero shot and Few shot Examples)

Chain of Thought Zero Shot Example Prompt from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac2efa338e7d6eb96ab22_Untitled (6).png

Chain of Thought Few Shot Example Prompt from the paper:

Zero-shot CoT uses special instructions to trigger CoT prompting without any examples. For example, the instruction “Let’s think step by step” works well for many tasks. The final answer is extracted from the last step of the thinking process.

Few-shot CoT uses some examples of questions and answers with reasoning chains to guide LLMs to use CoT prompting. The examples are given before the actual question. The final answer is also extracted from the last step of the thinking process.

In a nutshell, CoT prompting improves the performance of LLMs on many tasks, especially math and logic problems. However, it does not work for all tasks and sometimes gives wrong or multiple answers.
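To make this concrete, here is a minimal sketch in Python of how the two variants can be assembled as plain prompt strings. The call_llm helper and the bakery question are our own placeholders, not taken from the paper; the worked example in the few-shot prompt is the tennis-ball example popularised by the CoT paper.

# Minimal sketch of zero-shot and few-shot CoT prompts.
# `call_llm` is a placeholder for your own model client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your provider's API call here.")

QUESTION = (
    "A bakery sells 12 muffins per tray. It bakes 7 trays and sells 80 muffins. "
    "How many muffins are left?"
)

# Zero-shot CoT: a trigger phrase, no examples.
zero_shot_prompt = (
    f"Q: {QUESTION}\n"
    "A: Let's think step by step."
)

# Few-shot CoT: worked examples with reasoning chains placed before the real question.
few_shot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. "
    "The answer is 11.\n\n"
    f"Q: {QUESTION}\n"
    "A:"
)

# The final answer is read from the last step of the generated reasoning, e.g.:
# answer = call_llm(few_shot_prompt).split("The answer is")[-1].strip()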

Reflexion Prompting (Reflect on x first and then reply)

Reflexion Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac29da4b01c81f001a769_Untitled (1).png

Reflexion works by having agents verbally reflect on the feedback they receive from a task and use their own reflections to improve the next attempt. It does not require changing model weights and can handle different kinds of feedback, whether numerical scores or free-form text. It outperforms baseline agents and is especially effective for tasks such as decision making and code generation. Because the LLM flags its output as “reflected”, it is less useful in vanilla settings where being perceived as human is important for the success of the product.
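Below is a simplified sketch of the reflect-then-retry loop. The call_llm and evaluate functions are stand-ins for your own model client and task checker; the actual Reflexion agent described in the paper is more elaborate than this.

# Simplified Reflexion-style loop: attempt, collect feedback, reflect verbally, retry.
# `call_llm` and `evaluate` are placeholders for your model client and task feedback.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your provider's API call here.")

def evaluate(attempt: str) -> tuple[bool, str]:
    """Return (passed, feedback), e.g. from unit tests or a scoring function."""
    raise NotImplementedError

def reflexion_loop(task: str, max_trials: int = 3) -> str:
    reflections: list[str] = []
    for _ in range(max_trials):
        memory = "\n".join(f"- {r}" for r in reflections)
        attempt = call_llm(
            f"Task: {task}\n"
            f"Lessons from previous attempts:\n{memory or '- none yet'}\n"
            "Produce your best solution."
        )
        passed, feedback = evaluate(attempt)
        if passed:
            return attempt
        # Verbal self-reflection on the feedback, stored as episodic memory for the next trial.
        reflections.append(call_llm(
            f"Task: {task}\nAttempt: {attempt}\nFeedback: {feedback}\n"
            "In two sentences, explain what went wrong and what to do differently."
        ))
    return attempt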

Directional Stimulus Prompt (Do x while taking y related to x into account)

Directional Stimulus Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac2aa9cd8d1e54d9fc174_Untitled (2).png

Directional stimulus prompts act as instance-specific additional inputs to guide LLMs in generating desired outcomes, such as including specific keywords in a generated summary. It’s very simple and straightforward. Like prompt alternating, you are most likely already using this technique in your setups!
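A sketch of what this looks like in practice: the hint keywords are appended to the instruction as an instance-specific stimulus. The article placeholder and the keywords below are made up for illustration; in the paper the hints can come from a small policy model rather than being hard-coded.

# Directional stimulus sketch: the task instruction plus instance-specific hint keywords.
# The hints here are hard-coded for illustration.

ARTICLE = "..."  # the text to summarize

# Instance-specific stimulus, e.g. keywords the summary should cover.
hint_keywords = ["Q3 revenue", "supply chain", "guidance raised"]

prompt = (
    f"Article: {ARTICLE}\n\n"
    "Write a two-sentence summary of the article.\n"
    f"Hint: make sure the summary mentions: {', '.join(hint_keywords)}."
)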

Tree of Thoughts (ToT) Prompting (Multiple experts vote on the right answer)

Tree of Thoughts Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac3adf60de0f185288b7b_Untitled (8).png

The Tree of Thoughts technique is effective for solving logic problems, such as finding the most likely location of a lost watch. It asks the LLM to imagine three different experts who are trying to answer a question based on a short story. The LLM has to generate each expert’s steps of thinking, along with their critiques of one another and the likelihood of their assertions; it also has to follow the rules of science and physics, and backtrack or correct itself if it finds a flaw in its logic. ToT is fun for riddles and very specific problems. It is quite an academic technique and only useful for very specific, conversational AI use cases.
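For reference, the popular “three experts” phrasing of ToT boils down to a single prompt along the lines of the sketch below. The wording and the story placeholder are illustrative, not the exact prompt from the paper.

# Tree of Thoughts, "three experts" variant: one prompt that asks the model to
# branch, critique and prune its own reasoning. Wording is illustrative.

STORY = "..."  # short story describing where the watch might have been lost

tot_prompt = (
    f"Story: {STORY}\n\n"
    "Imagine three different experts are answering the question below.\n"
    "All experts write down one step of their thinking, then share it with the group.\n"
    "Then all experts go on to the next step, and so on.\n"
    "After each step, the experts critique each other's reasoning and state how likely "
    "their own assertion is to be correct.\n"
    "If any expert realises their reasoning breaks the rules of science or physics, "
    "they backtrack or correct themselves.\n"
    "Question: Based on the story, where is the watch most likely to be?"
)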

Reasoning And Acting (ReAct) Prompting (Combining reasoning step with possibility to act)

ReAct Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac35a5bf4d19bcbe68207_Untitled (7).png

ReAct prompts consist of four components: a primary prompt instruction, ReAct steps, reasoning thoughts, and action commands. Setting up relevant knowledge sources and APIs that let the LLM actually perform the actions is vital. We do not recommend using ReAct in high-stakes environments.
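A stripped-down ReAct loop might look like the sketch below. The tool set, the Thought/Action/Observation format and the string parsing are our own assumptions for illustration; production setups need far more guard rails, which is exactly why we would keep this out of high-stakes paths.

# Stripped-down ReAct loop: the model alternates Thought / Action / Observation
# until it emits a final answer. Tools, parsing and stop conditions are illustrative.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your provider's API call here.")

def search_wiki(query: str) -> str:
    raise NotImplementedError("Plug in your knowledge source or API here.")

TOOLS = {"search": search_wiki}

SYSTEM = (
    "Answer the question by interleaving Thought, Action and Observation lines.\n"
    "Actions look like: Action: search[<query>]\n"
    "When you know the answer, write: Final Answer: <answer>"
)

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"{SYSTEM}\n\nQuestion: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action: search[" in step:
            query = step.split("Action: search[")[-1].split("]")[0]
            # Feed the tool result back in as an Observation for the next step.
            transcript += f"Observation: {TOOLS['search'](query)}\n"
    return "No answer within the step budget."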

Reasoning WithOut Observation (ReWOO) Prompting (Experts with clearly separated roles, eg. Planner and Solver)

Reasoning WithOut Observation Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac32d5d6af89f75fe18ad_Untitled (4).png

ReWOO performs better than ReAct despite not relying on current and previous observations. ReAct suffers from tool failures, action loops, and lengthy prompts, while ReWOO can generate reasonable plans but sometimes has incorrect expectations or wrong conclusions. Improving the tool responses and the Solver prompt is therefore vital for good ReWOO reasoning performance.
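The separation of roles can be sketched as below: a Planner writes the full plan up front with placeholder slots for tool results, a Worker fills those slots in, and a Solver produces the final answer; no intermediate observation is fed back into planning. The plan format, slot syntax and parsing are illustrative assumptions, not the exact implementation from the paper.

# ReWOO-style sketch: Planner -> Worker -> Solver, with no observation fed back
# into the planning step. Plan format and parsing are illustrative.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your provider's API call here.")

def run_tool(name: str, arg: str) -> str:
    raise NotImplementedError("Plug in your tools (search, calculator, ...) here.")

def rewoo(question: str) -> str:
    # 1. Planner: the whole plan in one shot, with #E1, #E2, ... as evidence slots.
    plan = call_llm(
        "Write a numbered plan to answer the question. Each step has the form\n"
        "#E<n> = <tool>[<input>] and later steps may reference earlier #E slots.\n"
        f"Question: {question}"
    )
    # 2. Worker: execute every step, substituting earlier evidence into later inputs.
    evidence: dict[str, str] = {}
    for slot, tool, arg in re.findall(r"(#E\d+)\s*=\s*(\w+)\[(.*?)\]", plan):
        for k, v in evidence.items():
            arg = arg.replace(k, v)
        evidence[slot] = run_tool(tool, arg)
    # 3. Solver: reasons over the plan plus the collected evidence only.
    return call_llm(
        f"Question: {question}\nPlan:\n{plan}\nEvidence:\n{evidence}\n"
        "Give the final answer."
    )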

A related technique, Chain of Hindsight (CoH), is easy to optimize and does not rely on reinforcement learning or reward functions, unlike previous methods.

Chain of Density (CoD) Prompting (Summarizing recursively and keeping equal word length)

Chain of Density Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac3413c2b2cc68da6baf5_Untitled (5).png

CoD is a big deal. We think that semantic compression is the golden goose of AI, a technique that will keep on giving.

Here’s what the prompt instructs the LLM to do:

  • Identify informative entities from the article that are missing from the previous summary.
  • Write a new summary of the same length that covers every entity and detail from the previous summary plus the missing entities.
  • Repeat these two steps five times, making each summary more concise and entity-dense than the previous one.

The purpose of CoD is to write effective summaries that capture the main points and details of a text recursively, making the output denser with every iteration. It uses fusion, compression, and removal of uninformative phrases to make space for additional entities. Of course, the output can be saved in a convenient JSON format for further processing and function calling.
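The prompt can also ask for the intermediate summaries as a JSON list, which is what makes CoD easy to wire into downstream processing. Below is a sketch with the instruction paraphrased from the paper; the key names and the article placeholder are our own choices.

# Chain of Density sketch: one prompt that asks for five increasingly dense
# summaries of equal length, returned as JSON for downstream processing.
import json  # used when parsing the model response below

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your provider's API call here.")

ARTICLE = "..."  # the text to summarize

cod_prompt = (
    f"Article: {ARTICLE}\n\n"
    "You will generate increasingly concise, entity-dense summaries of the article.\n"
    "Repeat the following 2 steps 5 times:\n"
    "Step 1: Identify 1-3 informative entities from the article that are missing from "
    "the previously generated summary.\n"
    "Step 2: Write a new, denser summary of identical length which covers every entity "
    "and detail from the previous summary plus the missing entities.\n"
    "Use fusion, compression and removal of uninformative phrases to make space.\n"
    'Answer in JSON: a list of 5 objects with keys "missing_entities" and "denser_summary".'
)

# summaries = json.loads(call_llm(cod_prompt))
# final_summary = summaries[-1]["denser_summary"]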

How To Improve Reasoning For LLMs?

Each approach has its own advantages and disadvantages. You need to tinker and work your way up until the inputs and outputs of your setup are aligned with your use case. So which technique is best? In our view, CoT, CoD, Directional Stimulus and good old RAG are the way to build chatbots. ReAct and ReWOO are more experimental and could be used for content generation.

Each of these techniques offers unique ways to enhance the performance of LLMs, making them more flexible, reliable, and context-aware. They are all based on the principle of prompt engineering, which involves the strategic use of prompts to guide the model's responses pragmatically and with minimal variability.

Start Your Project Today

If this work is of interest to you, then we’d love to talk to you. Please get in touch with our experts and we can chat about how we can help you.

Send us a message and we’ll get right back to you.
