Optimizing Prompt Engineering: ProTeGi's Textual Gradient Descent and Beam Search

LLMs: The Birth of Prompt Engineering


As natural language processing (NLP) continues to advance, Large Language Models (LLMs) have become increasingly powerful tools for a wide range of applications. However, the performance of these models heavily relies on the quality of the prompts used to guide their output. Prompt engineering, the process of designing and optimizing prompts, has emerged as a critical skill in the NLP community. But manual prompt optimization can be time-consuming and requires significant expertise.


In October 2023, a groundbreaking paper titled "Automatic Prompt Optimization with 'Gradient Descent' and Beam Search" introduced a novel method called ProTeGi that automates the prompt optimization process. ProTeGi adapts the concept of gradient descent, a fundamental optimization technique in machine learning, to work with natural language prompts. My goal is to break it down in simpler terms, but if you're interested, you can read the full paper here.


What On Earth Is Gradient Descent?


Simply put, in machine learning, gradient descent is used to minimize error by iteratively adjusting model parameters. The gradient points in the direction of steepest ascent in the loss landscape, so at each step the parameters are nudged in the opposite direction, the one that reduces the error the most.
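
To make that concrete, here is a minimal Python sketch of classic numeric gradient descent (my own illustration, not code from the paper), minimizing a one-dimensional loss:

```python
# Minimal numeric gradient descent on f(w) = (w - 3)^2.
# The gradient f'(w) = 2 * (w - 3) points uphill, so we step the
# opposite way to reduce the loss.

def loss(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)

w = 0.0              # initial parameter value
learning_rate = 0.1  # step size

for _ in range(50):
    w -= learning_rate * gradient(w)  # move against the gradient

print(round(w, 4), round(loss(w), 6))  # w approaches the minimum at 3
```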


ProTeGi applies this concept to prompts by using "textual gradients" instead of numeric ones. These textual gradients are generated by evaluating a prompt's performance on a training dataset and describing its flaws or areas for improvement in natural language.


Example of Textual Gradient Descent


If a prompt for sentiment analysis fails to correctly classify certain examples, ProTeGi might generate a textual gradient like: "The prompt struggles with negation and sarcasm, leading to misclassifications." Based on this gradient, ProTeGi would then edit the prompt to address these issues, such as by adding examples of negation and sarcasm to the prompt. This process is repeated iteratively, with each edit guided by the textual gradients, gradually refining the prompt.
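
Here is a conceptual sketch of that critique-then-edit step in Python. `call_llm` is a hypothetical placeholder for any LLM API, and the prompt templates are illustrative, not the exact ones used in the paper:

```python
def call_llm(text: str) -> str:
    """Hypothetical placeholder; wire this to a real LLM API."""
    raise NotImplementedError

def textual_gradient(prompt: str, failing_examples: list) -> str:
    """Ask the LLM to describe, in plain language, why the prompt failed."""
    critique_request = (
        f"I'm using this prompt: {prompt}\n"
        f"It misclassified these examples: {failing_examples}\n"
        "Describe the prompt's flaws in one or two sentences."
    )
    return call_llm(critique_request)

def apply_gradient(prompt: str, gradient: str) -> str:
    """Ask the LLM to rewrite the prompt so it addresses the critique."""
    edit_request = (
        f"Prompt: {prompt}\n"
        f"Feedback: {gradient}\n"
        "Rewrite the prompt so it fixes the problems in the feedback."
    )
    return call_llm(edit_request)
```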


Beam Search


To efficiently explore the space of possible prompt variations, ProTeGi employs beam search, a search algorithm that maintains multiple candidate prompts and selects the most promising ones to expand further. The search starts with an initial prompt, generates multiple candidates by editing it according to the textual gradients, and then selects the most promising candidates to carry into the next iteration.


This allows ProTeGi to consider a diverse set of prompt modifications while focusing on the most effective ones.
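
A schematic version of that search is sketched below. It assumes a hypothetical `score` function that evaluates a prompt on the training data and a hypothetical `expand` function that produces edited variants via the textual-gradient step shown earlier:

```python
def beam_search(initial_prompt, expand, score, beam_width=4, steps=3):
    """Keep the `beam_width` best prompts at each step and expand them.

    expand(prompt) -> list of edited candidate prompts (hypothetical).
    score(prompt)  -> accuracy on the training data (hypothetical).
    """
    beam = [initial_prompt]
    for _ in range(steps):
        # Expand every prompt in the beam into several edited candidates,
        # keeping the originals so a good prompt is never lost.
        candidates = list(beam)
        for prompt in beam:
            candidates.extend(expand(prompt))
        # Retain only the top-scoring candidates for the next iteration.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]  # highest-scoring prompt found
```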


Bandit Search


ProTeGi also incorporates bandit selection algorithms, which balance exploration (trying out new prompt variations) with exploitation (focusing on the best-performing ones discovered so far). This step lets ProTeGi pick the best candidate prompts without evaluating all of them on the entire training dataset, which can be expensive.


This helps the method efficiently navigate the vast search space of possible prompts without getting stuck in suboptimal regions. It also completes the cycle: the system loops back with the most promising candidates and continues refining them.
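
As a sketch of the idea, here is UCB1, one standard bandit strategy (the paper evaluates several bandit approaches; this particular snippet is my illustration). It ranks candidate prompts using a fixed budget of cheap minibatch evaluations, where `evaluate_on_minibatch` is a hypothetical helper returning an accuracy in [0, 1] for one prompt on a small random sample of training examples:

```python
import math

def ucb_select(prompts, evaluate_on_minibatch, budget=100, c=1.0):
    """Rank prompts with UCB1 using `budget` cheap minibatch evaluations."""
    pulls = [0] * len(prompts)      # evaluations spent on each prompt
    totals = [0.0] * len(prompts)   # running sum of observed accuracies

    for t in range(1, budget + 1):
        def ucb(i):
            if pulls[i] == 0:
                return float("inf")  # evaluate every prompt at least once
            mean = totals[i] / pulls[i]
            bonus = c * math.sqrt(math.log(t) / pulls[i])  # exploration bonus
            return mean + bonus

        i = max(range(len(prompts)), key=ucb)  # most promising prompt now
        totals[i] += evaluate_on_minibatch(prompts[i])
        pulls[i] += 1

    # Return prompt indices ordered by mean observed accuracy.
    return sorted(range(len(prompts)),
                  key=lambda i: totals[i] / max(pulls[i], 1),
                  reverse=True)
```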


Visual Representation of the Prompt Engineering Optimization Flow with ProTeGi


Below are two figures from the paper, showing an example of how ProTeGi works.


[Figure: prompt engineering optimization example]
[Figure: prompt engineering optimization workflow]

Summary of How ProTeGi Works


  1. Start with an initial prompt: Begin with a starting prompt for a specific natural language processing task, such as sentiment analysis or named entity recognition.

  2. Evaluate the prompt: The initial prompt is tested on a training dataset to see how well it performs. This helps identify the prompt's strengths and weaknesses.

  3. Generate textual gradients: Based on the prompt's performance, generate "textual gradients." These are essentially short, human-readable descriptions of what the prompt is doing wrong or could be doing better.

  4. Edit the prompt: Using the textual gradients as a guide, ProTeGi automatically edits the prompt to address its weaknesses. For example, if the gradient says the prompt struggles with negation, ProTeGi might add examples of negation to the prompt.

  5. Generate multiple candidate prompts: ProTeGi doesn't just create one edited prompt, but several variations. This allows it to explore different possible improvements.

  6. Select the most promising candidates: From the multiple edited prompts, ProTeGi keeps only the ones most likely to perform well; this retained set of top candidates is the "beam" in beam search.

  7. Repeat steps 2-6: ProTeGi then repeats the process of evaluating, generating gradients, editing, and selecting candidates for each of the chosen prompts. This iterative process continues, with each loop refining the prompts further.

  8. Balance exploration and exploitation: As ProTeGi searches for the best prompt, it balances trying out new edits (exploration) with focusing on the edits that have already proven effective (exploitation). This is done using bandit selection algorithms.

  9. Final output: After multiple iterations, ProTeGi outputs the prompt that performs best on the training data. This optimized prompt is ready to be used for the intended natural language processing task.
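
Pulling these pieces together, here is how the whole loop might look in code. This is a high-level sketch built from the hypothetical helpers in the earlier snippets (`textual_gradient`, `apply_gradient`, `ucb_select`), plus a hypothetical `find_failures` that returns the training examples a prompt currently misclassifies; it is not the paper's implementation:

```python
def optimize_prompt(initial_prompt, find_failures, evaluate_on_minibatch,
                    iterations=3, beam_width=4, edits_per_prompt=3):
    beam = [initial_prompt]                                # step 1
    for _ in range(iterations):                            # step 7: repeat
        candidates = list(beam)
        for prompt in beam:
            failures = find_failures(prompt)               # step 2: evaluate
            grad = textual_gradient(prompt, failures)      # step 3: critique
            for _ in range(edits_per_prompt):              # steps 4-5: several
                candidates.append(apply_gradient(prompt, grad))  # edited variants
        ranking = ucb_select(candidates, evaluate_on_minibatch)  # steps 6 + 8
        beam = [candidates[i] for i in ranking[:beam_width]]
    return beam[0]                                         # step 9: best prompt
```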


ProTeGi Prompt Optimization Results


The experimental results presented in the paper are impressive, with ProTeGi outperforming other state-of-the-art prompt optimization methods on a range of NLP tasks, including sentiment analysis, named entity recognition, and natural language inference. In some cases, ProTeGi was able to improve the initial prompt's performance by up to 31% while requiring fewer LLM API calls compared to other methods.


However, the authors note that it is possible to overfit the training data (see the figure below):

"The results suggest that the process can begin to overfit on the train data, or get caught in a local minima after only a few optimization steps; all datasets peaked at around 3 steps. There appear two further patterns in the data, with Jailbreak and Liar quickly improving and maintaining the improvements to their prompts, while Ethos and Sarcasm remain relatively stable throughout, possibly due to a better initial fit between the starting prompt and task"
prompt engineering results

Conclusion


The implications of this research are significant. By automating the prompt engineering process and making it more accessible, ProTeGi could democratize the use of LLMs and enable a wider range of practitioners to benefit from their capabilities. It also opens up new avenues for research, such as exploring different types of textual gradients or incorporating domain-specific knowledge into the optimization process.


If you're interested in leveraging LLMs for your own projects or applications, I highly recommend diving deeper into the ProTeGi paper and considering how you might apply these techniques to your own work. As always, I welcome any questions, feedback, or discussion on this exciting development in the field of prompt engineering.
