For The Love Of Data Science: 2023

Tuesday, July 4, 2023

Using Database Search and Email In LangChain Agents

In the previous post about prompt engineering for large language models (LLMs), I demonstrated that ChatGPT could navigate word math problems if prompted suitably. In this post, I explore ChatGPT's ability to reason about tabular data, including relational database tables. The source code use in this article is available here.

Pizza time! (No turtles please)

The DeepLearning.AI short course on prompt engineering culminates with the demo of a pizza OrderBot. The bot is able to answer questions about the menu, gather the order, and summarize the whole thing at the end, complete with the correct total price. The menu in this example is represented in plain text within a system role message (Figure 1).

Figure 1: System role message with menu

Figure 2 shows an example of a complete order executed by the bot.

Figure 2: Sample order

One well-known problem with LLMs, however, is their tendency to hallucinate. Asking about details not provided in the prompt causes the model to make up details based on its original training. This can be dangerous in production applications because even inaccurate responses can sound reasonable. Figure 3 shows examples of hallucinations about details not provided in the pizza menu.

Figure 3: Hallucinations

The errors in these specific examples are fixable by adding more details (diameters of pizzas, sizes of drinks) to the menu text in the prompt. Fool-proofing the menu would be an iterative process that could potentially cause regressions as menu complexity increases.

Querying structured data

Here, I explore if hallucinations are reduced when the menu is structured in SQL tables instead of plain text. I hypothesize that well-chosen table and column names would act as metadata to help the LLM understand how and whether to answer a given question.

To set up the database, dataframes were made for each category of item (pizza, topping, side item, and drink). They were first prototyped as pandas dataframes and then loaded into a SQLite database with SQLAlchemy. The table schemas consist of the item name, price, and item description. Figure 4 shows the full content of the tables.

Figure 4: Pizza menu database

ChatGPT was connected to the database via LangChain, an exciting new open-source framework for incorporating LLMs into applications. The project provides modular components for a) connecting LLMs to external data sources and b) accomplishing higher-level tasks like Q&A sessions. The components can be chained together to create arbitrarily complex applications. The short, and currently free, LangChain course from DeepLearning.AI provides an accessible and comprehensive introduction.

LangChain provides a SQLDatabaseChain component for querying data in any database compatible with SQLAlchemy. The tool implements established best practices for prompting an LLM to write SQL queries, although further customization of the prompt is possible. The minimal steps for setting up the SQLDatabaseChain include connecting to the database and initializing an LLM (Figure 5). After experimenting with my own prompt, I adopted the default prompt.

Figure 5: Using SQLDatabaseChain

I expect the SQL chain to perform at least as well as the version with the plain-text menu, so I start by presenting the three inputs that caused hallucinations in Figure 3. Impressively, the SQL chain answers all the questions correctly and without hallucinating false details. Figure 6 shows the results. Some of the output remarks (e.g., "The sizes and their corresponding dimensions are not provided in the database.") are not necessarily things that should be shown to end users, and prompting the model to use a different tone would be worthwhile.

Figure 6: Questions now answered correctly with a SQL database

Some comments are in order:

These examples demonstrate the ability of the LLM to examine the data schema and write the SQL needed to efficiently solve the problem. In each case, the SQL chain is following the process specified in the prompt: identify the question, write the SQL query, get the query output, and formulate the final answer.
In the first example (number and sizes of pizza), the LLM inferred from the Price_Large, Price_Medium, and Price_Small fields that there the three pizza sizes are Large, Medium, and Small. It did not confuse the prices as measurements of size (although, see below where this does happen), and it correctly stated that information on dimensions is not available.
In the second example (number and sizes of bottled water), the LLM found only one row and read from the description that the size is 20 ounces. The Price_Medium and Price_Small columns were NULL, but it is not clear this was used in the answer. Asking a related question on the number, size, and cost of Coke returns a detailed and correct answer, which is reassuring.
In the third case (Greek salad as a topping), the LLM joined the pizzas and toppings tables together, found no rows where 'Greek salad' was a topping, and concluded it was not possible.

As impressive as the above results are, the LLM still gets fooled by seemingly simple questions. When asked simply about the diameters of pizza size (which was answered correctly above as part of a longer question), the LLM confuses the price values for pizza dimensions. When asked a second time and reminded that the price values are dollars and not dimensions, it correctly reports that it cannot answer the question.

Figure 7: Confusing dollars with inches

As a final example, I ask about the size of a small cheese pizza topped with mushrooms and sausage. The chain manages to join the pizzas and toppings tables together in a way that returns 0 rows (Figure 8). This is disappointing since the original chatbot with the text menu could answer such a question. Non-invasive ways to fix this text-to-SQL brittleness include customizing the prompt instructions and/or including sample rows in the prompt.

Figure 8: Incorrect price quote

In this case, I tried out the SQLDatabaseSequentialChain which first examines the schema to decide which tables are relevant and then calls the SQLDatabaseChain for the relevant tables. The result is correct and is shown in Figure 9. The better outcome is not a fluke. The sequential chain consistently outperforms the simpler chain in repeat trials. It might be that the sequential chain is giving the LLM a more detailed examination of the table schemas, so that it is able to write more relevant SQL.

Figure 9: Correct price quote

Better ingredients, better PizzaBot

Using the lessons learned above, I now assemble a new pizza OrderBot more robust than the one above (Figures 1-3). In addition to working off a database representation of the pizza menu (Figure 4), I would like the bot to be able to send a summary of the order to the user's email. The bot will have to converse with the user in real-time and know when to access tools for searching the database and sending the email.

The LangChain Agent interface is able to select which, if any, of the available tools should be called based on the user input. Referencing detailed examples (especially this one) from LangChain's documentation, I was able prototype the desired functionality in my own custom agent.

The final solution consists of the following components:

A conversational agent (langchain.agents.ConversationalAgent) with a custom prompt spelling out its role as an OrderBot.
Conversation memory buffers (langchain.memory.ConversationBufferMemory, langchain.memory.ReadOnlySharedMemory)
A SQLDatabaseChain for querying the pizza database (see Figures 5 and 9 for details)
Two instances of LLMChain: one for the agent and one for producing the order summary.
A Custom email function: input is a single pipe-delimited string containing order summary and recipient email address.
An underlying LLM (gpt-3.5-turbo)

The agent needs access to tools for database search (SQLDatabaseSequentialChain), order summary (LLMChain), and sending emails (custom function). The order summary tool is just an LLMChain with a custom prompt. Given the chat history (via an instance of ReadOnlySharedMemory), it will summarize the name and price of each item, as well as the total price. The code for customizing this chain is shown in Figure 10.

Figure 10: LLMChain order summary tool

The tools are set up as a list of langchain.agent.Tool objects. Each Tool instance knows its own name (e.g., 'Menu Search'), what function it points to, and has a text description the LLM parses when deciding which tool to call. Figure 11 shows the final configuration of the tools. Note the Order Summary tool additionally has the return_direct=True parameter, which returns the output of the tool directly to the agent (otherwise the LLM would read and rephrase the order summary).

Figure 11: The OrderBot's tools

The Send Email tool (Figure 12) is the most complex. I specify in the description that the input is one pipe-delimited string containing the order summary text and the user's email. The send_message function then parses the substrings before it formats the email. Correct output from Send Email requires that Order Summary be successfully called first.

Figure 12: How emails are sent

To make the OrderBot more self-aware about its role in the pizza ordering process, the ConversationalAgent's prompt needs to be edited. The default prompt includes a prefix, a description of available tools, and a suffix. I edit the prefix to explain the OrderBot's function. The tool list is automatically populated from the existing list of tools. The suffix has placeholders for the chat history, user input, and a scratchpad for the agent to compose a response. Figure 13 shows the code for updating the prompt while Figure 14 shows the fully formatted prompt template.

Figure 13: Customizing the agent's prompt

Figure 14: Final agent prompt template

As shown in Figure 15, the final agent chain (agent_chain) comes from bringing the components together in the langchain.agents.AgentExecutor.from_agent_and_tools method. Note, custom agent (agent) gets the custom prompt (agent_prompt) by way of an LLMChain. To prevent any parsing errors from causing a problem, the agent_chain.handle_parsing_errors parameter is set to True.

Figure 15: Defining the final agent (agent_chain)

The pizza OrderBot is finally ready to use (e.g., agent_chain.run('What pizzas are available?')). The OrderBot turned out to be friendly, helpful, and addressed me by name. It could answer questions about the menu from the database exactly as shown above. Figure 16 shows an example of a correctly handled order. Figure 17 shows the email summary that was sent.

Figure 16: Sample conversation

Figure 17: Email order summary

I am largely satisfied with the prototype OrderBot. I experienced some difficulty formatting the email content. Attempts to format the order summary as an HTML table did succeed consistently. Getting the LLM to use some HTML (e.g., line breaks) would be helpful for summarizing long orders.

Summary

In this article, I have examined ChatGPT's ability to query and reason about tabular data. I found some indication that tabular data (a pizza menu) stored as plain text can inspire LLM hallucinations. Splitting the menu into multiple database tables seems to mitigate, but not eliminate, hallucinations. Finally, a prototype OrderBot capable of reading from a database and sending an email was presented. The OrderBot code is a template that can be repurposed for other custom chatbots.

Note: This article was produced with LangChain 0.0.220 and ChatGPT gpt-3.5-turbo-0301.

Sunday, June 11, 2023

An Introduction to Prompt Engineering for ChatGPT

It has been over 6 months since the public release of ChatGPT in November 2022. The service was an immediate hit. As the most sophisticated example of a large language model (LLM) to date, the Internet wasted no time in interrogating its capabilities. As LLMs mature, dependence on such AI in daily life could increase dramatically.

The interactive web app is the easiest way for users to leverage ChatGPT right now. Novel future applications will require integration into devices or services having access to cloud-hosted models through APIs.

OpenAI provides a ChatGPT API for Python and node.js, and the broader community maintains libraries for other languages. Apart from the official documentation, another great way to learn is through the short ChatGPT courses currently on offer for free by DeepLearning.AI. As a new user to the API, the most surprising takeaway from the courses was that basic Python skills combined with the OpenAI API are sufficient to build new, ChatGPT-powered systems.

This article memorializes the key lessons I learned about using OpenAI's ChatGPT API. It is intended to be a reference for myself and a tutorial for new users to the API. The resources from the short courses are available here, and the original examples produced for this article are here.

Basic setup and authentication

Getting started in Python requires two steps:

Installing the OpenAI module (pip install openai);
Acquiring an API key.

In code, the API needs to be aware of the API key. It is not recommended to hard code the key in the code itself, and such code should never be submitted to a public repository. A simple alternative is to store the API in a local file called .env and to use the dotenv module (pip install dotenv) to detect the file and load the key as an environment variable. Figure 1 provides an example of how this can work:

Figure 1

Your first API call

Everything is now in place to start talking to ChatGPT. The ChatCompletion object creates a model response given a chat conversation. The chat conversation can be a single prompt or an entire chat history. In either case, the model parses the text and returns the most likely response.

Figure 2 shows a helper function for obtaining the response for a single prompt (i.e., input message). Follow-up calls with different messages will be treated as different conversations. This is unlike the web app where history is preserved within each separate conversation.

Figure 2

The model argument configures which language model to use. The GPT-3.5 model family can generate natural language and computer code. Of these, gpt-3.5-turbo is recommended as the best performance per cost ($0.002 per 1K tokens, or approximately 750 words). The full list of available models is here.

Secondly, the temperature argument controls the randomness of the response. Allowable values are in the range [0, 2] and lower values correspond to less randomness. Even at temperature 0, however, the model is mostly but not entirely deterministic.

The ChatCompletion call returns a particular kind of object (openai.openai_object.OpenAIObject) that contains the desired response plus additional meta data. From the full output (response), we want only a small piece, response.choices[0].message["content"]. Figure 3 shows the full response and, at the very bottom, the desired relevant output.

Figure 3

The input prompt can be 'engineered' to maximize the utility of the single output message ("prompt engineering" is an emerging new skill in the application of LLMs). For example, we can tell ChatGPT to include more fields (year written, short summary) and to format the output as a JSON object. If we were interested in using this as a Python object, we could convert with Python's JSON utilities. See Figure 4 for the result.

Figure 4

This example highlights the importance of writing clear and specific instructions, which can include asking for structured output.

Multi-message chats

The ChatGPT model is stateless. Producing coherent conversations requires that at least some previous messages in the conversation be uploaded every time (increasing the number of tokens used) for the model for generate the next most likely response. The illusion of continuity is created at the expense of each API call becoming slightly more costly as a conversation unfolds.

In the above get_completion function (Figure 1), the user prompt is formatted into a list (messages) of dictionaries. Each dictionary encodes one message (content) and the source (role), which indicates the source of each piece of dialog. Role has one of three values: user (the human), assistant (the chatbot), and system (context for the assistant not meant to be visible to the user).

Figure 5 demonstrates passing multiple messages to the ChatGPT. In this case, the system role is instructing the assistant that is is to speak like Shakespeare, and that the user cannot override that style. The user then asks to hear a joke in a different style. The model apologizes that it must stick to the specified original style and then delivers a four-line joke complete with rhyme.

Figure 5

With this in mind, it is surprisingly simple to create a chatbot within a Jupyter notebook. In the example below (Figure 6), I have attempted to recreate the famous Duck season versus Rabbit season debate between Bugs Bunny and Daffy Duck. The Assistant is Daffy Duck, and the User is Bugs Bunny. Elmer Fudd is standing by with his rifle. The chat window widget is a panel widget. Source code for the chat is here.

Figure 6

Reasoning about math word problems

In this next section, let's explore whether ChatGPT can handle word problems. The answer is "yes, if prompted suitably".

I developed exercises that present options for eating out versus cooking at home. The model is supposed to evaluate each option and decide which is cheaper. In addition to choosing the right option, the model also verbalizes the cheaper option and outputs the amount saved daily and annually.

Figure 7 shows my first attempt (V1) at a prompt for the base case (eating out costs 15 dollars per meal while eating at home is 5 dollars per meal; the user always eats three times a day). ChatGPT gets the right answer (eating at home; annual savings of 365*3*($15 - $5) = $10,950). Note that it is useful in many contexts to inform the model that the user's prompt will be specially marked by delimiters such as triple backticks (```).

Figure 7

Asking the same question in a second way reveals a weakness in the prompt. In Figure 8, I present the cost of eating at home in units of cents (1/100 dollars), hoping to confuse the system. The model ends of doing the calculation in units of cents, and the output savings are quoted in cents. While this is not technically wrong, it can be considered incorrect since it is atypical to express amounts exceeding one dollar in cents.

Figure 8

Figure 9 shows the second version (V2) of the prompt in which the model is told to calculate in dollars and make the decision based on the total number of dollars spent per day. The description for the output also reminds the model that daily and annual savings need to be presented in dollars. This version of the prompt produced the expected output.

Figure 9

After just two iterations, the results were promising. However, prompts are ideally tested on a battery of examples. The two prompts were tested on the set of five examples described in Table 1. Examples 1-3 mix dollars and cents. Example 4 mixes total cost per day and the total cost per meal. Example 5 does the same while mixing dollars and cents too.

For the model to pass each example, it had to choose the right option (a or b) and calculate the total annual savings in dollars. An answer that chooses the right option while getting the units wrong is only half correct.

The total score for each prompt comes from adding the scores for the individual examples. The first prompt (V1) scored 40% while the second prompt (V2) scored 100%! Table 2 shows the detailed scoring for each test.

Let's examine in detail what V1 got wrong.

Half of Example 1 is wrong because the output savings is not in dollars.
In Example 2, the letters associated with each choice are flipped, and choice (b) is listed before choice (a). See Figure 10 for details. ChatGPT outputs the wrong letter (a) and calculates 0 zero savings. The prompt was apparently confused by the choices not following alphabetical order.
All of Example 3 is wrong because the model outputs a blob of Python code instead of numbers.
Example 4 is fully correct.
Example 5 partial credit is awarded because the units are wrong.

Figure 10

How is V2 guiding to model to get the right answer every time? The prompt is is spurring ChatGPT to write it's own robust Python code that gets the right answer every time. At the end of the code blob (Figure 11), the model is producing is constructing a dictionary in the desired output format.

Figure 11

If this prompt were included in an a user-facing app, the tests should be repeated periodically to make sure that there is no regression in performance as the prompt or underlying LLM changes. If V2 begins to fail as further examples were added, the decision to e.g., standardize the currency amounts to dollars might have to be made.

Conclusion

This article has demonstrated important strategies in designing prompts for ChatGPT LLMs.

Helpful tactics for improving prompts include providing clear and specific instructions, using delimiters (e.g., ```) to mark user input, telling the model to work through explicitly defined steps, and asking for structured output.
Prompt development is iterative and is best done in a systematic, test-driven way. Test criteria can be explicitly as shown here. It is also possible to have the LLM grade it's own response (not discussed here).
For prompts under active development, tests should be re-run periodically to prevent regressions in system performance.
Tuning applications of LLMs is very different from the process needed for other AI models.

Hopefully this article helps and inspires you to start building new applications with ChatGPT!