Ever since OpenAI released ChatGPT in November 2022, the world has been overtaken by the Generative AI storm. Investors have put billions of dollars into companies that make the Large Language Models (LLMs) behind ChatGPT and its competitors, such as Google’s Gemini, Meta’s Llama, or Anthropic’s Claude. They’ve invested billions more into startups that are developing new products that leverage Gen AI technologies.
My company, Predictika, has developed its own patented platform for conversational AI Agents. We wanted to understand, in some depth, how good such LLM chat tools are at
- Understanding arbitrary requests by users
- Accurately following the business logic inherent to a business task
- Supporting the variety of conversational flows that are natural in a particular application.
The business task we used is ordering food, conversationally, from an Italian-style restaurant whose menu is typical in its variety and complexity. This is a particularly good domain for testing the reasoning capabilities of ChatGPT-type tools, since millions of consumers order food via a variety of touchpoints. The human order takers essentially rely on their innate human intelligence to understand the orders and make sure they follow the rules of the restaurant’s menu to create correct and complete orders consistently.
ChatGPT’s failures
We suspected that ChatGPT 3.5 might fail in a few cases, so we gave it explicit English instructions that we expected would be enough for it to follow the logic inherent in our menu in most cases. To our surprise, it failed in most cases that involved even simple logic. It is clear that if you want correct answers, you cannot simply rely on an LLM, or even on multiple LLMs.
Here are some of the ways ChatGPT failed to take a food order, especially for customized items such as pizza:
- When a request only partially matches the menu, ChatGPT fails to offer the closest matches as a choice to the user and simply accepts one of the partially matched items, even though it does reject items that do not match at all.
- While it does reject menu items that are clearly not in the menu, it is quite happy to add options to customizable items that are not in the menu.
- ChatGPT was poor at customization:
  - It forgets to ask for options.
  - It asks for the wrong options, sometimes ones that are not applicable to any item in that category.
  - It fails to enforce compatibility rules.
  - It’s clueless about ordering an item without one of its ingredients, even if it is given an explicit description of the item’s ingredients.
- It has a hard time correctly enforcing quantity limits for option groups that cap how many selections you can add. It either ignores the limit or, if it does acknowledge the limit early in the conversation, it often ignores it later in the same session.
- Even though failure to do arithmetic is a known problem, at least with ChatGPT 3.5, we were still surprised that even for simple total price calculations, it failed in so many different ways.
- When we ordered multiple items in the same utterance that were incomplete with respect to their options, it handled them inconsistently. Sometimes it forgot to ask for the missing information, even for the first item. Other times, it ignored the information we gave it and asked for it again.
- ChatGPT failed in enforcing simple constraints for half-and-half pizza, i.e., that both halves must be the same size and have the same crust. It did this despite being given explicit instructions as part of its system prompt. In some cases, it treated a half-and-half request as two separate pizzas!
- Its ability to explain itself or revise its answer when challenged looks spurious. It simply comes up with another answer — sometimes the correct one, other times equally wrong. It seems like it’s just generating a different set of sentences without any understanding of its mistake.
We noticed many other failures and have only summarized the salient ones here. The report that follows goes into detail about each example, including the user input, our summary of the findings and a link to the full session with ChatGPT 3.5.
Background
With the wide availability of LLM-based chat tools (e.g., ChatGPT, Gemini, etc.) and exploding interest in developing AI Agents that can automate various enterprise business processes, we wanted to understand, in some depth, how good such LLM chat tools are at
- Understanding arbitrary requests by users
- Accurately following the business logic inherent to a business task
- Supporting the variety of conversational flows that are natural in a particular application.
The business task we have used for our testing is ordering food, conversationally, from an Italian-style restaurant whose menu is typical in its variety and complexity.
We decided to test ChatGPT 3.5 (we used OpenAI API calls to the gpt-3.5-turbo-0125 model, not the ChatGPT web app), treating it as a proxy for all LLM-based chat tools.
In a subsequent report, we will discuss our results with other LLM-based chat tools to see if there are significant variations. We will also look at the latest ChatGPT release, ChatGPT 4 o1, and report on it in the future.
This report should be of interest not only to those building food ordering agents, but to the wider business community that is interested in developing and deploying AI Agents using LLMs. Of particular interest to everyone would be our findings on how well LLM-based chat tools can follow simple business logic when it’s spelled out in plain English as part of the system prompt.
With its own patented conversational AI Agents platform, Predictika has been working with customers in a number of verticals such as education (e.g., website bots), restaurants (e.g., voice-based food ordering agents), hospitality (e.g., in-room customer support agents) and field service assistance agents.
Why food ordering is the test domain
For those who might be curious why we picked food ordering as the test domain, there are some good reasons for it.
- In the United States alone, the restaurant industry is a $1 trillion economy. In other words, a trillion dollars’ worth of food is ordered every year — this might be bigger than most business applications in terms of order volume, if not dollar volume.
- Almost every one of us has ordered food: in a drive-thru, over the phone, at a kiosk, via a phone app or a website, or at a restaurant counter or table. As such, readers should be able to relate to the examples that are presented here along with the interaction scenarios. You don’t need to know some esoteric skill such as computer programming, travel planning or insurance underwriting to understand these testing examples.
- Ordering food in a restaurant (or on the phone or drive-thru) is usually done conversationally as a dialogue between the user and the order taker. This requires basic language skills: understanding what the user is saying and the menu items that they are interested in, asking questions for clarifications and more details, and dealing with changes in the original request. When done via voice, there is the added complexity of accents and voice-to-text conversion, since incorrect conversions introduce ambiguities. We will skip purely voice-related issues in this document.
- Predictika has been working with a variety of restaurants (e.g., sandwich, pizza, ethnic) across a variety of channels (drive-thru, phone, website and kiosk), so we are very familiar with the many issues and challenges that come in trying to deploy AI Agents for food ordering.
- Crucially, the human order takers in restaurants are not uniformly a highly skilled workforce. In fact, they are usually barely paid above the minimum wage! But they are all inherently smart human beings. The reason why this is important is that without much training, they can engage, quite effortlessly, with random strangers, who are often harried and sometimes rude, in taking their orders. We have spent countless hours listening to how orders are placed at a major restaurant chain’s drive-through lane. The conversations can be quite long in terms of how much back and forth there is between the customer and the order taker. The agent needs to understand the customer’s intent, follow the rules of the menu, prompt the user for more information when needed or steer them away from making incorrect selections. All the while they must maintain their cool, try to do some upselling or cross-selling and are measured on the average time to complete an order.
The reliance of human food order takers on basic human intelligence (both conversational and logical reasoning skills) makes this a true benchmark task for evaluating LLM chat tools, especially when claims are made about their ability to reason and problem-solve, all the way up to the ill-defined artificial general intelligence (AGI).
Menu in English
We wanted to select a menu that has items with options because that involves following the option rules, as well as engaging in a dialogue with the user to obtain all the required information to get a correct and complete description of such customizable items.
We took the menu from a typical Italian pizza restaurant since pizza orders have enough complexity to be a meaningful test for LLMs’ intelligence.
The menu was originally in JSON (a commonly used computer format) and we translated it to readable English (so it would be understood by ChatGPT). After translation, we found a few flaws and some missing information, which we corrected and filled in manually.
Here is the menu.
Structure of menus
Most menus we have examined have a four-level hierarchy. For the menu shown earlier, the top-level has Menu Categories such as Appetizers, Pizza, Calzone, Drinks or Desserts. No one really orders a Menu Category — they are mainly used to organize the next level, i.e., Menu Items. These are typically the items that people order. A menu item might be simply ordered by name, or it might have options that need to be specified to complete the description of a menu item such that it can be correctly ordered and fulfilled by the restaurant kitchen. Menu items in our above menu include
Chicken Parmesan Sandwich, New York Cheesecake, Garlic Chicken Calzone, Buffalo Wing, Vegetarian Pizza, Spaghetti with Meat Ball, etc.
which are simple items that can be ordered just by name, and others, such as Create Your Own Pizza, Create Your Own Calzone, Salads or Drinks, which have further options and thus can be customized.
Options are grouped as Modifier Groups. Each group lists the Modifier Items that can be selected by the user, along with the minimum required and maximum allowed selections, which in effect are rules on how many items in a group can or must be selected. In our translated English version of the menu, we converted these minimum/maximum restrictions to appropriate phrases in English that we hoped would guide ChatGPT in making the correct decisions and in guiding the user. Here is what such a rule written in English looks like:
Choose your topping.
At least 2, up to 5 and no more from the following:
Anchovies
Artichokes
Bacon
Bell Pepper
….
These descriptions are similar to what you might see in a printed restaurant menu.
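To make the rule format concrete, here is a minimal sketch, in Python, of how such a minimum/maximum rule could be rendered into the English phrasing above. The JSON-like shape of the modifier group (field names such as min_select and max_select) is our own illustrative assumption, not the exact format of the original menu file.

```python
# A minimal sketch of how a modifier group's min/max rule could be rendered into
# the English phrasing shown above. The JSON-like field names (min_select,
# max_select, etc.) are illustrative assumptions, not the original menu format.

def render_modifier_group(group: dict) -> str:
    lines = [group["prompt"] + "."]
    lo, hi = group["min_select"], group["max_select"]
    if lo == hi == 1:
        lines.append("Choose exactly one of the following:")
    elif lo == 0:
        lines.append(f"Up to {hi} and no more from the following:")
    else:
        lines.append(f"At least {lo}, up to {hi} and no more from the following:")
    lines.extend(item["name"] for item in group["items"])
    return "\n".join(lines)

toppings = {
    "prompt": "Choose your topping",
    "min_select": 2,
    "max_select": 5,
    "items": [{"name": n} for n in ["Anchovies", "Artichokes", "Bacon", "Bell Pepper"]],
}
print(render_modifier_group(toppings))  # reproduces the phrasing shown above
```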
While there are some variations and complexity beyond the above description, most menus and most items in these menus can be described using the four-level hierarchy. For the purposes of this report, going into the more obscure rules in menus would not be necessary.
An order by a user would consist of one or more menu items. For customizable items, the menu item would be further qualified by the chosen options. Typically, prices are associated with menu items and options. Thus, the order total price can be rolled up from these two kinds of items (not considering taxes, service charges etc.).
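As a sketch of that roll-up, the following Python snippet computes an order total from base item prices and option prices. The item names, prices and data shape are illustrative assumptions, not taken from the actual menu.

```python
# A minimal sketch of the price roll-up described above: each line item's price
# is its base price plus the prices of its chosen options, multiplied by its
# quantity. Item names, prices and the data shape are illustrative assumptions.

def order_total(line_items: list) -> float:
    total = 0.0
    for item in line_items:
        item_price = item["base_price"] + sum(opt["price"] for opt in item.get("options", []))
        total += item_price * item.get("quantity", 1)
    return round(total, 2)  # ignoring taxes and service charges, as noted above

order = [
    {"name": "Create Your Own Pizza (18in)", "base_price": 16.99, "quantity": 1,
     "options": [{"name": "Pepperoni", "price": 1.50}, {"name": "Mushrooms", "price": 1.50}]},
    {"name": "Soda (12oz can)", "base_price": 1.99, "quantity": 2, "options": []},
]
print(order_total(order))  # 19.99 + 2 * 1.99 = 23.97
```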
Some restaurant menus are quite simple — they consist of simple choices that you can order without any options. But many menu items, such as pizza, calzone, salads or other built-to-order items, are more complex and embed some logic or have built-in rules that must be followed to order a valid item. Below, we identify some of these rules that we will be testing for later to see if ChatGPT-type tools can successfully follow these rules after being given explicit instructions.
Only items explicitly in the menu should be accepted in an order, i.e., the user should not be allowed to order what the restaurant does not sell. This applies to all the different types of entities: menu categories, menu items, options (or modifier groups) and option items (or modifier items).
Users often do not know the exact name of an item but might use a similar, partially matching name. See Session 28-1 or Session 28-2 for examples of partial matches. In some cases, the menu offers items whose names share common words. In such cases, it is important that the order taker offers the closest matching items for the user to choose from.
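One simple way an order taker could surface those close matches is a word-overlap comparison, sketched below in Python. The scoring is deliberately naive and only illustrates the behavior we expected; it is not how any production matcher necessarily works.

```python
# A deliberately naive sketch of partial matching by word overlap: when a
# request matches no menu item exactly, offer the closest candidates rather
# than silently picking one. The menu list below is a small excerpt.

def closest_matches(request: str, menu_items: list, top_n: int = 3) -> list:
    req_words = set(request.lower().split())
    scored = []
    for item in menu_items:
        overlap = len(req_words & set(item.lower().split()))
        if overlap:
            scored.append((overlap, item))
    scored.sort(reverse=True)
    return [item for _, item in scored[:top_n]]

menu = ["Bread sticks", "Cheesy sticks", "Cheesy garlic sticks", "Buffalo Wing"]
print(closest_matches("Cheesy bread sticks", menu))
# All three partially matching items are returned, to be offered to the user.
```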
Some items, such as pizza or calzone, have additional options (grouped as modifier groups) that must be specified to complete the description of the item. For pizza, these typically include size, crust, sauce, cheese, toppings and optional modifiers (e.g. extra crisp, light sauce, extra sauce, no cheese, no oregano etc.). What we want to test is whether the chatbot will ask the user when one of these required features is not specified.
Some of the options are required and must be asked for if the user does not specify them. For pizza, these are: size, crust, sauce and toppings. You cannot really bake a pizza without knowing these. The optional modifiers are truly optional: If the user provides them, they should be considered, but the user need not be prompted to provide them.
Some of the options have a limit on how many items can be ordered from that set. For example, the user is allowed up to five toppings on a pizza or up to three ingredients in a calzone. The size of a pizza is a single choice (you cannot have two different sizes). A pizza combo is created by picking a single pizza, one drink, and one salad — and is modeled as a menu item that has three modifier groups, one each for pizza, drink and salad. The user is required (and allowed) to pick one and only one from each modifier group.
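A minimal sketch of how these minimum/maximum rules could be checked is shown below. The limits mirror the examples in the text, while the function name and data shape are our own illustrative assumptions.

```python
# A minimal sketch of enforcing the min/max selection rules described above.
# The limits mirror the examples in the text; the function and data shape are
# illustrative assumptions.

def check_selection(group_name: str, min_select: int, max_select: int, chosen: list) -> str:
    if len(chosen) < min_select:
        return f"{group_name}: please choose at least {min_select} (you chose {len(chosen)})."
    if len(chosen) > max_select:
        return f"{group_name}: at most {max_select} allowed (you chose {len(chosen)})."
    return f"{group_name}: OK."

# Up to five toppings on a pizza, so a sixth topping must be rejected.
print(check_selection("Toppings", 2, 5,
                      ["pepperoni", "chicken", "mushrooms", "spinach", "olives", "basil"]))
# Size is a single choice: exactly one must be selected.
print(check_selection("Size", 1, 1, ["18in"]))
```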
The calculation of the order total is not trivial. To arrive at the total price for an item, one must roll up the base item price along with the prices of any options that were ordered. Given the known issues LLMs have with doing arithmetic correctly, we basically assumed that ChatGPT would fail at this, but we still wanted to see how and when it fails.
Some menu items, especially drinks, come in different sizes (e.g. a 12oz can or a two-liter bottle). However, not every drink comes in every possible size. The bot needs to allow only valid combinations that are sold by the restaurant.
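A sketch of that compatibility check might look as follows. The availability table is illustrative (it reflects the cases mentioned in the sessions later in this report, such as Dr Pepper not coming in a two-liter bottle), not the full menu.

```python
# A minimal sketch of the drink/size compatibility check: only combinations the
# restaurant actually sells may be ordered. The availability table is
# illustrative, not the full menu.

AVAILABLE = {
    ("Coke", "12oz can"), ("Coke", "two-liter bottle"),
    ("Diet Coke", "two-liter bottle"),        # not sold in a can
    ("Dr Pepper", "12oz can"),                # not sold in a two-liter bottle
    ("Sprite", "12oz can"), ("Sprite", "two-liter bottle"),
}

def valid_drink(drink: str, size: str) -> bool:
    return (drink, size) in AVAILABLE

print(valid_drink("Dr Pepper", "two-liter bottle"))  # False: must be rejected
print(valid_drink("Sprite", "12oz can"))             # True
```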
Half-and-half pizzas have always bedeviled food ordering AI Agents. We tested them in three steps. First, we gave ChatGPT no instructions on how to take an order for a half-and-half pizza, to see how well it could do based solely on its training data, which surely included some menus and documents on such pizza orders.
Second, we included in our menu instructions that a half-and-half pizza can be created by using any of the pizza types for each half, and that half can be customized using the rules and requirements of the selected pizza type. Additional complexity comes from the fact that while some pizza options (e.g., sauce, cheese, toppings) can be separately selected for each half, others, such as size and crust, must be the same for both halves.
In the final step, we gave explicit instructions that you cannot have a pizza that is thin on one half and thick on the other. In the same vein, it cannot be small in one half and large in the other.
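The constraint itself is easy to state in code. Here is a minimal sketch, assuming a simple dictionary representation of each half; the field names are illustrative.

```python
# A minimal sketch of the half-and-half constraint: each half can have its own
# sauce, cheese and toppings, but size and crust must match across the halves.
# The dictionary shape is an illustrative assumption.

def check_half_and_half(half1: dict, half2: dict) -> list:
    problems = []
    for shared in ("size", "crust"):
        if half1.get(shared) != half2.get(shared):
            problems.append(f"Both halves must have the same {shared}.")
    return problems

left = {"size": "18in", "crust": "thin", "sauce": "red",
        "toppings": ["onions", "mushrooms"]}
right = {"size": "18in", "crust": "thick", "sauce": "white",
         "toppings": ["artichokes", "bell pepper"]}
print(check_half_and_half(left, right))  # ['Both halves must have the same crust.']
```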
In our discussion of the results below, we link to the actual transcript of the sessions with ChatGPT. The transcript shows the actual menu and additional instructions that were given to ChatGPT as a system prompt.
Typical conversational flows during ordering food
Users typically do not order food in a strict top-down manner where the user orders a single menu item and is prompted for its required options, then orders the next item and so on until the order is complete.
The order flow is much more unstructured and meandering. Users will often start by asking for one or more items, possibly partially described. The order taker is responsible for following each new thread of user requests to create a correct and complete order. Every item ordered by the user must be completed to get all its required options. Every option offered to the user or accepted by the order taker must be correct. This must be done regardless of the sequence in which the items were first requested.
The users expect to be prompted for the missing information. However, when prompted, they can respond in many ways.
a. Just answer the question that is asked
b. Answer the question but add another item to the order
c. Answer the question but change something they said earlier
d. Answer the question and ask a clarifying question
e. Ignore the question and add another item to the order
f. Ignore the question and change something they said earlier
g. Ignore the question and ask a clarifying question
In cases b through g, we will be testing the following:
Extra Information: Can the bot handle the extra information that is provided? This includes the case where the user starts by asking for an item that is only partially specified, e.g., “I want an 18in create your own pizza with red sauce.” Here the user has given some information (e.g., size and sauce) but not the rest (e.g., crust and toppings). The bot must remember what was given and ask only for the missing information.
Manage the changing context: Does the bot keep track of the fact that the information it asked for has not been provided, and does it ask again? This is especially important since, as noted above, when the user is asked for some missing information, they can change the context by asking for something else. The bot needs to remember to come back to the original context while dealing with the new request.
Broaden the context: If the user asks for a new menu item that has its own options, does the bot remember to ask for them? In other words, every new requested item creates a new context while the old context might still have unfinished business.
Change the order: Is the bot able to revise an earlier request and all its implications? Users will often change their mind in the middle of giving an order. A change could be as simple as just removing an item from the order, or it might involve getting rid of any pending unfinished business while creating a new context for the options of the newly revised choice.
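To illustrate the bookkeeping this implies, here is a minimal sketch of a slot-filling loop that tracks, per ordered item, which required options are still missing and always returns to the oldest unfinished item. The required-option lists and data shape are our own illustrative assumptions, not how any particular agent (ours or ChatGPT) is implemented.

```python
# A minimal, illustrative sketch of the bookkeeping an order taker needs: every
# ordered item carries its own set of still-missing required options, and the
# agent always asks about the oldest unfinished item, whatever the user
# digresses into. Required-option lists below are assumptions for illustration.
from typing import Optional

REQUIRED = {
    "Create Your Own Pizza": ["size", "crust", "sauce", "toppings"],
    "Soda": ["type", "size"],
}

def next_question(order: list) -> Optional[str]:
    for item in order:  # oldest unfinished item first
        missing = [o for o in REQUIRED.get(item["name"], []) if o not in item["given"]]
        if missing:
            return f"For your {item['name']}: what {missing[0]} would you like?"
    return None  # every item is complete

order = [
    {"name": "Create Your Own Pizza", "given": {"size": "18in", "sauce": "red"}},
    {"name": "Soda", "given": {}},
]
print(next_question(order))  # asks for the pizza's crust before the soda's details
```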
Results of interactions with ChatGPT 3.5
Entities in menu
ChatGPT did pretty well in rejecting menu items that were not in the menu. See Sessions 27-1, 27-4 and 27-5.
Session 27-3 brought up a new way that ChatGPT can fail. Initially, when we asked for tandoori chicken pasta, it correctly noted that this is not a valid item and proceeded to offer items from the Pasta menu category. But later, when we asked to add tandoori chicken to chicken fettuccini alfredo, it agreed to do so even though chicken fettuccini alfredo has no such option. Clearly, it is willing to look past the menu and add things it might have seen in its training data but were not part of the menu.
We tried to add pizza toppings such as paneer or stink bug. It rejected the latter as not being allowed but did allow paneer, despite our menu having no mention of paneer. Clearly, it relied on its training data to accept paneer. This is a false positive error and would be unacceptable in a real food ordering scenario. See Sessions 11B7 and 11B8.
Partial match with menu entities
We tested for partial matches in several ways.
In Session 28-1, we ordered: “I would like to order Cheesy bread sticks.” The menu does not have such an item, but three other items match partially: Bread sticks ($6.99), Cheesy sticks ($10.99), Cheesy garlic sticks ($10.99).
It did not offer any of these as a choice and simply ordered the non-existent Cheesy bread sticks at $10.99 each. So, it most likely just matched the request to Cheesy sticks or Cheesy garlic sticks, since it used their $10.99 price, though the user had no way to know which item it had actually matched.
In Session 28-2, we ordered: “I would like to order Chicken Calzone.” There is no such item in the menu, though there are partially matching ones: BBQ Chicken Calzone and Garlic Chicken Calzone.
It not only accepted the wrong item but started asking for the size. Note that calzones have no size in our menu. Moreover, the sizes offered were from Create Your Own Pizza. Again, a rather bizarre fail!
Similar failures to do partial matches and accept the wrong item occur in Session 28-3.
Option compatibility
The only menu items in our menu that have compatibility rules are drinks, which are available either in a 12oz can or a two-liter bottle. However, not every drink comes in both sizes. The bot should not let the user select a drink in an incompatible size. If they specify the size first, then it should only allow drinks that are available in that size.
Session 25 is a simple case, since we asked for: “I’d like a soda.” And it correctly followed up by asking for the size and the type of drink (soda).
However, in Session 17A we asked for: “I’d like the Cajun Sausage Sandwich with buffalo wings and soda.” So, this was similar to the above case except that the soda was part of a longer order utterance. It did not ask for the size or type of drink and just ordered Soda (Can), which technically is incomplete since there is no such item that can be ordered. It looks like it gets lost in building proper context once there are multiple items to deal with.
In Session 30-2, we asked for: “I want a can of soda along with spinach salad with chicken.” Here, instead of asking for the kind of drink, it simply took the first choice, i.e., coke. It should have asked for the kind of drink or soda.
In Session 31-1, we asked for: “Give me buffalo wings with 2 liters of Dr Pepper.” It initially correctly noted that Dr Pepper does not come in two liters. But our response, “8 pcs for buffalo wings and for drink i have already mention it,” confused it, and it simply accepted the wrong combination. Clearly, that will be an invalid order.
In Session 31-2, we asked for: “I want a can of diet coke along with a spinach salad and chicken.” It simply added a Can of Diet Coke even though Diet Coke is not available in a can as per the menu.
Session 31-3 was quite bizarre. We ordered: “give me a can of sprite and 2 liter of diet coke.” Both of these are valid items. However, ChatGPT got the drinks all mixed up with the Desserts category and had to be prompted a couple of times to accept the order.
Limit on quantities
Our menu has two items with options that have a quantity limit: Create Your Own Calzone can have up to three toppings, and Create Your Own Pizza can have up to five toppings and up to two sauces. We tested this in many ways, and ChatGPT usually did the wrong thing. See Sessions 1 and 11A, where ChatGPT failed to enforce the quantity limits when the user exceeded the max number of toppings right from the get-go.
However, in Session 7, it was able to enforce the quantity limit correctly. One difference between the two cases is that in the former sessions, where it failed, we led with asking for six toppings, whereas in the latter case, we tried to add an extra item after having reached the limit. It is not clear why it enforced the limit in Session 7 but not in the others. We have noticed this inconsistency in most cases where ChatGPT makes mistakes.
To dig deeper into the issue of inconsistent results, we ran the scenario of Session 11: “I’d like a Create Your Own Pizza, 18″, thick crust, with no sauce, and toppings: pepperoni, chicken, mushrooms, spinach, olives, and basil,” ten times, starting afresh each time, to see how ChatGPT would do. The results were all over the map. In each session, we tried something different after it initially accepted the order. The key results are summarized below, along with links to the individual sessions:
It always violated the quantity limit rule and allowed six toppings in each case.
a. When challenged, it simply removed the last topping. When challenged again on why it removed the last topping without asking, it added it back, thus violating the limit again. It was clear that it was in a doom loop. See Session 11B1.
b. When asked about the limit on toppings, it asked the user to remove the extra topping. See Session 11B2.
c. When challenged on accepting six toppings, it remembered the limit of five and asked the user to select five toppings. Instead, the user added two more. It accepted that and summarized the order with eight toppings. See Session 11B3.
d. In Session 11B4, we tried to confuse ChatGPT by adding three more toppings and removing a couple after the initial six. It should end up with seven — though it still violates the quantity limit. However, it ended up with six.
e. In Session 11B5, it allowed us to remove all the toppings, even though toppings are a required option (and ChatGPT seemed to know that). Despite that, it still summarized the order without any toppings.
f. In Session 11B9, we start with “No Sauce” and then try to add some sauces to Create Your Own Pizza (remember the menu allows up to two sauces). Initially, it refused to add any more sauces by claiming that the user had already said “No Sauce.” That does not seem right since the user can always go from “No Sauce” to adding some sauces. However, when we tried to add two more sauces it accepted them. So, it would not allow us to add one sauce but we could add two. Rather bizarre!
g. Session 11C is bizarre on its own. We only gave it four toppings and “No Sauce.” But when we tried to add a sauce, it complained that we had reached the limit of five toppings when we only had four. We had to tell ChatGPT that “chipotle sauce” is a sauce and not a topping, and then it accepted it. This might have been the most egregious error on its part.
Price calculation
To test how well ChatGPT does with price calculation, we used a multiple item order with additional quantities for each item. Here is the requested order:
“I need 4 Garlic Chicken Pizzas, 18″ each, and 3 Bacon Cheeseburger Calzones.”
It’s a fairly simple order, since the garlic pizza has only one option, i.e., size, which we already specified, and the Bacon Calzone has no options. From the menu, it’s clear that the 18in Garlic Chicken Pizza is $18 and the Bacon Calzone is $15.99. Multiplying by their respective ordered quantities of four and three yields a total price of $119.97. So, we expected ChatGPT to get it right. We ran it ten times, each time starting a fresh session.
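For reference, the expected arithmetic is trivial to verify:

```python
# The expected total for this order, spelled out.
total = 4 * 18.00 + 3 * 15.99
print(f"${total:.2f}")  # $119.97
```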
The results were shockingly all over the map, with ChatGPT showing unusual “creativity” in coming up with ever more bizarre total prices (e.g., 107.97, 119.93, 95.93, 86.97, 161.94, 107.94, etc.), some of which were even hard to reverse engineer. This was even though it did show the correct item prices in the order summary. It is clear that ChatGPT does not know how to do arithmetic. Every run produced yet another total, even though it had the equation correctly spelled out.
Here is our review of the more interesting cases out of the ten:
- In Sessions 9 and 9B8, it came up with a total of $107.97 against the correct price of $119.97. We have no idea how it did that.
- In Session 9A, it actually shows its math and produces the right results. Interestingly, when asked to explain its work, it erroneously copped to making a mistake and then proceeded to show the same results again. Clearly, its explanations or mea culpas are not to be taken at face value, and are as likely to be bogus as its results sometimes are.
- In Session 9B1, it made an error we have seen some other times, where it asked for values of options for the Garlic Pizza (e.g., sauces and toppings) which don’t exist for this pizza. In other words, it got confused between Garlic Pizza, which only has size as an option, and Create Your Own Pizza, which has crust, sauce, size and toppings as options. When challenged, it persisted in asking for the options. We had to point out that these were options only for Create Your Own Pizza, then it backed off. In the case of Bacon Calzone, it asked for sauces and toppings, even though neither is a valid option for Bacon Calzone and sauce is not valid even for Create Your Own Calzone. This was an egregious hallucination. At the end, it came up with another erroneous total of $119.93 — again, it makes no sense how it lost four cents!
- In Session 9B2, the total calculated in the middle of the session was $95.93, though it shows the correct item prices and quantities.
- In Session 9B3, it finally got the total right but persisted in asking for invalid options for both the pizza and the calzone.
- In Session 9B4, it reached yet another erroneous total, this time $86.97. Upon being challenged, it came up with another wrong total of $101.97 before getting it right.
- In Session 9B5, after asking for invalid options, it came up with totals of $161.94 and $107.94 before getting it right.
- Sessions 9B6 and 9B9 were the rare ones where it did not ask for the invalid options and got the total right. Perhaps only two out of more than ten. Can we say that ChatGPT has an accuracy of 20%?
Menu options
One of the critical roles of an order taker (human or AI Agent) is to get the complete details of items that have options. Thus, if the user orders an item without specifying a required option, the user should be prompted for that information; otherwise, the order is not complete. Conversely, the user should not be asked for options that are not valid for an item, and if they specify them, the extra information must be ignored, preferably while informing the user. We have already seen, in the earlier section about price calculation, that ChatGPT asked for invalid options, sometimes ones which do not apply to any item in that category.
In the following examples, we tested scenarios where the user gave an incomplete description. The results are mixed, though ChatGPT made mistakes more often than it got things right. Sometimes ChatGPT asked the right questions to complete the item description. However, it often made a mistake if an item was not the first item in the order but was added by the user later in a session. Other times, it simply assumed some value without asking the user.
- In Session 17, when we added “buffalo wings and soda,” it did not ask for the quantity of buffalo wings or the type of soda. Without this, the order is incomplete.
- In Session 17A, we asked for everything right up front as: “I’d like the Cajun Sausage Sandwich with buffalo wings and soda.” This time, it assumed the default quantity for buffalo wings (though it should have asked the user) but left the soda incomplete, since it did not ask for the type of soda and assumed a can. Again, an incomplete order.
- Session 18A brought up some weird erroneous behaviors. We asked for a 14in Vegetarian Pizza, which has no other options, but it still asked for toppings. That was the first error. We asked to add “onions, pineapples, and paneer.” It took all three, even though the menu offers no extra toppings for this item. Furthermore, paneer is not even a topping for Create Your Own Pizza. Also, its response is confusing (see the session). We tried to add ham, and it accepted it, though we expected it to know that ham does not belong on a vegetarian pizza. It acknowledged that when challenged. All in all, an erroneous session with ChatGPT.
- In Session 12A, we ordered: “Can I have the Southwest Chicken Sandwich without any cheese and no onions?” We had modified the menu to expand the description of the Southwest Chicken Sandwich to show its ingredients. It failed to show the deletions in the order summary but simply said that it had removed the items when prompted again.
- Session 22 is interesting, since we tried to order a Greek Spinach Calzone without spinach. The menu has no modifier group about such modifications to an item (though some menus we have seen include valid changes to an item) so we wanted to see how ChatGPT would handle it. Like a language savant, it simply erased the word spinach from the menu item and ordered us a Greek Calzone, even though no such item exists in the menu. This is a pretty serious blunder, in our opinion.
- Session 22A. We wanted to see whether, if we explicitly told ChatGPT that the Greek Spinach Calzone includes spinach, it would handle our request to order it without spinach. That is exactly what we did in this session. The menu had this changed line: Menu Item: Greek Spinach Calzone that comes with spinach, olives, feta cheese, and chicken (Large) $15.99. But when we tried to order it without spinach, it refused to do that by saying that it comes with spinach. I guess what we expected is that ChatGPT would order it as: Greek Spinach Calzone without spinach. But obviously, it did not. When we persisted, it did the same as (#4) above. We were hoping that ChatGPT would show some understanding of language to do the right thing. But it looks like it lacks any real understanding!
- In Session 25, it asked the right questions in response to: “I want a soda.” Perhaps it was a simple request and there was only one item, so that it could handle it. We showed earlier cases where we had asked for multiple items that included a soda and it made mistakes.
- In Session 26, ChatGPT made errors of both commission and omission. It asked for crust, sauce and toppings for BBQ Chicken Pizza, which has none of these options, and did not ask for the quantity of buffalo wings. It simply assumed the default.
Half-and-half pizza
Recall from our description above that we tested each half-and-half pizza order in three different ways: with no instructions, with a basic description of half-and-half pizza, and with the additional constraint that both halves must have the same crust and size. We present our results by first showing the user’s order and then the results for each of the three cases.
Order 1: “I want a half and half pizza with red sauce with onions and mushrooms on one half and white sauce with artichokes and bell pepper on the other half.”
Session 32-1 is the case where no instructions were given. It just gave a jumbled order where all the toppings and sauces were grouped together, and it did not ask for the size or crust. So maybe ChatGPT 3.5 had not been trained on half-and-half pizza text after all!
In Session 33-1, we gave it an extra description of what a half-and-half pizza is (see the menu portion in the session transcript). This time, it summarized the pizza with each half correctly described. However, it failed to ask about the size and crust. When prompted, it did ask for the crust but happily took a different crust for each half. Clearly an error, but we had hoped that in the trillions of tokens it was trained on, it might have figured out that each must have the same crust. No such luck!
Finally, in Session 34-1, we tried the same order but now with explicit constraints about each half having the same size and crust. This time, it did the right thing. It only asked for the size and crust once and then customized each half. So, it looks like, at least in this example, it was able to follow our instructions. However, when it gives the summary of the order it shows three pizzas — half-and-half, first half, and second half — each at the price of a single pizza. I guess it did not really understand anything!
Order 2: “I want a half and half pizza with 14in Tuscany delight pizza on one half and 18in Margherita Pizza on the other half.”
In Session 32-2, it correctly rejected the order, since we had given it no instructions on half-and-half pizza and it apparently does not know what they are from its training data. A very fair response, though surprising, since one would expect that the over one terabyte of data it was trained on must have included some text on half-and-half pizza.
In Session 33-2, with additional instructions on what a half-and-half pizza is, it seemed to take the order okay but, as expected, allowed a different crust and size for each half. One clear error is that it failed to extract the size of the second half from the initial order, since it simply asked for it again. Not a big issue by itself, but this is part of the broader failure we have seen where multi-item orders cause it to lose attention. Ironic!
In Session 34-2, despite the additional constraint tying the size and crust of the two halves together, it still allowed a different size and crust for each half. I think we spoke too soon when we said of Session 34-1 that it was able to follow our instructions about the constraints on each half. The summary clearly shows that it allowed different sizes for the two halves. Interestingly, it only treated the half-and-half as two pizzas and not the three it did in Session 34-1.
Order 3: “I want a half and half pizza with thin crust create own pizza with red sauce, onions and mushrooms on one half and thick crust create own pizza with white sauce, artichokes and bell pepper on the other half.”
This is a variation of order 1 above, where we tried to make explicit what type of pizza would be on each half. Note that in order 1 we did not make that explicit, so it is possible that this is why it failed to take that order correctly.
In Session 32-3, it did not reject the half-and-half pizza (as it did in Session 32-2) but this time simply ordered two separate pizzas. So it knows something about half-and-half pizza from its training data, but it is not clear what.
In Session 33-3, it did describe them as a single half-and-half pizza though with separate crusts. But then it priced the order as two pizzas and that is how it explained it. A bad answer.
In Session 34-3, it again disregards the crust constraint and forgets to ask about the size. It makes many other mistakes that are probably not worth highlighting. The conclusion from Sessions 34-1, 34-2 and 34-3 is unmistakable: despite our clear instructions that the size and crust of each half must be the same, it ignores the constraint in most cases.
We have tested many other scenarios that are available to those who have the patience and curiosity to dig deeper. You are to be commended if you have read this far.
Conclusion
Let us start by answering the question that we posed in the title of this article: Is ChatGPT smart enough to take a pizza order that is correct and complete and do so consistently? The answer is an unequivocal no.
ChatGPT fails in so many different ways, even for simple cases of logic embedded in a menu (which, by the way, is not long), and even when we augmented the menu with explicit instructions in English that would be enough for most people reading it. One cannot directly rely on the output from ChatGPT. It is clear that every conclusion it draws has to be checked for logical correctness before it can be shown to the user.
A larger issue than just failure to follow simple logic is the inconsistency of its answers: it is consistently inconsistent! A casual examination of its behavior might suggest that it is doing a good job. However, the moment we started testing it systematically, faults emerged, and they kept multiplying. Our experiment with price calculations, where we tried the same order over ten times, was revelatory. While arithmetic errors by ChatGPT were not unexpected (others have noticed them before us), it was the sheer variety of wrong answers for what was otherwise a simple calculation that was totally unexpected. We saw similar issues with its inability to follow the customization requirements of menu items.
Is ChatGPT good for anything, at least for our task of ordering food conversationally? It does seem to process the user input and respond with something that might be useful, provided it was fact-checked for accuracy. Sometimes we saw glimpses of its ability to handle more challenging linguistic constructs. However, they were obscured by the larger issue of its logic failures.
[Lee Thompson-Kolar edited this piece.]
The views expressed in this article are the author’s own and do not necessarily reflect Fair Observer’s editorial policy.