Files
prompt-engineering-for-deve…/docs/content/C2 Building Systems with the ChatGPT API/9.评估(上) Evaluation-part1.ipynb
nowadays0421 81097456f5 Finish Systems
2023-07-23 20:14:15 +08:00

1664 lines
60 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"id": "aa3de8c6",
"metadata": {
"height": 30
},
"source": [
"# 第九章 评估(上)——存在一个简单的正确答案时\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "c768620b",
"metadata": {},
"source": [
"在之前的章节中,我们展示了如何使用 LLM 构建应用程序,包括评估输入、处理输入以及在向用户显示输出之前进行最终输出检查。\n",
"\n",
"构建这样的系统后,如何知道它的工作情况?甚至在部署后并让用户使用它时,如何跟踪它的运行情况,发现任何缺陷,并持续改进系统的答案质量?\n",
"\n",
"在本章中,我们想与您分享一些最佳实践,用于评估 LLM 的输出。\n",
"\n",
"构建基于 LLM 的应用程序与传统的监督学习应用程序有所不同。由于可以快速构建基于 LLM 的应用程序,因此评估方法通常不从测试集开始。相反,通常会逐渐建立一组测试示例。\n",
"\n",
"在传统的监督学习环境中,需要收集训练集、开发集或保留交叉验证集,然后在整个开发过程中使用它们。\n",
"\n",
"然而,如果能够在几分钟内指定 Prompt并在几个小时内得到相应结果那么暂停很长时间去收集一千个测试样本将是一件极其痛苦的事情。因为现在可以在零个训练样本的情况下获得这个成果。\n",
"\n",
"因此,在使用 LLM 构建应用程序时,您将体会到如下的过程:\n",
"\n",
"首先,您会在只有一到三个样本的小样本中调整 Prompt并尝试让 Prompt 在它们身上起作用。\n",
"\n",
"然后当系统进行进一步的测试时您可能会遇到一些棘手的例子。Prompt 在它们身上不起作用,或者算法在它们身上不起作用。\n",
"\n",
"这就是使用 ChatGPT API 构建应用程序的开发者所经历的挑战。\n",
"\n",
"在这种情况下,您可以将这些额外的几个示例添加到您正在测试的集合中,以机会主义地添加其他棘手的示例。\n",
"\n",
"最终,您已经添加了足够的这些示例到您缓慢增长的开发集中,以至于通过手动运行每个示例来测试 Prompt 变得有些不方便。\n",
"\n",
"然后,您开始开发在这些小示例集上用于衡量性能的指标,例如平均准确性。\n",
"\n",
"这个过程的一个有趣方面是,如果您觉得您的系统已经足够好了,您可以随时停在那里,不再改进它。事实上,许多已部署的应用程序停在第一或第二个步骤,并且运行得非常好。\n",
"\n",
"需要注意的是,有很多大模型的应用程序没有实质性的风险,即使它没有给出完全正确的答案。\n",
"\n",
"但是,对于部分高风险应用,如果存在偏见或不适当的输出可能对某人造成伤害,那么收集测试集、严格评估系统的性能、确保在使用之前它能够做正确的事情,就变得更加重要。\n",
"\n",
"但是,如果您只是使用它来总结文章供自己阅读,而不是给别人看,那么可能造成的危害风险更小,您可以在这个过程中早早停止,而不必去花费更大的代价去收集更大的数据集。"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6f4062ea",
"metadata": {
"height": 47
},
"outputs": [
{
"data": {
"text/plain": [
"{'电脑和笔记本': ['TechPro 超极本',\n",
" 'BlueWave 游戏本',\n",
" 'PowerLite Convertible',\n",
" 'TechPro Desktop',\n",
" 'BlueWave Chromebook'],\n",
" '智能手机和配件': ['SmartX ProPhone'],\n",
" '专业手机': ['MobiTech PowerCase',\n",
" 'SmartX MiniPhone',\n",
" 'MobiTech Wireless Charger',\n",
" 'SmartX EarBuds'],\n",
" '电视和家庭影院系统': ['CineView 4K TV',\n",
" 'SoundMax Home Theater',\n",
" 'CineView 8K TV',\n",
" 'SoundMax Soundbar',\n",
" 'CineView OLED TV'],\n",
" '游戏机和配件': ['GameSphere X',\n",
" 'ProGamer Controller',\n",
" 'GameSphere Y',\n",
" 'ProGamer Racing Wheel',\n",
" 'GameSphere VR Headset'],\n",
" '音频设备': ['AudioPhonic Noise-Canceling Headphones',\n",
" 'WaveSound Bluetooth Speaker',\n",
" 'AudioPhonic True Wireless Earbuds',\n",
" 'WaveSound Soundbar',\n",
" 'AudioPhonic Turntable'],\n",
" '相机和摄像机': ['FotoSnap DSLR Camera',\n",
" 'ActionCam 4K',\n",
" 'FotoSnap Mirrorless Camera',\n",
" 'ZoomMaster Camcorder',\n",
" 'FotoSnap Instant Camera']}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import utils_zh\n",
"\n",
"products_and_category = utils_zh.get_products_and_category()\n",
"products_and_category"
]
},
{
"cell_type": "markdown",
"id": "d91f5384",
"metadata": {
"height": 30
},
"source": [
"## 一、找出相关产品和类别名称"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ac683bfb",
"metadata": {},
"outputs": [],
"source": [
"def find_category_and_product_v1(user_input,products_and_category):\n",
" \"\"\"\n",
" 从用户输入中获取到产品和类别\n",
"\n",
" 参数:\n",
" user_input用户的查询\n",
" products_and_category产品类型和对应产品的字典\n",
" \"\"\"\n",
" \n",
" delimiter = \"####\"\n",
" system_message = f\"\"\"\n",
" 您将提供客户服务查询。\\\n",
" 客户服务查询将用{delimiter}字符分隔。\n",
" 输出一个 Python 列表,列表中的每个对象都是 Json 对象,每个对象的格式如下:\n",
" '类别': <电脑和笔记本, 智能手机和配件, 电视和家庭影院系统, \\\n",
" 游戏机和配件, 音频设备, 相机和摄像机中的一个>,\n",
" 以及\n",
" '名称': <必须在下面允许的产品中找到的产品列表>\n",
" \n",
" 其中类别和产品必须在客户服务查询中找到。\n",
" 如果提到了一个产品,它必须与下面允许的产品列表中的正确类别关联。\n",
" 如果没有找到产品或类别,输出一个空列表。\n",
" \n",
" 根据产品名称和产品类别与客户服务查询的相关性,列出所有相关的产品。\n",
" 不要从产品的名称中假设任何特性或属性,如相对质量或价格。\n",
" \n",
" 允许的产品以 JSON 格式提供。\n",
" 每个项目的键代表类别。\n",
" 每个项目的值是该类别中的产品列表。\n",
" 允许的产品:{products_and_category}\n",
" \n",
" \"\"\"\n",
" \n",
" few_shot_user_1 = \"\"\"我想要最贵的电脑。\"\"\"\n",
" few_shot_assistant_1 = \"\"\" \n",
" [{'category': '电脑和笔记本', \\\n",
"'products': ['TechPro 超极本', 'BlueWave 游戏本', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n",
" \"\"\"\n",
" \n",
" messages = [ \n",
" {'role':'system', 'content': system_message}, \n",
" {'role':'user', 'content': f\"{delimiter}{few_shot_user_1}{delimiter}\"}, \n",
" {'role':'assistant', 'content': few_shot_assistant_1 },\n",
" {'role':'user', 'content': f\"{delimiter}{user_input}{delimiter}\"}, \n",
" ] \n",
" return get_completion_from_messages(messages)"
]
},
{
"cell_type": "markdown",
"id": "aca82030",
"metadata": {
"height": 30
},
"source": [
"## 二、在一些查询上进行评估"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "cacb96b2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': '电视和家庭影院系统', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n"
]
}
],
"source": [
"# 第一个评估的查询\n",
"customer_msg_0 = f\"\"\"如果我预算有限,我可以买哪款电视?\"\"\"\n",
"\n",
"products_by_category_0 = find_category_and_product_v1(customer_msg_0,\n",
" products_and_category)\n",
"print(products_by_category_0)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "04364405",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': '智能手机和配件', 'products': ['MobiTech PowerCase', 'SmartX MiniPhone', 'MobiTech Wireless Charger', 'SmartX EarBuds']}]\n"
]
}
],
"source": [
"customer_msg_1 = f\"\"\"我需要一个智能手机的充电器\"\"\"\n",
"\n",
"products_by_category_1 = find_category_and_product_v1(customer_msg_1,\n",
" products_and_category)\n",
"print(products_by_category_1)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "66e9ecd0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" \\n [{'category': '电脑和笔记本', 'products': ['TechPro 超极本', 'BlueWave 游戏本', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"customer_msg_2 = f\"\"\"\n",
"你们有哪些电脑?\"\"\"\n",
"\n",
"products_by_category_2 = find_category_and_product_v1(customer_msg_2,\n",
" products_and_category)\n",
"products_by_category_2"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "112cfd5f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': '智能手机和配件', 'products': ['SmartX ProPhone']}, {'category': '相机和摄像机', 'products': ['FotoSnap DSLR Camera']}]\n",
" \n",
" [{'category': '电视和家庭影院系统', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n"
]
}
],
"source": [
"customer_msg_3 = f\"\"\"\n",
"告诉我关于smartx pro手机和fotosnap相机的信息那款DSLR的。\n",
"我预算有限,你们有哪些性价比高的电视推荐?\"\"\"\n",
"\n",
"products_by_category_3 = find_category_and_product_v1(customer_msg_3,\n",
" products_and_category)\n",
"print(products_by_category_3)"
]
},
{
"cell_type": "markdown",
"id": "d58f15be",
"metadata": {},
"source": [
"它看起来像是输出了正确的数据,但没有按照要求的格式输出。这使得将其解析为 Python 字典列表更加困难。"
]
},
{
"cell_type": "markdown",
"id": "ff2af235",
"metadata": {
"height": 30
},
"source": [
"## 三、更难的测试用例\n",
"\n",
"找出一些在实际使用中,模型表现不如预期的查询。"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "5b11172f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': '电视和家庭影院系统', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n",
" [{'category': '游戏机和配件', 'products': ['GameSphere X', 'ProGamer Controller', 'GameSphere Y', 'ProGamer Racing Wheel', 'GameSphere VR Headset']}]\n",
" [{'category': '电脑和笔记本', 'products': ['TechPro 超极本', 'BlueWave 游戏本', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n"
]
}
],
"source": [
"customer_msg_4 = f\"\"\"\n",
"告诉我关于CineView电视的信息那款8K的还有Gamesphere游戏机X款的。\n",
"我预算有限,你们有哪些电脑?\"\"\"\n",
"\n",
"products_by_category_4 = find_category_and_product_v1(customer_msg_4,products_and_category)\n",
"print(products_by_category_4)"
]
},
{
"cell_type": "markdown",
"id": "92b63d8b",
"metadata": {
"height": 30
},
"source": [
"## 四、修改指令以处理难测试用例"
]
},
{
"cell_type": "markdown",
"id": "ddcee6a5",
"metadata": {},
"source": [
"我们在提示中添加了以下内容,不要输出任何不在 JSON 格式中的附加文本,并添加了第二个示例,使用用户和助手消息进行 few-shot 提示。"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "d3b183bf",
"metadata": {},
"outputs": [],
"source": [
"def find_category_and_product_v2(user_input,products_and_category):\n",
" \"\"\"\n",
" 从用户输入中获取到产品和类别\n",
"\n",
" 添加:不要输出任何不符合 JSON 格式的额外文本。\n",
" 添加了第二个示例(用于 few-shot 提示),用户询问最便宜的计算机。\n",
" 在这两个 few-shot 示例中,显示的响应只是 JSON 格式的完整产品列表。\n",
"\n",
" 参数:\n",
" user_input用户的查询\n",
" products_and_category产品类型和对应产品的字典 \n",
" \"\"\"\n",
" delimiter = \"####\"\n",
" system_message = f\"\"\"\n",
" 您将提供客户服务查询。\\\n",
" 客户服务查询将用{delimiter}字符分隔。\n",
" 输出一个 Python列表列表中的每个对象都是 JSON 对象,每个对象的格式如下:\n",
" '类别': <电脑和笔记本, 智能手机和配件, 电视和家庭影院系统, \\\n",
" 游戏机和配件, 音频设备, 相机和摄像机中的一个>,\n",
" 以及\n",
" '名称': <必须在下面允许的产品中找到的产品列表>\n",
" 不要输出任何不是 JSON 格式的额外文本。\n",
" 输出请求的 JSON 后,不要写任何解释性的文本。\n",
" \n",
" 其中类别和产品必须在客户服务查询中找到。\n",
" 如果提到了一个产品,它必须与下面允许的产品列表中的正确类别关联。\n",
" 如果没有找到产品或类别,输出一个空列表。\n",
" \n",
" 根据产品名称和产品类别与客户服务查询的相关性,列出所有相关的产品。\n",
" 不要从产品的名称中假设任何特性或属性,如相对质量或价格。\n",
" \n",
" 允许的产品以 JSON 格式提供。\n",
" 每个项目的键代表类别。\n",
" 每个项目的值是该类别中的产品列表。\n",
" 允许的产品:{products_and_category}\n",
" \n",
" \"\"\"\n",
" \n",
" few_shot_user_1 = \"\"\"我想要最贵的电脑。你推荐哪款?\"\"\"\n",
" few_shot_assistant_1 = \"\"\" \n",
" [{'category': '电脑和笔记本', \\\n",
"'products': ['TechPro 超极本', 'BlueWave 游戏本', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n",
" \"\"\"\n",
" \n",
" few_shot_user_2 = \"\"\"我想要最便宜的电脑。你推荐哪款?\"\"\"\n",
" few_shot_assistant_2 = \"\"\" \n",
" [{'category': '电脑和笔记本', \\\n",
"'products': ['TechPro 超极本', 'BlueWave 游戏本', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n",
" \"\"\"\n",
" \n",
" messages = [ \n",
" {'role':'system', 'content': system_message}, \n",
" {'role':'user', 'content': f\"{delimiter}{few_shot_user_1}{delimiter}\"}, \n",
" {'role':'assistant', 'content': few_shot_assistant_1 },\n",
" {'role':'user', 'content': f\"{delimiter}{few_shot_user_2}{delimiter}\"}, \n",
" {'role':'assistant', 'content': few_shot_assistant_2 },\n",
" {'role':'user', 'content': f\"{delimiter}{user_input}{delimiter}\"}, \n",
" ] \n",
" return get_completion_from_messages(messages)"
]
},
{
"cell_type": "markdown",
"id": "83e8ab86",
"metadata": {
"height": 30
},
"source": [
"## 五、在难测试用例上评估修改后的指令"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "4a547b34",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': '智能手机和配件', 'products': ['SmartX ProPhone']}, {'category': '相机和摄像机', 'products': ['FotoSnap DSLR Camera', 'ActionCam 4K', 'FotoSnap Mirrorless Camera', 'ZoomMaster Camcorder', 'FotoSnap Instant Camera']}, {'category': '电视和家庭影院系统', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n",
" \n"
]
}
],
"source": [
"customer_msg_3 = f\"\"\"\n",
"告诉我关于smartx pro手机和fotosnap相机的信息那款DSLR的。\n",
"另外,你们有哪些电视?\"\"\"\n",
"\n",
"products_by_category_3 = find_category_and_product_v2(customer_msg_3,\n",
" products_and_category)\n",
"print(products_by_category_3)"
]
},
{
"cell_type": "markdown",
"id": "22a0a17b",
"metadata": {
"height": 30
},
"source": [
"## 六、回归测试:验证模型在以前的测试用例上仍然有效\n",
"\n",
"检查并修复模型以提高难以测试的用例效果,同时确保此修正不会对先前的测试用例性能造成负面影响。"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "b5ba773b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': '电视和家庭影院系统', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n"
]
}
],
"source": [
"customer_msg_0 = f\"\"\"如果我预算有限,我可以买哪款电视?\"\"\"\n",
"\n",
"products_by_category_0 = find_category_and_product_v2(customer_msg_0,\n",
" products_and_category)\n",
"print(products_by_category_0)"
]
},
{
"cell_type": "markdown",
"id": "4440ce1f",
"metadata": {
"height": 30
},
"source": [
"## 七、收集开发集进行自动化测试"
]
},
{
"cell_type": "markdown",
"id": "2af63218",
"metadata": {},
"source": [
"当您要调整的开发集不仅仅是一小部分示例时,开始自动化测试过程就变得有用了。"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "8a0b751f",
"metadata": {
"height": 207
},
"outputs": [],
"source": [
"msg_ideal_pairs_set = [\n",
" \n",
" # eg 0\n",
" {'customer_msg':\"\"\"如果我预算有限,我可以买哪种电视?\"\"\",\n",
" 'ideal_answer':{\n",
" '电视和家庭影院系统':set(\n",
" ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']\n",
" )}\n",
" },\n",
"\n",
" # eg 1\n",
" {'customer_msg':\"\"\"我需要一个智能手机的充电器\"\"\",\n",
" 'ideal_answer':{\n",
" '智能手机和配件':set(\n",
" ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']\n",
" )}\n",
" },\n",
" # eg 2\n",
" {'customer_msg':f\"\"\"你有什么样的电脑\"\"\",\n",
" 'ideal_answer':{\n",
" '电脑和笔记本':set(\n",
" ['TechPro 超极本', 'BlueWave 游戏本', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'\n",
" ])\n",
" }\n",
" },\n",
"\n",
" # eg 3\n",
" {'customer_msg':f\"\"\"告诉我关于smartx pro手机和fotosnap相机的信息那款DSLR的。\\\n",
"另外,你们有哪些电视?\"\"\",\n",
" 'ideal_answer':{\n",
" '智能手机和配件':set(\n",
" ['SmartX ProPhone']),\n",
" '相机和摄像机':set(\n",
" ['FotoSnap DSLR Camera']),\n",
" '电视和家庭影院系统':set(\n",
" ['CineView 4K TV', 'SoundMax Home Theater','CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV'])\n",
" }\n",
" }, \n",
" \n",
" # eg 4\n",
" {'customer_msg':\"\"\"告诉我关于CineView电视那款8K电视、\\\n",
" Gamesphere游戏机和X游戏机的信息。我的预算有限你们有哪些电脑\"\"\",\n",
" 'ideal_answer':{\n",
" '电视和家庭影院系统':set(\n",
" ['CineView 8K TV']),\n",
" '游戏机和配件':set(\n",
" ['GameSphere X']),\n",
" '电脑和笔记本':set(\n",
" ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'])\n",
" }\n",
" },\n",
" \n",
" # eg 5\n",
" {'customer_msg':f\"\"\"你们有哪些智能手机\"\"\",\n",
" 'ideal_answer':{\n",
" '智能手机和配件':set(\n",
" ['SmartX ProPhone', 'MobiTech PowerCase', 'SmartX MiniPhone', 'MobiTech Wireless Charger', 'SmartX EarBuds'\n",
" ])\n",
" }\n",
" },\n",
" # eg 6\n",
" {'customer_msg':f\"\"\"我预算有限。你能向我推荐一些智能手机吗?\"\"\",\n",
" 'ideal_answer':{\n",
" '智能手机和配件':set(\n",
" ['SmartX EarBuds', 'SmartX MiniPhone', 'MobiTech PowerCase', 'SmartX ProPhone', 'MobiTech Wireless Charger']\n",
" )}\n",
" },\n",
"\n",
" # eg 7 # this will output a subset of the ideal answer\n",
" {'customer_msg':f\"\"\"有哪些游戏机适合我喜欢赛车游戏的朋友?\"\"\",\n",
" 'ideal_answer':{\n",
" '游戏机和配件':set([\n",
" 'GameSphere X',\n",
" 'ProGamer Controller',\n",
" 'GameSphere Y',\n",
" 'ProGamer Racing Wheel',\n",
" 'GameSphere VR Headset'\n",
" ])}\n",
" },\n",
" # eg 8\n",
" {'customer_msg':f\"\"\"送给我摄像师朋友什么礼物合适?\"\"\",\n",
" 'ideal_answer': {\n",
" '相机和摄像机':set([\n",
" 'FotoSnap DSLR Camera', 'ActionCam 4K', 'FotoSnap Mirrorless Camera', 'ZoomMaster Camcorder', 'FotoSnap Instant Camera'\n",
" ])}\n",
" },\n",
" \n",
" # eg 9\n",
" {'customer_msg':f\"\"\"我想要一台热水浴缸时光机\"\"\",\n",
" 'ideal_answer': []\n",
" }\n",
" \n",
"]\n"
]
},
{
"cell_type": "markdown",
"id": "6e0f1db4",
"metadata": {
"height": 30
},
"source": [
"## 八、通过与理想答案比较来评估测试用例"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "d9530285",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"def eval_response_with_ideal(response,\n",
" ideal,\n",
" debug=False):\n",
" \"\"\"\n",
" 评估回复是否与理想答案匹配\n",
" \n",
" 参数:\n",
" response: 回复的内容\n",
" ideal: 理想的答案\n",
" debug: 是否打印调试信息\n",
" \"\"\"\n",
" if debug:\n",
" print(\"回复:\")\n",
" print(response)\n",
" \n",
" # json.loads() 只能解析双引号,因此此处将单引号替换为双引号\n",
" json_like_str = response.replace(\"'\",'\"')\n",
" \n",
" # 解析为一系列的字典\n",
" l_of_d = json.loads(json_like_str)\n",
" \n",
" # 当响应为空,即没有找到任何商品时\n",
" if l_of_d == [] and ideal == []:\n",
" return 1\n",
" \n",
" # 另外一种异常情况是,标准答案数量与回复答案数量不匹配\n",
" elif l_of_d == [] or ideal == []:\n",
" return 0\n",
" \n",
" # 统计正确答案数量\n",
" correct = 0 \n",
" \n",
" if debug:\n",
" print(\"l_of_d is\")\n",
" print(l_of_d)\n",
"\n",
" # 对每一个问答对 \n",
" for d in l_of_d:\n",
"\n",
" # 获取产品和目录\n",
" cat = d.get('category')\n",
" prod_l = d.get('products')\n",
" # 有获取到产品和目录\n",
" if cat and prod_l:\n",
" # convert list to set for comparison\n",
" prod_set = set(prod_l)\n",
" # get ideal set of products\n",
" ideal_cat = ideal.get(cat)\n",
" if ideal_cat:\n",
" prod_set_ideal = set(ideal.get(cat))\n",
" else:\n",
" if debug:\n",
" print(f\"没有在标准答案中找到目录 {cat}\")\n",
" print(f\"标准答案: {ideal}\")\n",
" continue\n",
" \n",
" if debug:\n",
" print(\"产品集合:\\n\",prod_set)\n",
" print()\n",
" print(\"标准答案的产品集合:\\n\",prod_set_ideal)\n",
"\n",
" # 查找到的产品集合和标准的产品集合一致\n",
" if prod_set == prod_set_ideal:\n",
" if debug:\n",
" print(\"正确\")\n",
" correct +=1\n",
" else:\n",
" print(\"错误\")\n",
" print(f\"产品集合: {prod_set}\")\n",
" print(f\"标准的产品集合: {prod_set_ideal}\")\n",
" if prod_set <= prod_set_ideal:\n",
" print(\"回答是标准答案的一个子集\")\n",
" elif prod_set >= prod_set_ideal:\n",
" print(\"回答是标准答案的一个超集\")\n",
"\n",
" # 计算正确答案数\n",
" pc_correct = correct / len(l_of_d)\n",
" \n",
" return pc_correct"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "e06d9fe3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"用户提问: 有哪些游戏机适合我喜欢赛车游戏的朋友?\n",
"标准答案: {'游戏机和配件': {'ProGamer Racing Wheel', 'ProGamer Controller', 'GameSphere Y', 'GameSphere VR Headset', 'GameSphere X'}}\n"
]
}
],
"source": [
"print(f'用户提问: {msg_ideal_pairs_set[7][\"customer_msg\"]}')\n",
"print(f'标准答案: {msg_ideal_pairs_set[7][\"ideal_answer\"]}')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "2ff332b4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"回答: \n",
" [{'category': '游戏机和配件', 'products': ['GameSphere X', 'ProGamer Controller', 'GameSphere Y', 'ProGamer Racing Wheel', 'GameSphere VR Headset']}]\n",
" \n"
]
},
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response = find_category_and_product_v2(msg_ideal_pairs_set[7][\"customer_msg\"],\n",
" products_and_category)\n",
"print(f'回答: {response}')\n",
"\n",
"eval_response_with_ideal(response,\n",
" msg_ideal_pairs_set[7][\"ideal_answer\"])"
]
},
{
"cell_type": "markdown",
"id": "d1313b17",
"metadata": {
"height": 30
},
"source": [
"## 九、在所有测试用例上运行评估,并计算正确的用例比例\n",
"\n",
"注意:如果任何 API 调用超时,将无法运行"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "d39407c0",
"metadata": {
"height": 30
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"示例 0\n",
"0: 1.0\n",
"示例 1\n",
"错误\n",
"产品集合: {'SmartX ProPhone'}\n",
"标准的产品集合: {'MobiTech Wireless Charger', 'SmartX EarBuds', 'MobiTech PowerCase'}\n",
"1: 0.0\n",
"示例 2\n",
"2: 1.0\n",
"示例 3\n",
"3: 1.0\n",
"示例 4\n",
"错误\n",
"产品集合: {'SoundMax Home Theater', 'CineView 8K TV', 'CineView 4K TV', 'CineView OLED TV', 'SoundMax Soundbar'}\n",
"标准的产品集合: {'CineView 8K TV'}\n",
"回答是标准答案的一个超集\n",
"错误\n",
"产品集合: {'ProGamer Racing Wheel', 'ProGamer Controller', 'GameSphere Y', 'GameSphere VR Headset', 'GameSphere X'}\n",
"标准的产品集合: {'GameSphere X'}\n",
"回答是标准答案的一个超集\n",
"错误\n",
"产品集合: {'TechPro 超极本', 'TechPro Desktop', 'BlueWave Chromebook', 'PowerLite Convertible', 'BlueWave 游戏本'}\n",
"标准的产品集合: {'TechPro Desktop', 'BlueWave Chromebook', 'TechPro Ultrabook', 'PowerLite Convertible', 'BlueWave Gaming Laptop'}\n",
"4: 0.0\n",
"示例 5\n",
"错误\n",
"产品集合: {'SmartX ProPhone'}\n",
"标准的产品集合: {'MobiTech Wireless Charger', 'SmartX EarBuds', 'SmartX MiniPhone', 'SmartX ProPhone', 'MobiTech PowerCase'}\n",
"回答是标准答案的一个子集\n",
"5: 0.0\n",
"示例 6\n",
"错误\n",
"产品集合: {'SmartX ProPhone'}\n",
"标准的产品集合: {'MobiTech Wireless Charger', 'SmartX EarBuds', 'SmartX MiniPhone', 'SmartX ProPhone', 'MobiTech PowerCase'}\n",
"回答是标准答案的一个子集\n",
"6: 0.0\n",
"示例 7\n",
"7: 1.0\n",
"示例 8\n",
"8: 1.0\n",
"示例 9\n",
"9: 1\n",
"正确比例为 10: 0.6\n"
]
}
],
"source": [
"import time\n",
"\n",
"score_accum = 0\n",
"for i, pair in enumerate(msg_ideal_pairs_set):\n",
" time.sleep(20)\n",
" print(f\"示例 {i}\")\n",
" \n",
" customer_msg = pair['customer_msg']\n",
" ideal = pair['ideal_answer']\n",
" \n",
" # print(\"Customer message\",customer_msg)\n",
" # print(\"ideal:\",ideal)\n",
" response = find_category_and_product_v2(customer_msg,\n",
" products_and_category)\n",
"\n",
" \n",
" # print(\"products_by_category\",products_by_category)\n",
" score = eval_response_with_ideal(response,ideal,debug=False)\n",
" print(f\"{i}: {score}\")\n",
" score_accum += score\n",
" \n",
"\n",
"n_examples = len(msg_ideal_pairs_set)\n",
"fraction_correct = score_accum / n_examples\n",
"print(f\"正确比例为 {n_examples}: {fraction_correct}\")"
]
},
{
"cell_type": "markdown",
"id": "5d885db6",
"metadata": {},
"source": [
"使用 Prompt 构建应用程序的工作流程与使用监督学习构建应用程序的工作流程非常不同。\n",
"\n",
"因此,我们认为这是需要记住的一件好事,当您正在构建监督学习模型时,会感觉到迭代速度快了很多。\n",
"\n",
"如果您并未亲身体验,可能会惊叹于仅有手动构建的极少样本,就可以产生高效的评估方法。您可能会认为,仅有 10 个样本是不具备统计意义的。但当您真正运用这种方式时,您可能会对向开发集中添加一些复杂样本所带来的效果提升感到惊讶。\n",
"\n",
"这对于帮助您和您的团队找到有效的 Prompt 和有效的系统非常有帮助。\n",
"\n",
"在本章中,输出可以被定量评估,就像有一个期望的输出一样,您可以判断它是否给出了这个期望的输出。\n",
"\n",
"在下一章中,我们将探讨如何在更加模糊的情况下评估我们的输出。即正确答案可能不那么明确的情况。"
]
},
{
"cell_type": "markdown",
"id": "7637f00a",
"metadata": {},
"source": [
"## 十、英文版"
]
},
{
"cell_type": "markdown",
"id": "9a238fac",
"metadata": {},
"source": [
"**1. 找出产品和类别名称**"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "86965e88",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Computers and Laptops': ['TechPro Ultrabook',\n",
" 'BlueWave Gaming Laptop',\n",
" 'PowerLite Convertible',\n",
" 'TechPro Desktop',\n",
" 'BlueWave Chromebook'],\n",
" 'Smartphones and Accessories': ['SmartX ProPhone',\n",
" 'MobiTech PowerCase',\n",
" 'SmartX MiniPhone',\n",
" 'MobiTech Wireless Charger',\n",
" 'SmartX EarBuds'],\n",
" 'Televisions and Home Theater Systems': ['CineView 4K TV',\n",
" 'SoundMax Home Theater',\n",
" 'CineView 8K TV',\n",
" 'SoundMax Soundbar',\n",
" 'CineView OLED TV'],\n",
" 'Gaming Consoles and Accessories': ['GameSphere X',\n",
" 'ProGamer Controller',\n",
" 'GameSphere Y',\n",
" 'ProGamer Racing Wheel',\n",
" 'GameSphere VR Headset'],\n",
" 'Audio Equipment': ['AudioPhonic Noise-Canceling Headphones',\n",
" 'WaveSound Bluetooth Speaker',\n",
" 'AudioPhonic True Wireless Earbuds',\n",
" 'WaveSound Soundbar',\n",
" 'AudioPhonic Turntable'],\n",
" 'Cameras and Camcorders': ['FotoSnap DSLR Camera',\n",
" 'ActionCam 4K',\n",
" 'FotoSnap Mirrorless Camera',\n",
" 'ZoomMaster Camcorder',\n",
" 'FotoSnap Instant Camera']}"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import utils_en\n",
"\n",
"products_and_category = utils_en.get_products_and_category()\n",
"products_and_category"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "34ebb288",
"metadata": {},
"outputs": [],
"source": [
"def find_category_and_product_v1(user_input, products_and_category):\n",
" \"\"\"\n",
" 从用户输入中获取到产品和类别\n",
"\n",
" 参数:\n",
" user_input用户的查询\n",
" products_and_category产品类型和对应产品的字典\n",
" \"\"\"\n",
"\n",
" # 分隔符\n",
" delimiter = \"####\"\n",
" # 定义的系统信息,陈述了需要 GPT 完成的工作\n",
" system_message = f\"\"\"\n",
" You will be provided with customer service queries. \\\n",
" The customer service query will be delimited with {delimiter} characters.\n",
" Output a Python list of json objects, where each object has the following format:\n",
" 'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \\\n",
" Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,\n",
" AND\n",
" 'products': <a list of products that must be found in the allowed products below>\n",
"\n",
"\n",
" Where the categories and products must be found in the customer service query.\n",
" If a product is mentioned, it must be associated with the correct category in the allowed products list below.\n",
" If no products or categories are found, output an empty list.\n",
" \n",
"\n",
" List out all products that are relevant to the customer service query based on how closely it relates\n",
" to the product name and product category.\n",
" Do not assume, from the name of the product, any features or attributes such as relative quality or price.\n",
"\n",
" The allowed products are provided in JSON format.\n",
" The keys of each item represent the category.\n",
" The values of each item is a list of products that are within that category.\n",
" Allowed products: {products_and_category}\n",
" \n",
"\n",
" \"\"\"\n",
" # 给出几个示例\n",
" few_shot_user_1 = \"\"\"I want the most expensive computer.\"\"\"\n",
" few_shot_assistant_1 = \"\"\" \n",
" [{'category': 'Computers and Laptops', \\\n",
"'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n",
" \"\"\"\n",
" \n",
" messages = [ \n",
" {'role':'system', 'content': system_message}, \n",
" {'role':'user', 'content': f\"{delimiter}{few_shot_user_1}{delimiter}\"}, \n",
" {'role':'assistant', 'content': few_shot_assistant_1 },\n",
" {'role':'user', 'content': f\"{delimiter}{user_input}{delimiter}\"}, \n",
" ] \n",
" return get_completion_from_messages(messages)\n"
]
},
{
"cell_type": "markdown",
"id": "acb7d1a0",
"metadata": {},
"source": [
"**2. 在一些查询上进行评估**"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "b0d7c381",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n"
]
}
],
"source": [
"# 第一个评估的查询\n",
"customer_msg_0 = f\"\"\"Which TV can I buy if I'm on a budget?\"\"\"\n",
"\n",
"products_by_category_0 = find_category_and_product_v1(customer_msg_0,\n",
" products_and_category)\n",
"print(products_by_category_0)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "562693ab",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': 'Smartphones and Accessories', 'products': ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']}]\n",
" \n"
]
}
],
"source": [
"# 第二个评估的查询\n",
"customer_msg_1 = f\"\"\"I need a charger for my smartphone\"\"\"\n",
"\n",
"products_by_category_1 = find_category_and_product_v1(customer_msg_1,\n",
" products_and_category)\n",
"print(products_by_category_1)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "64b81024",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" \\n [{'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\""
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 第三个评估查询\n",
"customer_msg_2 = f\"\"\"\n",
"What computers do you have?\"\"\"\n",
"\n",
"products_by_category_2 = find_category_and_product_v1(customer_msg_2,\n",
" products_and_category)\n",
"products_by_category_2"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "4d5dcf7d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}, {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n"
]
}
],
"source": [
"# 第四个查询,更复杂\n",
"customer_msg_3 = f\"\"\"\n",
"tell me about the smartx pro phone and the fotosnap camera, the dslr one.\n",
"Also, what TVs do you have?\"\"\"\n",
"\n",
"products_by_category_3 = find_category_and_product_v1(customer_msg_3,\n",
" products_and_category)\n",
"print(products_by_category_3)"
]
},
{
"cell_type": "markdown",
"id": "c5d18197",
"metadata": {},
"source": [
"**3. 更难的测试用例**"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "a3d42e0e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 8K TV']}, {'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X']}, {'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n"
]
}
],
"source": [
"customer_msg_4 = f\"\"\"\n",
"tell me about the CineView TV, the 8K one, Gamesphere console, the X one.\n",
"I'm on a budget, what computers do you have?\"\"\"\n",
"\n",
"products_by_category_4 = find_category_and_product_v1(customer_msg_4,\n",
" products_and_category)\n",
"print(products_by_category_4)"
]
},
{
"cell_type": "markdown",
"id": "302afee1",
"metadata": {},
"source": [
"**4. 修改指令**"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "33b649e4",
"metadata": {},
"outputs": [],
"source": [
"def find_category_and_product_v2(user_input, products_and_category):\n",
" \"\"\"\n",
" 从用户输入中获取到产品和类别\n",
" 添加:不要输出任何不符合 JSON 格式的额外文本。\n",
" 添加了第二个示例(用于 few-shot 提示),用户询问最便宜的计算机。\n",
" 在这两个 few-shot 示例中,显示的响应只是 JSON 格式的完整产品列表。\n",
"\n",
" 参数:\n",
" user_input用户的查询\n",
" products_and_category产品类型和对应产品的字典\n",
" \"\"\"\n",
" delimiter = \"####\"\n",
" system_message = f\"\"\"\n",
" You will be provided with customer service queries. \\\n",
" The customer service query will be delimited with {delimiter} characters.\n",
" Output a Python list of JSON objects, where each object has the following format:\n",
" 'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \\\n",
" Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,\n",
" AND\n",
" 'products': <a list of products that must be found in the allowed products below>\n",
" Do not output any additional text that is not in JSON format.\n",
" Do not write any explanatory text after outputting the requested JSON.\n",
"\n",
"\n",
" Where the categories and products must be found in the customer service query.\n",
" If a product is mentioned, it must be associated with the correct category in the allowed products list below.\n",
" If no products or categories are found, output an empty list.\n",
" \n",
"\n",
" List out all products that are relevant to the customer service query based on how closely it relates\n",
" to the product name and product category.\n",
" Do not assume, from the name of the product, any features or attributes such as relative quality or price.\n",
"\n",
" The allowed products are provided in JSON format.\n",
" The keys of each item represent the category.\n",
" The values of each item is a list of products that are within that category.\n",
" Allowed products: {products_and_category}\n",
" \n",
"\n",
" \"\"\"\n",
" \n",
" few_shot_user_1 = \"\"\"I want the most expensive computer. What do you recommend?\"\"\"\n",
" few_shot_assistant_1 = \"\"\" \n",
" [{'category': 'Computers and Laptops', \\\n",
"'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n",
" \"\"\"\n",
" \n",
" few_shot_user_2 = \"\"\"I want the most cheapest computer. What do you recommend?\"\"\"\n",
" few_shot_assistant_2 = \"\"\" \n",
" [{'category': 'Computers and Laptops', \\\n",
"'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n",
" \"\"\"\n",
" \n",
" messages = [ \n",
" {'role':'system', 'content': system_message}, \n",
" {'role':'user', 'content': f\"{delimiter}{few_shot_user_1}{delimiter}\"}, \n",
" {'role':'assistant', 'content': few_shot_assistant_1 },\n",
" {'role':'user', 'content': f\"{delimiter}{few_shot_user_2}{delimiter}\"}, \n",
" {'role':'assistant', 'content': few_shot_assistant_2 },\n",
" {'role':'user', 'content': f\"{delimiter}{user_input}{delimiter}\"}, \n",
" ] \n",
" return get_completion_from_messages(messages)\n"
]
},
{
"cell_type": "markdown",
"id": "273db18f",
"metadata": {},
"source": [
"**5. 进一步评估**"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "f319f19f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}, {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n"
]
}
],
"source": [
"customer_msg_3 = f\"\"\"\n",
"tell me about the smartx pro phone and the fotosnap camera, the dslr one.\n",
"Also, what TVs do you have?\"\"\"\n",
"\n",
"products_by_category_3 = find_category_and_product_v2(customer_msg_3,\n",
" products_and_category)\n",
"print(products_by_category_3)"
]
},
{
"cell_type": "markdown",
"id": "19430988",
"metadata": {},
"source": [
"**6. 回归测试**"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "1202f122",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
" [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n",
" \n"
]
}
],
"source": [
"customer_msg_0 = f\"\"\"Which TV can I buy if I'm on a budget?\"\"\"\n",
"\n",
"products_by_category_0 = find_category_and_product_v2(customer_msg_0,\n",
" products_and_category)\n",
"print(products_by_category_0)"
]
},
{
"cell_type": "markdown",
"id": "9c594222",
"metadata": {},
"source": [
"**7. 自动化测试**"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "24bd6ab4",
"metadata": {},
"outputs": [],
"source": [
"msg_ideal_pairs_set = [\n",
" \n",
" # eg 0\n",
" {'customer_msg':\"\"\"Which TV can I buy if I'm on a budget?\"\"\",\n",
" 'ideal_answer':{\n",
" 'Televisions and Home Theater Systems':set(\n",
" ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']\n",
" )}\n",
" },\n",
"\n",
" # eg 1\n",
" {'customer_msg':\"\"\"I need a charger for my smartphone\"\"\",\n",
" 'ideal_answer':{\n",
" 'Smartphones and Accessories':set(\n",
" ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']\n",
" )}\n",
" },\n",
" # eg 2\n",
" {'customer_msg':f\"\"\"What computers do you have?\"\"\",\n",
" 'ideal_answer':{\n",
" 'Computers and Laptops':set(\n",
" ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'\n",
" ])\n",
" }\n",
" },\n",
"\n",
" # eg 3\n",
" {'customer_msg':f\"\"\"tell me about the smartx pro phone and \\\n",
" the fotosnap camera, the dslr one.\\\n",
" Also, what TVs do you have?\"\"\",\n",
" 'ideal_answer':{\n",
" 'Smartphones and Accessories':set(\n",
" ['SmartX ProPhone']),\n",
" 'Cameras and Camcorders':set(\n",
" ['FotoSnap DSLR Camera']),\n",
" 'Televisions and Home Theater Systems':set(\n",
" ['CineView 4K TV', 'SoundMax Home Theater','CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV'])\n",
" }\n",
" }, \n",
" \n",
" # eg 4\n",
" {'customer_msg':\"\"\"tell me about the CineView TV, the 8K one, Gamesphere console, the X one.\n",
"I'm on a budget, what computers do you have?\"\"\",\n",
" 'ideal_answer':{\n",
" 'Televisions and Home Theater Systems':set(\n",
" ['CineView 8K TV']),\n",
" 'Gaming Consoles and Accessories':set(\n",
" ['GameSphere X']),\n",
" 'Computers and Laptops':set(\n",
" ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'])\n",
" }\n",
" },\n",
" \n",
" # eg 5\n",
" {'customer_msg':f\"\"\"What smartphones do you have?\"\"\",\n",
" 'ideal_answer':{\n",
" 'Smartphones and Accessories':set(\n",
" ['SmartX ProPhone', 'MobiTech PowerCase', 'SmartX MiniPhone', 'MobiTech Wireless Charger', 'SmartX EarBuds'\n",
" ])\n",
" }\n",
" },\n",
" # eg 6\n",
" {'customer_msg':f\"\"\"I'm on a budget. Can you recommend some smartphones to me?\"\"\",\n",
" 'ideal_answer':{\n",
" 'Smartphones and Accessories':set(\n",
" ['SmartX EarBuds', 'SmartX MiniPhone', 'MobiTech PowerCase', 'SmartX ProPhone', 'MobiTech Wireless Charger']\n",
" )}\n",
" },\n",
"\n",
" # eg 7 # this will output a subset of the ideal answer\n",
" {'customer_msg':f\"\"\"What Gaming consoles would be good for my friend who is into racing games?\"\"\",\n",
" 'ideal_answer':{\n",
" 'Gaming Consoles and Accessories':set([\n",
" 'GameSphere X',\n",
" 'ProGamer Controller',\n",
" 'GameSphere Y',\n",
" 'ProGamer Racing Wheel',\n",
" 'GameSphere VR Headset'\n",
" ])}\n",
" },\n",
" # eg 8\n",
" {'customer_msg':f\"\"\"What could be a good present for my videographer friend?\"\"\",\n",
" 'ideal_answer': {\n",
" 'Cameras and Camcorders':set([\n",
" 'FotoSnap DSLR Camera', 'ActionCam 4K', 'FotoSnap Mirrorless Camera', 'ZoomMaster Camcorder', 'FotoSnap Instant Camera'\n",
" ])}\n",
" },\n",
" \n",
" # eg 9\n",
" {'customer_msg':f\"\"\"I would like a hot tub time machine.\"\"\",\n",
" 'ideal_answer': []\n",
" }\n",
" \n",
"]\n"
]
},
{
"cell_type": "markdown",
"id": "5e8fb19d",
"metadata": {},
"source": [
"**8. 与理想答案对比**"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "08432b46",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"def eval_response_with_ideal(response,\n",
" ideal,\n",
" debug=False):\n",
" \"\"\"\n",
" 评估回复是否与理想答案匹配\n",
" \n",
" 参数:\n",
" response: 回复的内容\n",
" ideal: 理想的答案\n",
" debug: 是否打印调试信息\n",
" \"\"\"\n",
" if debug:\n",
" print(\"回复:\")\n",
" print(response)\n",
" \n",
" # json.loads() 只能解析双引号,因此此处将单引号替换为双引号\n",
" json_like_str = response.replace(\"'\",'\"')\n",
" \n",
" # 解析为一系列的字典\n",
" l_of_d = json.loads(json_like_str)\n",
" \n",
" # 当响应为空,即没有找到任何商品时\n",
" if l_of_d == [] and ideal == []:\n",
" return 1\n",
" \n",
" # 另外一种异常情况是,标准答案数量与回复答案数量不匹配\n",
" elif l_of_d == [] or ideal == []:\n",
" return 0\n",
" \n",
" # 统计正确答案数量\n",
" correct = 0 \n",
" \n",
" if debug:\n",
" print(\"l_of_d is\")\n",
" print(l_of_d)\n",
"\n",
" # 对每一个问答对 \n",
" for d in l_of_d:\n",
"\n",
" # 获取产品和目录\n",
" cat = d.get('category')\n",
" prod_l = d.get('products')\n",
" # 有获取到产品和目录\n",
" if cat and prod_l:\n",
" # convert list to set for comparison\n",
" prod_set = set(prod_l)\n",
" # get ideal set of products\n",
" ideal_cat = ideal.get(cat)\n",
" if ideal_cat:\n",
" prod_set_ideal = set(ideal.get(cat))\n",
" else:\n",
" if debug:\n",
" print(f\"没有在标准答案中找到目录 {cat}\")\n",
" print(f\"标准答案: {ideal}\")\n",
" continue\n",
" \n",
" if debug:\n",
" print(\"产品集合:\\n\",prod_set)\n",
" print()\n",
" print(\"标准答案的产品集合:\\n\",prod_set_ideal)\n",
"\n",
" # 查找到的产品集合和标准的产品集合一致\n",
" if prod_set == prod_set_ideal:\n",
" if debug:\n",
" print(\"正确\")\n",
" correct +=1\n",
" else:\n",
" print(\"错误\")\n",
" print(f\"产品集合: {prod_set}\")\n",
" print(f\"标准的产品集合: {prod_set_ideal}\")\n",
" if prod_set <= prod_set_ideal:\n",
" print(\"回答是标准答案的一个子集\")\n",
" elif prod_set >= prod_set_ideal:\n",
" print(\"回答是标准答案的一个超集\")\n",
"\n",
" # 计算正确答案数\n",
" pc_correct = correct / len(l_of_d)\n",
" \n",
" return pc_correct"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "c3a632fc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"用户提问: What Gaming consoles would be good for my friend who is into racing games?\n",
"标准答案: {'Gaming Consoles and Accessories': {'ProGamer Racing Wheel', 'ProGamer Controller', 'GameSphere Y', 'GameSphere VR Headset', 'GameSphere X'}}\n"
]
}
],
"source": [
"print(f'用户提问: {msg_ideal_pairs_set[7][\"customer_msg\"]}')\n",
"print(f'标准答案: {msg_ideal_pairs_set[7][\"ideal_answer\"]}')"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "0df36913",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"回答: \n",
" [{'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X', 'ProGamer Controller', 'GameSphere Y', 'ProGamer Racing Wheel', 'GameSphere VR Headset']}]\n",
" \n"
]
},
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response = find_category_and_product_v2(msg_ideal_pairs_set[7][\"customer_msg\"],\n",
" products_and_category)\n",
"print(f'回答: {response}')\n",
"\n",
"eval_response_with_ideal(response,\n",
" msg_ideal_pairs_set[7][\"ideal_answer\"])"
]
},
{
"cell_type": "markdown",
"id": "94c0ffd4",
"metadata": {},
"source": [
"**9. 计算正确比例**"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "c87f9cfb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"示例 0\n",
"0: 1.0\n",
"示例 1\n",
"错误\n",
"产品集合: {'MobiTech Wireless Charger', 'SmartX EarBuds', 'SmartX MiniPhone', 'SmartX ProPhone', 'MobiTech PowerCase'}\n",
"标准的产品集合: {'MobiTech Wireless Charger', 'SmartX EarBuds', 'MobiTech PowerCase'}\n",
"回答是标准答案的一个超集\n",
"1: 0.0\n",
"示例 2\n",
"2: 1.0\n",
"示例 3\n",
"3: 1.0\n",
"示例 4\n",
"错误\n",
"产品集合: {'SoundMax Home Theater', 'CineView 8K TV', 'CineView 4K TV', 'CineView OLED TV', 'SoundMax Soundbar'}\n",
"标准的产品集合: {'CineView 8K TV'}\n",
"回答是标准答案的一个超集\n",
"错误\n",
"产品集合: {'ProGamer Racing Wheel', 'ProGamer Controller', 'GameSphere Y', 'GameSphere VR Headset', 'GameSphere X'}\n",
"标准的产品集合: {'GameSphere X'}\n",
"回答是标准答案的一个超集\n",
"4: 0.3333333333333333\n",
"示例 5\n",
"5: 1.0\n",
"示例 6\n",
"6: 1.0\n",
"示例 7\n",
"7: 1.0\n",
"示例 8\n",
"8: 1.0\n",
"示例 9\n",
"9: 1\n",
"正确比例为 10: 0.8333333333333334\n"
]
}
],
"source": [
"import time\n",
"\n",
"score_accum = 0\n",
"for i, pair in enumerate(msg_ideal_pairs_set):\n",
" time.sleep(20)\n",
" print(f\"示例 {i}\")\n",
" \n",
" customer_msg = pair['customer_msg']\n",
" ideal = pair['ideal_answer']\n",
" \n",
" # print(\"Customer message\",customer_msg)\n",
" # print(\"ideal:\",ideal)\n",
" response = find_category_and_product_v2(customer_msg,\n",
" products_and_category)\n",
"\n",
" \n",
" # print(\"products_by_category\",products_by_category)\n",
" score = eval_response_with_ideal(response,ideal,debug=False)\n",
" print(f\"{i}: {score}\")\n",
" score_accum += score\n",
" \n",
"\n",
"n_examples = len(msg_ideal_pairs_set)\n",
"fraction_correct = score_accum / n_examples\n",
"print(f\"正确比例为 {n_examples}: {fraction_correct}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}