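diff --git a/content/Building Systems with the ChatGPT API/8.Evaluation-part1.ipynb b/content/Building Systems with the ChatGPT API/8.Evaluation-part1.ipynb index c11b09a..6541890 100644 --- a/content/Building Systems with the ChatGPT API/8.Evaluation-part1.ipynb +++ b/content/Building Systems with the ChatGPT API/8.Evaluation-part1.ipynb @@ -14,31 +14,17 @@ { "attachments": {}, "cell_type": "markdown", - "id": "58449d96", + "id": "c768620b", "metadata": {}, "source": [ "In the previous few videos, we showed how to build an application with an LLM, from evaluating the inputs, to processing them, to running a final check on the output before showing it to the user.\n", "\n", "Once you have built such a system, how do you know how well it is working? And after you deploy it and let users use it, how do you track its behavior, find any shortcomings, and keep improving the quality of its answers?\n", "\n", - "In this video, I want to share some best practices for evaluating the outputs of an LLM.\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "85791cef", - "metadata": {}, - "source": [ - "One difference between building an LLM-based application and building a traditional supervised learning application is that, because you can build the LLM-based application so quickly, evaluation usually does not start from a test set. Instead, you typically build up a set of test examples gradually." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "2bf010ed", - "metadata": {}, - "source": [ + "In this video, I want to share some best practices for evaluating the outputs of an LLM.\n", + "\n", + "One difference between building an LLM-based application and building a traditional supervised learning application is that, because you can build the LLM-based application so quickly, evaluation usually does not start from a test set. Instead, you typically build up a set of test examples gradually.\n", + "\n", "In a traditional supervised learning setting, you collect a training set, a development set or hold-out cross-validation set, and then use them throughout development.\n", "\n", "But if you can specify a prompt in minutes and get something working in a few hours, it would be a huge pain to pause for a long time to collect a thousand test examples, because you can now get this working with zero training examples.\n", @@ -50,22 +36,15 @@ "Then, as you test the system further, you occasionally run into a few tricky examples on which the prompt or the algorithm does not work.\n", "\n", "This is how developers using the ChatGPT API build applications.\n", - " \n", + "\n", "In that case, you can add those one, two, three, or five extra examples to the set you are testing on, opportunistically adding further tricky examples as you find them.\n", "\n", "Eventually, you have added enough examples to your slowly growing development set that manually running every example through the prompt becomes inconvenient.\n", "\n", "At that point, you start developing metrics to measure performance on this small set of examples, such as average accuracy.\n", "\n", - "One interesting aspect of this process is that if at any point you decide the system is working well enough, you can stop right there and not improve it further. In fact, many deployed applications stop at the first or second step and run just fine.\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "3fac7abc", - "metadata": {}, - "source": [ + "One interesting aspect of this process is that if at any point you decide the system is working well enough, you can stop right there and not improve it further. In fact, many deployed applications stop at the first or second step and run just fine.\n", + "\n", "One important caveat: many applications of large language models carry no substantial risk of harm even if they do not give exactly the right answer.\n", "\n", 
"But for some high-stakes applications, where there is a risk of bias or of an inappropriate output harming someone, the responsibility to collect a test set, rigorously evaluate the system's performance, and make sure it does the right thing before you use it becomes much more important.\n", @@ -90,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 1, "id": "a9726b15", "metadata": { "height": 166 @@ -103,10 +82,9 @@ "import time\n", "sys.path.append('../..')\n", "import utils_en\n", - "from dotenv import load_dotenv, find_dotenv\n", - "_ = load_dotenv(find_dotenv()) # read the local .env file\n", + "import utils_ch\n", "\n", - "openai.api_key = os.environ['OPENAI_API_KEY']" + "openai.api_key = os.environ['OPENAI_API_KEY'] # assumes OPENAI_API_KEY is set in the environment; never hard-code the key" ] }, { @@ -144,7 +122,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 5, "id": "6f4062ea", "metadata": { "height": 47 @@ -185,7 +163,7 @@ " 'FotoSnap Instant Camera']}" ] }, - "execution_count": 3, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -265,6 +243,55 @@ " return get_completion_from_messages(messages)\n" ] }, + { + "cell_type": "code", + "execution_count": 3, + "id": "ac683bfb", + "metadata": {}, + "outputs": [], + "source": [ + "'''Chinese prompt'''\n", + "def find_category_and_product_v1(user_input,products_and_category):\n", + "\n", + " delimiter = \"####\"\n", + " system_message = f\"\"\"\n", + " 您将提供客户服务查询。\\\n", + " 客户服务查询将用{delimiter}字符分隔。\n", + " 输出一个python列表,列表中的每个对象都是json对象,每个对象的格式如下:\n", + " 'category': ,\n", + " 以及\n", + " 'products': <必须在下面允许的产品中找到的产品列表>\n", + " \n", + " 其中类别和产品必须在客户服务查询中找到。\n", + " 如果提到了一个产品,它必须与下面允许的产品列表中的正确类别关联。\n", + " 如果没有找到产品或类别,输出一个空列表。\n", + " \n", + " 根据产品名称和产品类别与客户服务查询的相关性,列出所有相关的产品。\n", + " 不要从产品的名称中假设任何特性或属性,如相对质量或价格。\n", + " \n", + " 允许的产品以JSON格式提供。\n", + " 每个项目的键代表类别。\n", + " 每个项目的值是该类别中的产品列表。\n", + " 允许的产品:{products_and_category}\n", + " \n", + " \"\"\"\n", + " \n", + " few_shot_user_1 = \"\"\"我想要最贵的电脑。\"\"\"\n", + " few_shot_assistant_1 = \"\"\" \n", + " [{'category': 'Computers and Laptops', \\\n", + "'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro 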
Desktop', 'BlueWave Chromebook']}]\n", + " \"\"\"\n", + " \n", + " messages = [ \n", + " {'role':'system', 'content': system_message}, \n", + " {'role':'user', 'content': f\"{delimiter}{few_shot_user_1}{delimiter}\"}, \n", + " {'role':'assistant', 'content': few_shot_assistant_1 },\n", + " {'role':'user', 'content': f\"{delimiter}{user_input}{delimiter}\"}, \n", + " ] \n", + " return get_completion_from_messages(messages)" + ] + }, { "attachments": {}, "cell_type": "markdown", @@ -390,7 +417,114 @@ { "attachments": {}, "cell_type": "markdown", - "id": "b8e7d25e", + "id": "f430fa3f", + "metadata": {}, + "source": [ + "Evaluating the Chinese prompt" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "cacb96b2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'SoundMax Soundbar', 'CineView OLED TV']}]\n" + ] + } + ], + "source": [ + "# First query to evaluate\n", + "customer_msg_0 = f\"\"\"如果我预算有限,我可以买哪款电视?\"\"\"\n", + "\n", + "products_by_category_0 = find_category_and_product_v1(customer_msg_0,\n", + " products_and_category)\n", + "print(products_by_category_0)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "04364405", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " [{'category': 'Smartphones and Accessories', 'products': ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']}]\n", + "\n" + ] + } + ], + "source": [ + "customer_msg_1 = f\"\"\"我需要一个智能手机的充电器\"\"\"\n", + "\n", + "products_by_category_1 = find_category_and_product_v1(customer_msg_1,\n", + " products_and_category)\n", + "print(products_by_category_1)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "66e9ecd0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\" [{'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 
'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\"" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "customer_msg_2 = f\"\"\"\n", + "你们有哪些电脑?\"\"\"\n", + "\n", + "products_by_category_2 = find_category_and_product_v1(customer_msg_2,\n", + " products_and_category)\n", + "products_by_category_2" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "112cfd5f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}]\n", + " \n", + " {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}\n" + ] + } + ], + "source": [ + "customer_msg_3 = f\"\"\"\n", + "告诉我关于smartx pro手机和fotosnap相机的信息,那款DSLR的。\n", + "另外,你们有哪些电视?\"\"\"\n", + "\n", + "products_by_category_3 = find_category_and_product_v1(customer_msg_3,\n", + " products_and_category)\n", + "print(products_by_category_3)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "d58f15be", "metadata": {}, "source": [ "It looks like it output the right data, but it also output a bunch of extra text that we do not need. This makes it harder to parse the output into a Python list of dictionaries." @@ -441,6 +575,32 @@ "print(products_by_category_4)" ] }, + { + "cell_type": "code", + "execution_count": 10, + "id": "5b11172f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 8K TV']}, {'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X']}, {'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n", + " \n", + " 具体来说,CineView 
8K电视是一款高端电视,具有8K分辨率和OLED显示屏。GameSphere X是一款游戏机,具有高性能和多种游戏选择。对于预算有限的电脑,您可以考虑TechPro Chromebook或TechPro Ultrabook,它们都是较为经济实惠的选择。\n" + ] + } + ], + "source": [ + "'''Chinese prompt'''\n", + "customer_msg_4 = f\"\"\"\n", + "告诉我关于CineView电视的信息,那款8K的,还有Gamesphere游戏机,X款的。\n", + "我预算有限,你们有哪些电脑?\"\"\"\n", + "\n", + "products_by_category_4 = find_category_and_product_v1(customer_msg_4,products_and_category)\n", + "print(products_by_category_4)" + ] + }, { "attachments": {}, "cell_type": "markdown", @@ -455,7 +615,7 @@ { "attachments": {}, "cell_type": "markdown", - "id": "3c2ac3ae", + "id": "ddcee6a5", "metadata": {}, "source": [ "We added the following to the prompt: do not output any additional text that is not in JSON format. We also added a second example, using user and assistant messages for few-shot prompting." @@ -529,6 +689,67 @@ " return get_completion_from_messages(messages)\n" ] }, + { + "cell_type": "code", + "execution_count": 11, + "id": "d3b183bf", + "metadata": {}, + "outputs": [], + "source": [ + "def find_category_and_product_v2(user_input,products_and_category):\n", + " \"\"\"\n", + " Added: do not output any additional text that is not in JSON format.\n", + " Added a second example (for few-shot prompting) in which the user asks for the cheapest computer. In both few-shot examples, the response shown is just the full list of products in JSON format.\n", + " \"\"\"\n", + " delimiter = \"####\"\n", + " system_message = f\"\"\"\n", + " 您将提供客户服务查询。\\\n", + " 客户服务查询将用{delimiter}字符分隔。\n", + " 输出一个python列表,列表中的每个对象都是json对象,每个对象的格式如下:\n", + " 'category': ,\n", + " AND\n", + " 'products': <必须在下面允许的产品中找到的产品列表>\n", + " 不要输出任何不是JSON格式的额外文本。\n", + " 输出请求的JSON后,不要写任何解释性的文本。\n", + " \n", + " 其中类别和产品必须在客户服务查询中找到。\n", + " 如果提到了一个产品,它必须与下面允许的产品列表中的正确类别关联。\n", + " 如果没有找到产品或类别,输出一个空列表。\n", + " \n", + " 根据产品名称和产品类别与客户服务查询的相关性,列出所有相关的产品。\n", + " 不要从产品的名称中假设任何特性或属性,如相对质量或价格。\n", + " \n", + " 允许的产品以JSON格式提供。\n", + " 每个项目的键代表类别。\n", + " 每个项目的值是该类别中的产品列表。\n", + " 允许的产品:{products_and_category}\n", + " \n", + " \"\"\"\n", + " \n", + " few_shot_user_1 = \"\"\"我想要最贵的电脑。你推荐哪款?\"\"\"\n", + " few_shot_assistant_1 = \"\"\" \n", + " [{'category': 'Computers and Laptops', \\\n", + "'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 
'BlueWave Chromebook']}]\n", + " \"\"\"\n", + " \n", + " few_shot_user_2 = \"\"\"我想要最便宜的电脑。你推荐哪款?\"\"\"\n", + " few_shot_assistant_2 = \"\"\" \n", + " [{'category': 'Computers and Laptops', \\\n", + "'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n", + " \"\"\"\n", + " \n", + " messages = [ \n", + " {'role':'system', 'content': system_message}, \n", + " {'role':'user', 'content': f\"{delimiter}{few_shot_user_1}{delimiter}\"}, \n", + " {'role':'assistant', 'content': few_shot_assistant_1 },\n", + " {'role':'user', 'content': f\"{delimiter}{few_shot_user_2}{delimiter}\"}, \n", + " {'role':'assistant', 'content': few_shot_assistant_2 },\n", + " {'role':'user', 'content': f\"{delimiter}{user_input}{delimiter}\"}, \n", + " ] \n", + " return get_completion_from_messages(messages)" + ] + }, { "attachments": {}, "cell_type": "markdown", @@ -567,6 +788,31 @@ "print(products_by_category_3)" ] }, + { + "cell_type": "code", + "execution_count": 12, + "id": "4a547b34", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}, {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n", + "\n" + ] + } + ], + "source": [ + "customer_msg_3 = f\"\"\"\n", + "告诉我关于smartx pro手机和fotosnap相机的信息,那款DSLR的。\n", + "另外,你们有哪些电视?\"\"\"\n", + "\n", + "products_by_category_3 = find_category_and_product_v2(customer_msg_3,\n", + " products_and_category)\n", + "print(products_by_category_3)" + ] + }, { "attachments": {}, "cell_type": "markdown", @@ -605,6 +851,32 @@ "print(products_by_category_0)" ] }, + { + "cell_type": "code", + "execution_count": 13, + "id": "b5ba773b", + "metadata": {}, + "outputs": [ + { + 
"name": "stdout", + "output_type": "stream", + "text": [ + " \n", + "\n", + " [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n", + " \n", + " 如果您的预算有限,我们建议您购买CineView 4K电视或SoundMax家庭影院。\n" + ] + } + ], + "source": [ + "customer_msg_0 = f\"\"\"如果我预算有限,我可以买哪款电视?\"\"\"\n", + "\n", + "products_by_category_0 = find_category_and_product_v2(customer_msg_0,\n", + " products_and_category)\n", + "print(products_by_category_0)" + ] + }, { "attachments": {}, "cell_type": "markdown", @@ -619,7 +891,7 @@ { "attachments": {}, "cell_type": "markdown", - "id": "20817d3b", + "id": "2af63218", "metadata": {}, "source": [ "When the development set you are tuning against grows beyond a handful of examples, it becomes useful to start automating the testing process." @@ -743,7 +1015,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 16, "id": "d9530285", "metadata": {}, "outputs": [], @@ -881,6 +1153,40 @@ " msg_ideal_pairs_set[7][\"ideal_answer\"])" ] }, + { + "cell_type": "code", + "execution_count": 17, + "id": "bb7f5a2f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "回答: [{'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X', 'ProGamer Controller', 'GameSphere Y', 'ProGamer Racing Wheel', 'GameSphere VR Headset']}]\n" + ] + }, + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'''Call the Chinese prompt'''\n", + "response = find_category_and_product_v2(msg_ideal_pairs_set[7][\"customer_msg\"],\n", + " products_and_category)\n", + "print(f'回答: {response}')\n", + "\n", + "eval_response_with_ideal(response,\n", + " msg_ideal_pairs_set[7][\"ideal_answer\"])" + ] + }, { "attachments": {}, "cell_type": "markdown", @@ -963,7 +1269,7 @@ { "attachments": {}, "cell_type": "markdown", - "id": "ba1218f0", + "id": "5d885db6", "metadata": {}, "source": [ 
"The workflow for building an application with prompts is very different from the workflow for building one with supervised learning.\n", @@ -974,8 +1280,7 @@ "\n", "This is very helpful for you and your team in finding effective prompts and an effective system.\n", "\n", - "In this video, the outputs could be evaluated quantitatively: there was a desired output, and you could tell whether the model produced it. In the next video, let's look at how to evaluate outputs in a more ambiguous setting, where what counts as the right answer is less clear-cut.\n", - "\n" + "In this video, the outputs could be evaluated quantitatively: there was a desired output, and you could tell whether the model produced it. In the next video, let's look at how to evaluate outputs in a more ambiguous setting, where what counts as the right answer is less clear-cut." ] } ],