{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "b58204ea",
"metadata": {},
"source": [
"# Chapter 4: Summarizing"
]
},
{
"cell_type": "markdown",
"id": "a190d6a1",
"metadata": {},
"source": [
"<div class=\"toc\">\n",
" <ul class=\"toc-item\">\n",
" <li><span><a href=\"#1-introduction\" data-toc-modified-id=\"1. Introduction\">1. Introduction</a></span></li>\n",
" <li>\n",
" <span><a href=\"#2-summarizing-a-single-text\" data-toc-modified-id=\"2. Summarizing a Single Text\">2. Summarizing a Single Text</a></span>\n",
" <ul class=\"toc-item\">\n",
" <li><span><a href=\"#21-limiting-the-output-length\" data-toc-modified-id=\"2.1 Limiting the Output Length\">2.1 Limiting the Output Length</a></span></li>\n",
" <li><span><a href=\"#22-summarizing-with-a-specific-focus\" data-toc-modified-id=\"2.2 Summarizing with a Specific Focus\">2.2 Summarizing with a Specific Focus</a></span></li>\n",
" <li><span><a href=\"#23-extracting-key-information\" data-toc-modified-id=\"2.3 Extracting Key Information\">2.3 Extracting Key Information</a></span></li>\n",
" </ul>\n",
" </li>\n",
" <li><span><a href=\"#3-summarizing-multiple-texts\" data-toc-modified-id=\"3. Summarizing Multiple Texts\">3. Summarizing Multiple Texts</a></span></li>\n",
" </ul>\n",
"</div>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b70ad003",
"metadata": {},
"source": [
"## 1. Introduction"
]
},
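{
"attachments": {},
"cell_type": "markdown",
"id": "12fa9ea4",
"metadata": {},
"source": [
"There is so much text in the world today that few of us have time to read everything we would like to understand. Fortunately, LLMs have shown strong performance on text summarization, and many teams have already built summarization into their applications.\n",
"\n",
"This chapter shows how to implement text summarization programmatically by calling an API."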
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1de4fd1e",
"metadata": {},
"source": [
"First, we import the OpenAI package, load the API key, and define the `get_completion` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b4bfa7f",
"metadata": {},
"outputs": [],
"source": [
"# Import the third-party library\n",
"import openai\n",
"\n",
"# Set the API key; replace the placeholder with your own key\n",
"openai.api_key = \"sk-...\"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9f679f1f",
"metadata": {},
"outputs": [],
"source": [
"def get_completion(prompt, model=\"gpt-3.5-turbo\"):\n",
"    # Send a single-turn user prompt and return the model's reply text\n",
"    messages = [{\"role\": \"user\", \"content\": prompt}]\n",
"    response = openai.ChatCompletion.create(\n",
"        model=model,\n",
"        messages=messages,\n",
"        temperature=0,  # lower values make the output less random\n",
"    )\n",
"    return response.choices[0].message[\"content\"]"
]
},
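{
"cell_type": "markdown",
"id": "e1a00001",
"metadata": {},
"source": [
"The cell above uses the `openai.ChatCompletion` interface of the pre-1.0 `openai` package. As a hedged sketch, assuming `openai>=1.0` is installed instead, an equivalent helper could look like the following. The name `get_completion_v1` is hypothetical and introduced only for illustration, and the key placeholder must be replaced with your own key."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a00002",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI  # client class available in openai>=1.0\n",
"\n",
"client = OpenAI(api_key=\"sk-...\")  # replace the placeholder with your own key\n",
"\n",
"def get_completion_v1(prompt, model=\"gpt-3.5-turbo\"):\n",
"    # Hypothetical variant of get_completion for the 1.x client\n",
"    messages = [{\"role\": \"user\", \"content\": prompt}]\n",
"    response = client.chat.completions.create(\n",
"        model=model,\n",
"        messages=messages,\n",
"        temperature=0,  # lower values make the output less random\n",
"    )\n",
"    return response.choices[0].message.content"
]
},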
{
"attachments": {},
"cell_type": "markdown",
"id": "9cca835b",
"metadata": {},
"source": [
"## 2. Summarizing a Single Text"
]
},
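{
"attachments": {},
"cell_type": "markdown",
"id": "0c1e1b92",
"metadata": {},
"source": [
"Take summarizing product reviews as an example. An e-commerce platform typically hosts a huge number of product reviews, and together they reflect what customers think. A tool that condenses these long, numerous reviews lets us skim far more of them, understand customer preferences, and help the platform and its sellers provide better service."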
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9dc2e2bc",
"metadata": {},
"source": [
"**Input text**"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4d9c0eeb",
"metadata": {},
"outputs": [],
"source": [
"prod_review = \"\"\"\n",
"Got this panda plush toy for my daughter's birthday, \\\n",
"who loves it and takes it everywhere. It's soft and \\\n",
"super cute, and its face has a friendly look. It's \\\n",
"a bit small for what I paid though. I think there \\\n",
"might be other options that are bigger for the \\\n",
"same price. It arrived a day earlier than expected, \\\n",
"so I got to play with it myself before I gave it \\\n",
"to her.\n",
"\"\"\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "aad5bd2a",
"metadata": {},
"source": [
"**Input text (Chinese translation)**"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "43b5dd25",
"metadata": {},
"outputs": [],
"source": [
"prod_review_zh = \"\"\"\n",
"这个熊猫公仔是我给女儿的生日礼物,她很喜欢,去哪都带着。\n",
"公仔很软,超级可爱,面部表情也很和善。但是相比于价钱来说,\n",
"它有点小,我感觉在别的地方用同样的价钱能买到更大的。\n",
"快递比预期提前了一天到货,所以在送给女儿之前,我自己玩了会。\n",
"\"\"\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "662c9cd2",
"metadata": {},
"source": [
"### 2.1 Limiting the Output Length"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a6d10814",
"metadata": {},
"source": [
"We first try limiting the summary to at most 30 words."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "02208fbc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Soft and cute panda plush toy loved by daughter, but a bit small for the price. Arrived early.\n"
]
}
],
"source": [
"prompt = f\"\"\"\n",
"Your task is to generate a short summary of a product \\\n",
"review from an ecommerce site. \n",
"\n",
"Summarize the review below, delimited by triple \n",
"backticks, in at most 30 words. \n",
"\n",
"Review: ```{prod_review}```\n",
"\"\"\"\n",
"\n",
"response = get_completion(prompt)\n",
"print(response)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0df0eb90",
"metadata": {},
"source": [
"Chinese version of the prompt"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "bf4b39f9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"可爱软熊猫公仔,女儿喜欢,面部表情和善,但价钱有点小贵,快递提前一天到货。\n"
]
}
],
"source": [
"prompt = f\"\"\"\n",
"您的任务是从电子商务网站上生成一个产品评论的简短摘要。\n",
"\n",
"请对三个反引号之间的评论文本进行概括最多30个词汇。\n",
"\n",
"评论: ```{prod_review_zh}```\n",
"\"\"\"\n",
"\n",
"response = get_completion(prompt)\n",
"print(response)"
]
},
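{
"cell_type": "markdown",
"id": "e1a00003",
"metadata": {},
"source": [
"The model treats a length limit as a guideline, so it can be worth checking the result. The following is a minimal sketch, assuming that whitespace-separated tokens are a good enough word count for English text (it does not apply to Chinese, which is not space-delimited). `within_word_limit` is a hypothetical helper, and the string is the English output recorded above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a00004",
"metadata": {},
"outputs": [],
"source": [
"def within_word_limit(text, max_words=30):\n",
"    # Count whitespace-separated words and compare against the limit\n",
"    n_words = len(text.split())\n",
"    return n_words, n_words <= max_words\n",
"\n",
"summary = (\n",
"    \"Soft and cute panda plush toy loved by daughter, \"\n",
"    \"but a bit small for the price. Arrived early.\"\n",
")\n",
"print(within_word_limit(summary))"
]
},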
{
"attachments": {},
"cell_type": "markdown",
"id": "e9ab145e",
"metadata": {},
"source": [
"### 2.2 Summarizing with a Specific Focus"
]
},
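{
"attachments": {},
"cell_type": "markdown",
"id": "f84d0123",
"metadata": {},
"source": [
"Different parts of a business care about different aspects of the same text. For a product review, for example, the logistics team cares most about delivery time, the seller about price and product quality, and the platform about the overall service experience.\n",
"\n",
"We can add instructions to the prompt to make the summary focus on one particular aspect."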
]
},
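{
"cell_type": "markdown",
"id": "e1a00005",
"metadata": {},
"source": [
"One way to keep such prompt variants manageable is to build the prompt from parameters. The sketch below is an illustration only, not the notebook's own code: `build_summary_prompt` and its `max_words` and `focus` parameters are hypothetical names, and the wording mirrors the prompts used in this chapter. The returned string can be passed to `get_completion` like any other prompt."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a00006",
"metadata": {},
"outputs": [],
"source": [
"def build_summary_prompt(review, max_words=30, focus=None):\n",
"    # Optional clause that steers the summary toward one aspect\n",
"    focus_clause = f\", focusing on any aspects that mention {focus}\" if focus else \"\"\n",
"    return (\n",
"        \"Your task is to generate a short summary of a product \"\n",
"        \"review from an ecommerce site.\\n\\n\"\n",
"        \"Summarize the review below, delimited by triple backticks, \"\n",
"        f\"in at most {max_words} words{focus_clause}.\\n\\n\"\n",
"        f\"Review: ```{review}```\"\n",
"    )\n",
"\n",
"# Example: a shipping-focused prompt for a made-up one-line review\n",
"print(build_summary_prompt(\"Arrived a day early.\", focus=\"shipping and delivery\"))"
]
},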
{
"attachments": {},
"cell_type": "markdown",
"id": "d6f8509a",
"metadata": {},
"source": [
"#### 2.2.1 Focus on Shipping and Delivery"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9d8a32a6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The panda plush toy arrived a day earlier than expected, but the customer felt it was a bit small for the price paid.\n"
]
}
],
"source": [
"prompt = f\"\"\"\n",
"Your task is to generate a short summary of a product \\\n",
"review from an ecommerce site to give feedback to the \\\n",
"Shipping deparmtment. \n",
"\n",
"Summarize the review below, delimited by triple \n",
"backticks, in at most 30 words, and focusing on any aspects \\\n",
"that mention shipping and delivery of the product. \n",
"\n",
"Review: ```{prod_review}```\n",
"\"\"\"\n",
"\n",
"response = get_completion(prompt)\n",
"print(response)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0bd4243a",
"metadata": {},
"source": [
"Chinese version of the prompt"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "80636c3e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"快递提前到货,熊猫公仔软可爱,但有点小,价钱不太划算。\n"
]
}
],
"source": [
"prompt = f\"\"\"\n",
"您的任务是从电子商务网站上生成一个产品评论的简短摘要。\n",
"\n",
"请对三个反引号之间的评论文本进行概括最多30个词汇并且聚焦在产品运输上。\n",
"\n",
"评论: ```{prod_review_zh}```\n",
"\"\"\"\n",
"\n",
"response = get_completion(prompt)\n",
"print(response)"
]
},
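{
"attachments": {},
"cell_type": "markdown",
"id": "76c97fea",
"metadata": {},
"source": [
"As you can see, the summary now opens with the early delivery (\"arrived a day earlier than expected\" in the English output, “快递提前到货” in the Chinese output), which reflects the focus on shipping."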
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "83275907",
"metadata": {},
"source": [
"#### 2.2.2 Focus on Price and Quality"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "767f252c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The panda plush toy is soft, cute, and loved by the recipient, but the price may be too high for its size compared to other options.\n"
]
}
],
"source": [
"prompt = f\"\"\"\n",
"Your task is to generate a short summary of a product \\\n",
"review from an ecommerce site to give feedback to the \\\n",
"pricing deparmtment, responsible for determining the \\\n",
"price of the product. \n",
"\n",
"Summarize the review below, delimited by triple \n",
"backticks, in at most 30 words, and focusing on any aspects \\\n",
"that are relevant to the price and perceived value. \n",
"\n",
"Review: ```{prod_review}```\n",
"\"\"\"\n",
"\n",
"response = get_completion(prompt)\n",
"print(response)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "cf54fac4",
"metadata": {},
"source": [
"Chinese version of the prompt"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "728d6c57",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"可爱软熊猫公仔,面部表情友好,但价钱有点高,尺寸较小。快递提前一天到货。\n"
]
}
],
"source": [
"prompt = f\"\"\"\n",
"您的任务是从电子商务网站上生成一个产品评论的简短摘要。\n",
"\n",
"请对三个反引号之间的评论文本进行概括最多30个词汇并且聚焦在产品价格和质量上。\n",
"\n",
"评论: ```{prod_review_zh}```\n",
"\"\"\"\n",
"\n",
"response = get_completion(prompt)\n",
"print(response)"
]
},
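{
"attachments": {},
"cell_type": "markdown",
"id": "972dbb1b",
"metadata": {},
"source": [
"As you can see, the summary now concentrates on quality, price, and size (the toy is soft and cute, but somewhat expensive for how small it is), which reflects the focus on price and quality."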
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b3ed53d2",
"metadata": {},
"source": [
"### 2.3 Extracting Key Information"
]
},
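{
"attachments": {},
"cell_type": "markdown",
"id": "ba6f5c25",
"metadata": {},
"source": [
"In Section 2.2, adding a focus to the prompt made the summary emphasize one aspect, but the result can still retain other information: the Chinese-language summary focused on price and quality, for example, still mentions that the delivery arrived early. If we want only the information about one aspect, with everything else filtered out, we can ask the LLM to *extract* rather than *summarize*."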
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "2d60dc58",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\"The product arrived a day earlier than expected.\"\n"
]
}
],
"source": [
"prompt = f\"\"\"\n",
"Your task is to extract relevant information from \\ \n",
"a product review from an ecommerce site to give \\\n",
"feedback to the Shipping department. \n",
"\n",
"From the review below, delimited by triple quotes \\\n",
"extract the information relevant to shipping and \\ \n",
"delivery. Limit to 30 words. \n",
"\n",
"Review: ```{prod_review}```\n",
"\"\"\"\n",
"\n",
"response = get_completion(prompt)\n",
"print(response)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0339b877",
"metadata": {},
"source": [
"Chinese version of the prompt"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "c845ccab",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"快递比预期提前了一天到货。\n"
]
}
],
"source": [
"prompt = f\"\"\"\n",
"您的任务是从电子商务网站上的产品评论中提取相关信息。\n",
"\n",
"请从以下三个反引号之间的评论文本中提取产品运输相关的信息最多30个词汇。\n",
"\n",
"评论: ```{prod_review_zh}```\n",
"\"\"\"\n",
"\n",
"response = get_completion(prompt)\n",
"print(response)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "50498a2b",
"metadata": {},
"source": [
"## 3. Summarizing Multiple Texts"
]
},
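{
"attachments": {},
"cell_type": "markdown",
"id": "a291541a",
"metadata": {},
"source": [
"In a real workflow we usually have many reviews. The example below puts several customer reviews in a list, uses a `for` loop to summarize each one in at most 20 words with a summarization prompt, and prints the results in order. In production, depending on the volume of reviews, you may also need approaches beyond a plain `for` loop, such as combining reviews or distributing the work, to improve throughput. You could build a dashboard that summarizes a large number of reviews so that you or others can skim them quickly and click through to the original review when needed, giving you an efficient view of what customers think."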
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ee7caa78",
"metadata": {},
"outputs": [],
"source": [
"review_1 = prod_review \n",
"\n",
"# review for a standing lamp\n",
"review_2 = \"\"\"\n",
"Needed a nice lamp for my bedroom, and this one \\\n",
"had additional storage and not too high of a price \\\n",
"point. Got it fast - arrived in 2 days. The string \\\n",
"to the lamp broke during the transit and the company \\\n",
"happily sent over a new one. Came within a few days \\\n",
"as well. It was easy to put together. Then I had a \\\n",
"missing part, so I contacted their support and they \\\n",
"very quickly got me the missing piece! Seems to me \\\n",
"to be a great company that cares about their customers \\\n",
"and products. \n",
"\"\"\"\n",
"\n",
"# review for an electric toothbrush\n",
"review_3 = \"\"\"\n",
"My dental hygienist recommended an electric toothbrush, \\\n",
"which is why I got this. The battery life seems to be \\\n",
"pretty impressive so far. After initial charging and \\\n",
"leaving the charger plugged in for the first week to \\\n",
"condition the battery, I've unplugged the charger and \\\n",
"been using it for twice daily brushing for the last \\\n",
"3 weeks all on the same charge. But the toothbrush head \\\n",
"is too small. Ive seen baby toothbrushes bigger than \\\n",
"this one. I wish the head was bigger with different \\\n",
"length bristles to get between teeth better because \\\n",
"this one doesnt. Overall if you can get this one \\\n",
"around the $50 mark, it's a good deal. The manufactuer's \\\n",
"replacements heads are pretty expensive, but you can \\\n",
"get generic ones that're more reasonably priced. This \\\n",
"toothbrush makes me feel like I've been to the dentist \\\n",
"every day. My teeth feel sparkly clean! \n",
"\"\"\"\n",
"\n",
"# review for a blender\n",
"review_4 = \"\"\"\n",
"So, they still had the 17 piece system on seasonal \\\n",
"sale for around $49 in the month of November, about \\\n",
"half off, but for some reason (call it price gouging) \\\n",
"around the second week of December the prices all went \\\n",
"up to about anywhere from between $70-$89 for the same \\\n",
"system. And the 11 piece system went up around $10 or \\\n",
"so in price also from the earlier sale price of $29. \\\n",
"So it looks okay, but if you look at the base, the part \\\n",
"where the blade locks into place doesnt look as good \\\n",
"as in previous editions from a few years ago, but I \\\n",
"plan to be very gentle with it (example, I crush \\\n",
"very hard items like beans, ice, rice, etc. in the \\\n",
"blender first then pulverize them in the serving size \\\n",
"I want in the blender then switch to the whipping \\\n",
"blade for a finer flour, and use the cross cutting blade \\\n",
"first when making smoothies, then use the flat blade \\\n",
"if I need them finer/less pulpy). Special tip when making \\\n",
"smoothies, finely cut and freeze the fruits and \\\n",
"vegetables (if using spinach-lightly stew soften the \\\n",
"spinach then freeze until ready for use-and if making \\\n",
"sorbet, use a small to medium sized food processor) \\\n",
"that you plan to use that way you can avoid adding so \\\n",
"much ice if at all-when making your smoothie. \\\n",
"After about a year, the motor was making a funny noise. \\\n",
"I called customer service but the warranty expired \\\n",
"already, so I had to buy another one. FYI: The overall \\\n",
"quality has gone done in these types of products, so \\\n",
"they are kind of counting on brand recognition and \\\n",
"consumer loyalty to maintain sales. Got it in about \\\n",
"two days.\n",
"\"\"\"\n",
"\n",
"reviews = [review_1, review_2, review_3, review_4]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "9d1aa5ac",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 Soft and cute panda plush toy loved by daughter, but a bit small for the price. Arrived early. \n",
"\n",
"1 Affordable lamp with storage, fast shipping, and excellent customer service. Easy to assemble and missing parts were quickly replaced. \n",
"\n",
"2 Good battery life, small toothbrush head, but effective cleaning. Good deal if bought around $50. \n",
"\n",
"3 The product was on sale for $49 in November, but the price increased to $70-$89 in December. The base doesn't look as good as previous editions, but the reviewer plans to be gentle with it. A special tip for making smoothies is to freeze the fruits and vegetables beforehand. The motor made a funny noise after a year, and the warranty had expired. Overall quality has decreased. \n",
"\n"
]
}
],
"source": [
"for i, review in enumerate(reviews):\n",
"    prompt = f\"\"\"\n",
"    Your task is to generate a short summary of a product \\\n",
"    review from an ecommerce site. \n",
"\n",
"    Summarize the review below, delimited by triple \\\n",
"    backticks in at most 20 words. \n",
"\n",
"    Review: ```{review}```\n",
"    \"\"\"\n",
"    response = get_completion(prompt)\n",
"    print(i, response, \"\\n\")"
]
},
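{
"cell_type": "markdown",
"id": "e1a00007",
"metadata": {},
"source": [
"For larger batches, one option beyond a sequential loop is to issue the requests from a small thread pool. The sketch below is a hedged illustration under stated assumptions: `summarize_all` and `fake_summarize` are hypothetical names, and `fake_summarize` is a stand-in that simply truncates the text so the cell runs without an API key. In practice you would pass a function that builds the prompt and calls `get_completion`, and you would also need to respect the API's rate limits."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a00008",
"metadata": {},
"outputs": [],
"source": [
"from concurrent.futures import ThreadPoolExecutor\n",
"\n",
"def summarize_all(texts, summarize, max_workers=4):\n",
"    # Executor.map returns results in the same order as the inputs\n",
"    with ThreadPoolExecutor(max_workers=max_workers) as pool:\n",
"        return list(pool.map(summarize, texts))\n",
"\n",
"def fake_summarize(text):\n",
"    # Stand-in for an LLM call: keep only the first five words\n",
"    return \" \".join(text.split()[:5])\n",
"\n",
"sample_texts = [\n",
"    \"Got it fast and it was easy to put together.\",\n",
"    \"Battery life is impressive but the head is too small.\",\n",
"]\n",
"for i, summary in enumerate(summarize_all(sample_texts, fake_summarize)):\n",
"    print(i, summary)"
]
},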
{
"cell_type": "code",
"execution_count": null,
"id": "eb878522",
"metadata": {},
"outputs": [],
"source": [
"for i, review in enumerate(reviews):\n",
"    prompt = f\"\"\"\n",
"    你的任务是为电子商务网站上的产品评论生成简短摘要。\n",
"\n",
"    请对三个反引号之间的评论文本进行概括,最多20个词汇。\n",
"\n",
"    评论文本: ```{review}```\n",
"    \"\"\"\n",
"    response = get_completion(prompt)\n",
"    print(i, response, \"\\n\")\n"
]
},
{
"cell_type": "markdown",
"id": "d757b389",
"metadata": {},
"source": [
"Output of the Chinese-language prompt above:\n",
"\n",
"0 概括:可爱的熊猫毛绒玩具,质量好,送货快,但有点小。 \n",
"\n",
"1 这个评论是关于一款具有额外储存空间的床头灯,价格适中。客户对公司的服务和产品表示满意。 \n",
"\n",
"2 评论概括:电动牙刷电池寿命长,但刷头太小,需要更长的刷毛。价格合理,使用后牙齿感觉干净。 \n",
"\n",
"3 评论概括:产品价格在12月份上涨,质量不如以前,但交付速度快。 "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 5
}