From 2c81dcdb3ff3eaade5a968fb58e26e32fa601077 Mon Sep 17 00:00:00 2001 From: huangyulin Date: Fri, 28 Apr 2023 22:16:13 +0800 Subject: [PATCH] =?UTF-8?q?=E6=96=B0=E5=A2=9E=EF=BC=9A=E6=96=87=E6=9C=AC?= =?UTF-8?q?=E6=A6=82=E6=8B=AC?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- content/文本概括 Summarizing.ipynb | 616 +++++++++++++++++++++++++++++ 1 file changed, 616 insertions(+) create mode 100644 content/文本概括 Summarizing.ipynb diff --git a/content/文本概括 Summarizing.ipynb b/content/文本概括 Summarizing.ipynb new file mode 100644 index 0000000..9e5b7db --- /dev/null +++ b/content/文本概括 Summarizing.ipynb @@ -0,0 +1,616 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "716d29bd", + "metadata": {}, + "source": [ + "# 文本概括 Summarizing" + ] + }, + { + "cell_type": "markdown", + "id": "f6b7fca0", + "metadata": {}, + "source": [ + "## 1 引言" + ] + }, + { + "cell_type": "markdown", + "id": "4b819ee9", + "metadata": {}, + "source": [ + "当今世界上有太多的文本信息,几乎没有人能够拥有足够的时间去阅读所有我们想了解的东西。但令人感到欣喜的是,目前LLM在文本概括任务上展现了强大的水准,也已经有不少团队将这项功能插入了自己的软件应用中。\n", + "\n", + "本章节将介绍如何使用编程的方式,调用API接口来实现“文本概括”功能。" + ] + }, + { + "cell_type": "markdown", + "id": "48b35259", + "metadata": {}, + "source": [ + "首先,我们需要OpenAI包,加载API密钥,定义getCompletion函数。" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "5ab873ff", + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "openai.api_key = 'sk-请输入你的api-key'\n", + "\n", + "def get_completion(prompt, model=\"gpt-3.5-turbo\"): \n", + " messages = [{\"role\": \"user\", \"content\": prompt}]\n", + " response = openai.ChatCompletion.create(\n", + " model=model,\n", + " messages=messages,\n", + " temperature=0, # 值越低则输出文本随机性越低\n", + " )\n", + " return response.choices[0].message[\"content\"]" + ] + }, + { + "cell_type": "markdown", + "id": "469f1276", + "metadata": {}, + "source": [ + "## 2 单一文本概括Prompt实验" + ] + }, + { + "cell_type": "markdown", + "id": "cd7d8580", + "metadata": {}, + "source": [ + "这里我们举了个商品评论的例子。对于电商平台来说,网站上往往存在着海量的商品评论,这些评论反映了所有客户的想法。如果我们拥有一个工具去概括这些海量、冗长的评论,便能够快速地浏览更多评论,洞悉客户的偏好,从而指导平台与商家提供更优质的服务。" + ] + }, + { + "cell_type": "markdown", + "id": "2fca0e2b", + "metadata": {}, + "source": [ + "**输入文本**" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "3d3bd87d", + "metadata": {}, + "outputs": [], + "source": [ + "prod_review = \"\"\"\n", + "Got this panda plush toy for my daughter's birthday, \\\n", + "who loves it and takes it everywhere. It's soft and \\ \n", + "super cute, and its face has a friendly look. It's \\ \n", + "a bit small for what I paid though. I think there \\ \n", + "might be other options that are bigger for the \\ \n", + "same price. It arrived a day earlier than expected, \\ \n", + "so I got to play with it myself before I gave it \\ \n", + "to her.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "1ec202f1", + "metadata": {}, + "source": [ + "**输入文本(中文翻译)**" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "84a653de", + "metadata": {}, + "outputs": [], + "source": [ + "prod_review_zh = \"\"\"\n", + "这个熊猫公仔是我给女儿的生日礼物,她很喜欢,去哪都带着。\n", + "公仔很软,超级可爱,面部表情也很和善。但是相比于价钱来说,\n", + "它有点小,我感觉在别的地方用同样的价钱能买到更大的。\n", + "快递比预期提前了一天到货,所以在送给女儿之前,我自己玩了会。\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "e0f78610", + "metadata": {}, + "source": [ + "### 2.1 限制输出文本长度" + ] + }, + { + "cell_type": "markdown", + "id": "4066804b", + "metadata": {}, + "source": [ + "我们尝试限制文本长度为最多30词。" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "d6e423f2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Soft and cute panda plush toy loved by daughter, but a bit small for the price. Arrived early.\n" + ] + } + ], + "source": [ + "prompt = f\"\"\"\n", + "Your task is to generate a short summary of a product \\\n", + "review from an ecommerce site. \n", + "\n", + "Summarize the review below, delimited by triple \n", + "backticks, in at most 30 words. \n", + "\n", + "Review: ```{prod_review}```\n", + "\"\"\"\n", + "\n", + "response = get_completion(prompt)\n", + "print(response)" + ] + }, + { + "cell_type": "markdown", + "id": "3586d82d", + "metadata": {}, + "source": [ + "中文翻译版本" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "51cbed99", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "可爱软熊猫公仔,女儿喜欢,面部表情和善,但价钱有点小贵,快递提前一天到货。\n" + ] + } + ], + "source": [ + "prompt = f\"\"\"\n", + "你的任务是从电子商务网站上生成一个产品评论的简短摘要。\n", + "\n", + "请对三个反引号之间的评论文本进行概括,最多30个词汇。\n", + "\n", + "评论: ```{prod_review_zh}```\n", + "\"\"\"\n", + "\n", + "response = get_completion(prompt)\n", + "print(response)" + ] + }, + { + "cell_type": "markdown", + "id": "08c1643c", + "metadata": {}, + "source": [ + "### 2.2 关键角度侧重" + ] + }, + { + "cell_type": "markdown", + "id": "f2582b5f", + "metadata": {}, + "source": [ + "有时,针对不同的业务,我们对文本的侧重会有所不同。例如对于商品评论文本,物流会更关心运输时效,商家更加关心价格与商品质量,平台更关心整体服务体验。\n", + "\n", + "我们可以通过增加Prompt提示,来体现对于某个特定角度的侧重。" + ] + }, + { + "cell_type": "markdown", + "id": "9da7a497", + "metadata": {}, + "source": [ + "**侧重于运输**" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "1432a7fe", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The panda plush toy arrived a day earlier than expected, but the customer felt it was a bit small for the price paid.\n" + ] + } + ], + "source": [ + "prompt = f\"\"\"\n", + "Your task is to generate a short summary of a product \\\n", + "review from an ecommerce site to give feedback to the \\\n", + "Shipping deparmtment. \n", + "\n", + "Summarize the review below, delimited by triple \n", + "backticks, in at most 30 words, and focusing on any aspects \\\n", + "that mention shipping and delivery of the product. \n", + "\n", + "Review: ```{prod_review}```\n", + "\"\"\"\n", + "\n", + "response = get_completion(prompt)\n", + "print(response)" + ] + }, + { + "cell_type": "markdown", + "id": "d3b46d0f", + "metadata": {}, + "source": [ + "中文翻译版本" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "8fff34de", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "快递提前到货,熊猫公仔软可爱,但有点小,价钱不太划算。\n" + ] + } + ], + "source": [ + "prompt = f\"\"\"\n", + "你的任务是从电子商务网站上生成一个产品评论的简短摘要。\n", + "\n", + "请对三个反引号之间的评论文本进行概括,最多30个词汇,并且聚焦在产品运输上。\n", + "\n", + "评论: ```{prod_review_zh}```\n", + "\"\"\"\n", + "\n", + "response = get_completion(prompt)\n", + "print(response)" + ] + }, + { + "cell_type": "markdown", + "id": "426fb25f", + "metadata": {}, + "source": [ + "可以看到,输出结果以“快递提前一天到货”开头,体现了对于快递效率的侧重。" + ] + }, + { + "cell_type": "markdown", + "id": "aaa71480", + "metadata": {}, + "source": [ + "**侧重于价格与质量**" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "18546578", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The panda plush toy is soft, cute, and loved by the recipient, but the price may be too high for its size compared to other options.\n" + ] + } + ], + "source": [ + "prompt = f\"\"\"\n", + "Your task is to generate a short summary of a product \\\n", + "review from an ecommerce site to give feedback to the \\\n", + "pricing deparmtment, responsible for determining the \\\n", + "price of the product. \n", + "\n", + "Summarize the review below, delimited by triple \n", + "backticks, in at most 30 words, and focusing on any aspects \\\n", + "that are relevant to the price and perceived value. \n", + "\n", + "Review: ```{prod_review}```\n", + "\"\"\"\n", + "\n", + "response = get_completion(prompt)\n", + "print(response)" + ] + }, + { + "cell_type": "markdown", + "id": "63645a57", + "metadata": {}, + "source": [ + "中文翻译版本" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "826ecc6d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "可爱软熊猫公仔,面部表情友好,但价钱有点高,尺寸较小。快递提前一天到货。\n" + ] + } + ], + "source": [ + "prompt = f\"\"\"\n", + "你的任务是从电子商务网站上生成一个产品评论的简短摘要。\n", + "\n", + "请对三个反引号之间的评论文本进行概括,最多30个词汇,并且聚焦在产品价格和质量上。\n", + "\n", + "评论: ```{prod_review_zh}```\n", + "\"\"\"\n", + "\n", + "response = get_completion(prompt)\n", + "print(response)" + ] + }, + { + "cell_type": "markdown", + "id": "d181a9b4", + "metadata": {}, + "source": [ + "可以看到,输出结果以“质量好、价格小贵、尺寸小”开头,体现了对于产品价格与质量的侧重。" + ] + }, + { + "cell_type": "markdown", + "id": "57595596", + "metadata": {}, + "source": [ + "### 2.3 关键信息提取" + ] + }, + { + "cell_type": "markdown", + "id": "5c1fe1d5", + "metadata": {}, + "source": [ + "在2.2节中,虽然我们通过添加关键角度侧重的Prompt,使得文本摘要更侧重于某一特定方面,但是可以发现,结果中也会保留一些其他信息,如价格与质量角度的概括中仍保留了“快递提前到货”的信息。有时这些信息是有帮助的,但如果我们只想要提取某一角度的信息,并过滤掉其他所有信息,则可以要求LLM进行“文本提取(Extract)”而非“文本概括(Summarize)”。" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "8ae5c70f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\"The product arrived a day earlier than expected.\"\n" + ] + } + ], + "source": [ + "prompt = f\"\"\"\n", + "Your task is to extract relevant information from \\ \n", + "a product review from an ecommerce site to give \\\n", + "feedback to the Shipping department. \n", + "\n", + "From the review below, delimited by triple quotes \\\n", + "extract the information relevant to shipping and \\ \n", + "delivery. Limit to 30 words. \n", + "\n", + "Review: ```{prod_review}```\n", + "\"\"\"\n", + "\n", + "response = get_completion(prompt)\n", + "print(response)" + ] + }, + { + "cell_type": "markdown", + "id": "149b594e", + "metadata": {}, + "source": [ + "中文翻译版本" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "5993dd50", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "快递提前一天到货,但产品有点小,相比价钱来说可以买到更大的。\n" + ] + } + ], + "source": [ + "prompt = f\"\"\"\n", + "你的任务是从电子商务网站上的产品评论中提取相关信息,以向运输部门提供反馈。\n", + "\n", + "请从以下三个反引号之间的评论文本中提取产品运输相关的信息,最多30个词汇。\n", + "\n", + "评论: ```{prod_review_zh}```\n", + "\"\"\"\n", + "\n", + "response = get_completion(prompt)\n", + "print(response)" + ] + }, + { + "cell_type": "markdown", + "id": "9fd0c3b8", + "metadata": {}, + "source": [ + "## 3 多条文本概括Prompt实验" + ] + }, + { + "cell_type": "markdown", + "id": "2b27575c", + "metadata": {}, + "source": [ + "在实际的工作流中,我们往往有许许多多的评论文本,以下展示了一个基于for循环调用“文本概括”工具并依次打印的示例。当然,在实际生产中,对于上百万甚至上千万的评论文本,使用for循环也是不现实的,可能需要考虑整合评论、分布式等方法提升运算效率。" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "74f9930d", + "metadata": {}, + "outputs": [], + "source": [ + "review_1 = prod_review \n", + "\n", + "# review for a standing lamp\n", + "review_2 = \"\"\"\n", + "Needed a nice lamp for my bedroom, and this one \\\n", + "had additional storage and not too high of a price \\\n", + "point. Got it fast - arrived in 2 days. The string \\\n", + "to the lamp broke during the transit and the company \\\n", + "happily sent over a new one. Came within a few days \\\n", + "as well. It was easy to put together. Then I had a \\\n", + "missing part, so I contacted their support and they \\\n", + "very quickly got me the missing piece! Seems to me \\\n", + "to be a great company that cares about their customers \\\n", + "and products. \n", + "\"\"\"\n", + "\n", + "# review for an electric toothbrush\n", + "review_3 = \"\"\"\n", + "My dental hygienist recommended an electric toothbrush, \\\n", + "which is why I got this. The battery life seems to be \\\n", + "pretty impressive so far. After initial charging and \\\n", + "leaving the charger plugged in for the first week to \\\n", + "condition the battery, I've unplugged the charger and \\\n", + "been using it for twice daily brushing for the last \\\n", + "3 weeks all on the same charge. But the toothbrush head \\\n", + "is too small. I’ve seen baby toothbrushes bigger than \\\n", + "this one. I wish the head was bigger with different \\\n", + "length bristles to get between teeth better because \\\n", + "this one doesn’t. Overall if you can get this one \\\n", + "around the $50 mark, it's a good deal. The manufactuer's \\\n", + "replacements heads are pretty expensive, but you can \\\n", + "get generic ones that're more reasonably priced. This \\\n", + "toothbrush makes me feel like I've been to the dentist \\\n", + "every day. My teeth feel sparkly clean! \n", + "\"\"\"\n", + "\n", + "# review for a blender\n", + "review_4 = \"\"\"\n", + "So, they still had the 17 piece system on seasonal \\\n", + "sale for around $49 in the month of November, about \\\n", + "half off, but for some reason (call it price gouging) \\\n", + "around the second week of December the prices all went \\\n", + "up to about anywhere from between $70-$89 for the same \\\n", + "system. And the 11 piece system went up around $10 or \\\n", + "so in price also from the earlier sale price of $29. \\\n", + "So it looks okay, but if you look at the base, the part \\\n", + "where the blade locks into place doesn’t look as good \\\n", + "as in previous editions from a few years ago, but I \\\n", + "plan to be very gentle with it (example, I crush \\\n", + "very hard items like beans, ice, rice, etc. in the \\ \n", + "blender first then pulverize them in the serving size \\\n", + "I want in the blender then switch to the whipping \\\n", + "blade for a finer flour, and use the cross cutting blade \\\n", + "first when making smoothies, then use the flat blade \\\n", + "if I need them finer/less pulpy). Special tip when making \\\n", + "smoothies, finely cut and freeze the fruits and \\\n", + "vegetables (if using spinach-lightly stew soften the \\ \n", + "spinach then freeze until ready for use-and if making \\\n", + "sorbet, use a small to medium sized food processor) \\ \n", + "that you plan to use that way you can avoid adding so \\\n", + "much ice if at all-when making your smoothie. \\\n", + "After about a year, the motor was making a funny noise. \\\n", + "I called customer service but the warranty expired \\\n", + "already, so I had to buy another one. FYI: The overall \\\n", + "quality has gone done in these types of products, so \\\n", + "they are kind of counting on brand recognition and \\\n", + "consumer loyalty to maintain sales. Got it in about \\\n", + "two days.\n", + "\"\"\"\n", + "\n", + "reviews = [review_1, review_2, review_3, review_4]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5fb23db5", + "metadata": {}, + "outputs": [], + "source": [ + "for i in range(len(reviews)):\n", + " prompt = f\"\"\"\n", + " Your task is to generate a short summary of a product \\ \n", + " review from an ecommerce site. \n", + "\n", + " Summarize the review below, delimited by triple \\\n", + " backticks in at most 20 words. \n", + "\n", + " Review: ```{reviews[i]}```\n", + " \"\"\"\n", + "\n", + " response = get_completion(prompt)\n", + " print(i, response, \"\\n\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c0be2824", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:chatgpt]", + "language": "python", + "name": "conda-env-chatgpt-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}