finish 6 问答
This commit is contained in:
@ -0,0 +1,849 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 第六章 问答"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 一、引言\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"在上一章,我们已经讨论了如何检索与给定问题相关的文档。下一步是获取这些文档,拿到原始问题,将它们一起传递给语言模型,并要求它回答这个问题。在本课程中,我们将详细介绍这一过程,以及完成这项任务的几种不同方法。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"我们已经完成了整个存储和获取,获取了相关的切分文档之后,现在我们需要将它们传递给语言模型,以获得答案。这个过程的一般流程如下:首先问题被提出,然后我们查找相关的文档,接着将这些切分文档和系统提示一起传递给语言模型,并获得答案。\n",
|
||||
"\n",
|
||||
"默认情况下,我们将所有的文档切片都传递到同一个上下文窗口中,即同一次语言模型调用中。然而,有一些不同的方法可以解决这个问题,它们都有优缺点。大部分优点来自于有时可能会有很多文档,但你简单地无法将它们全部传递到同一个上下文窗口中。MapReduce、Refine 和 MapRerank 是三种方法,用于解决这个短上下文窗口的问题。我们将在该课程中进行简要介绍。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 二、环境配置"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"配置环境方法同前,此处不再赘述"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import openai\n",
|
||||
"import sys\n",
|
||||
"sys.path.append('../..')\n",
|
||||
"\n",
|
||||
"from dotenv import load_dotenv, find_dotenv\n",
|
||||
"\n",
|
||||
"_ = load_dotenv(find_dotenv()) # read local .env file\n",
|
||||
"\n",
|
||||
"openai.api_key = os.environ['OPENAI_API_KEY']\n",
|
||||
"\n",
|
||||
"os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'\n",
|
||||
"os.environ[\"HTTP_PROXY\"] = 'http://127.0.0.1:7890'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"在2023年9月2日之后,GPT-3.5 API 会进行更新,因此此处需要进行一个时间判断"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"gpt-3.5-turbo-0301\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import datetime\n",
|
||||
"current_date = datetime.datetime.now().date()\n",
|
||||
"if current_date < datetime.date(2023, 9, 2):\n",
|
||||
" llm_name = \"gpt-3.5-turbo-0301\"\n",
|
||||
"else:\n",
|
||||
" llm_name = \"gpt-3.5-turbo\"\n",
|
||||
"print(llm_name)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 三、加载向量数据库"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 加载在之前已经进行持久化的向量数据库\n",
|
||||
"from langchain.vectorstores import Chroma\n",
|
||||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||||
"persist_directory = 'docs/chroma'\n",
|
||||
"embedding = OpenAIEmbeddings()\n",
|
||||
"vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"209\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# 可以看见包含了我们之前进行分割的209个文档\n",
|
||||
"print(vectordb._collection.count())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"100%|██████████| 1/1 [00:07<00:00, 7.17s/it]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"'''由于目前缺乏第五章检索,暂时初始化一个向量数据库代替'''\n",
|
||||
"from langchain.document_loaders import PyPDFLoader\n",
|
||||
"\n",
|
||||
"# 加载 PDF\n",
|
||||
"loaders = [\n",
|
||||
" # 故意添加重复文档,使数据混乱\n",
|
||||
" PyPDFLoader(\"docs/cs229_lectures/MachineLearning-Lecture01.pdf\"),\n",
|
||||
" PyPDFLoader(\"docs/cs229_lectures/MachineLearning-Lecture01.pdf\"),\n",
|
||||
" PyPDFLoader(\"docs/cs229_lectures/MachineLearning-Lecture02.pdf\"),\n",
|
||||
" PyPDFLoader(\"docs/cs229_lectures/MachineLearning-Lecture03.pdf\")\n",
|
||||
"]\n",
|
||||
"docs = []\n",
|
||||
"for loader in loaders:\n",
|
||||
" docs.extend(loader.load())\n",
|
||||
"\n",
|
||||
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
|
||||
"text_splitter = RecursiveCharacterTextSplitter(\n",
|
||||
" chunk_size = 1500, # 每个文本块的大小。这意味着每次切分文本时,会尽量使每个块包含 1500 个字符。\n",
|
||||
" chunk_overlap = 150 # 每个文本块之间的重叠部分。\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"splits = text_splitter.split_documents(docs)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"vectordb = Chroma.from_documents(\n",
|
||||
" documents=splits,\n",
|
||||
" embedding=embedding,\n",
|
||||
" persist_directory=persist_directory # 允许我们将persist_directory目录保存到磁盘上\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"我们可以测试一下对于一个提问进行向量检索。如下代码会在向量数据库中根据相似性进行检索,返回给你 k 个文档。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 34,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"3"
|
||||
]
|
||||
},
|
||||
"execution_count": 34,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"question = \"What are major topics for this class?\"\n",
|
||||
"docs = vectordb.similarity_search(question,k=3)\n",
|
||||
"len(docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 35,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"3"
|
||||
]
|
||||
},
|
||||
"execution_count": 35,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"question = \"这节课的主要话题是什么\"\n",
|
||||
"docs = vectordb.similarity_search(question,k=3)\n",
|
||||
"len(docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 四、构造检索式问答连"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"基于 LangChain,我们可以构造一个使用 GPT3.5 进行问答的检索式问答链,这是一种通过检索步骤进行问答的方法。我们可以通过传入一个语言模型和一个向量数据库来创建它作为检索器。然后,我们可以用问题作为查询调用它,得到一个答案。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 使用 ChatGPT3.5,温度设置为0\n",
|
||||
"from langchain.chat_models import ChatOpenAI\n",
|
||||
"llm = ChatOpenAI(model_name=llm_name, temperature=0)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 导入检索式问答链\n",
|
||||
"from langchain.chains import RetrievalQA"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 声明一个检索式问答链\n",
|
||||
"qa_chain = RetrievalQA.from_chain_type(\n",
|
||||
" llm,\n",
|
||||
" retriever=vectordb.as_retriever()\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 可以以该方式进行检索问答\n",
|
||||
"question = \"What are major topics for this class?\"\n",
|
||||
"result = qa_chain({\"query\": question})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'The major topic for this class is machine learning. Additionally, there may be some discussion on statistics and algebra as a refresher, and later in the quarter, there may be some discussion on extensions for the material covered in the main lectures.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 32,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 可以以该方式进行检索问答\n",
|
||||
"question = \"这节课的主要话题是什么\"\n",
|
||||
"result = qa_chain({\"query\": question})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 33,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"从这些上下文来看,这节课的主要话题包括课程信息、在线资源和线性代数。\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(result[\"result\"])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 五、深入探究检索式问答链"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"通过上述代码,我们可以实现一个简单的检索式问答链。接下来,让我们深入其中的细节,看看在这个检索式问答链中,LangChain 都做了些什么。\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 5.1 基于模板的检索式问答链"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"我们首先定义了一个提示模板。它包含一些关于如何使用下面的上下文片段的说明,然后有一个上下文变量的占位符。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.prompts import PromptTemplate\n",
|
||||
"\n",
|
||||
"# Build prompt\n",
|
||||
"template = \"\"\"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say \"thanks for asking!\" at the end of the answer. \n",
|
||||
"{context}\n",
|
||||
"Question: {question}\n",
|
||||
"Helpful Answer:\"\"\"\n",
|
||||
"QA_CHAIN_PROMPT = PromptTemplate.from_template(template)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 36,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 中文版\n",
|
||||
"from langchain.prompts import PromptTemplate\n",
|
||||
"\n",
|
||||
"# Build prompt\n",
|
||||
"template = \"\"\"使用以下上下文片段来回答最后的问题。如果你不知道答案,只需说不知道,不要试图编造答案。答案最多使用三个句子。尽量简明扼要地回答。在回答的最后一定要说\"感谢您的提问!\"\n",
|
||||
"{context}\n",
|
||||
"问题:{question}\n",
|
||||
"有用的回答:\"\"\"\n",
|
||||
"QA_CHAIN_PROMPT = PromptTemplate.from_template(template)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 37,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Run chain\n",
|
||||
"qa_chain = RetrievalQA.from_chain_type(\n",
|
||||
" llm,\n",
|
||||
" retriever=vectordb.as_retriever(),\n",
|
||||
" return_source_documents=True,\n",
|
||||
" chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT}\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 38,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"question = \"Is probability a class topic?\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 39,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"result = qa_chain({\"query\": question})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 40,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Yes, probability is a class topic and the instructor assumes familiarity with basic probability and statistics.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 40,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 47,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 中文版\n",
|
||||
"question = \"机器学习是其中一节的话题吗\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 48,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"result = qa_chain({\"query\": question})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 49,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'是的,机器学习是其中一节的话题。感谢您的提问!'"
|
||||
]
|
||||
},
|
||||
"execution_count": 49,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 50,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Document(page_content=\"So in this class, we've tried to convey to you a broad set of principl es and tools that will \\nbe useful for doing many, many things. And ev ery time I teach this class, I can actually \\nvery confidently say that af ter December, no matter what yo u're going to do after this \\nDecember when you've sort of completed this class, you'll find the things you learn in \\nthis class very useful, and these things will be useful pretty much no matter what you end \\nup doing later in your life. \\nSo I have more logistics to go over later, but let's say a few more words about machine \\nlearning. I feel that machine learning grew out of early work in AI, early work in artificial \\nintelligence. And over the last — I wanna say last 15 or last 20 years or so, it's been viewed as a sort of growing new capability for computers. And in particular, it turns out \\nthat there are many programs or there are many applications that you can't program by \\nhand. \\nFor example, if you want to get a computer to read handwritten characters, to read sort of \\nhandwritten digits, that actual ly turns out to be amazingly difficult to write a piece of \\nsoftware to take this input, an image of some thing that I wrote and to figure out just what \\nit is, to translate my cursive handwriting into — to extract the characters I wrote out in \\nlonghand. And other things: One thing that my students and I do is autonomous flight. It \\nturns out to be extremely difficult to sit dow n and write a program to fly a helicopter.\", metadata={'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 2})"
|
||||
]
|
||||
},
|
||||
"execution_count": 50,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"result[\"source_documents\"][0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"这种方法非常好,因为它只涉及对语言模型的一次调用。然而,它也有局限性,即如果文档太多,可能无法将它们全部适配到上下文窗口中。我们可以使用另一种技术来对文档进行问答,即MapReduce技术。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 5.2 基于 MapReduce 的检索式问答链"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"在 MapReduce 技术中,首先将每个独立的文档单独发送到语言模型以获取原始答案。然后,这些答案通过最终对语言模型的一次调用组合成最终的答案。虽然这样涉及了更多对语言模型的调用,但它的优势在于可以处理任意数量的文档。\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 51,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"qa_chain_mr = RetrievalQA.from_chain_type(\n",
|
||||
" llm,\n",
|
||||
" retriever=vectordb.as_retriever(),\n",
|
||||
" chain_type=\"map_reduce\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 52,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"question = \"Is probability a class topic?\"\n",
|
||||
"result = qa_chain_mr({\"query\": question})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 53,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'It is not clear from the given portion of the document whether probability is a class topic or not. The text only mentions that familiarity with basic probability and statistics is assumed as a prerequisite for the class.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 53,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 55,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'根据给出的文件部分,没有提到概率论。'"
|
||||
]
|
||||
},
|
||||
"execution_count": 55,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"qa_chain_mr = RetrievalQA.from_chain_type(\n",
|
||||
" llm,\n",
|
||||
" retriever=vectordb.as_retriever(),\n",
|
||||
" chain_type=\"map_reduce\"\n",
|
||||
")\n",
|
||||
"# 中文版\n",
|
||||
"question = \"概率论是其中一节的话题吗\"\n",
|
||||
"result = qa_chain_mr({\"query\": question})\n",
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"当我们将之前的问题通过这个链进行运行时,我们可以看到这种方法的两个问题。第一,速度要慢得多。第二,结果实际上更差。根据给定文档的这一部分,对这个问题并没有明确的答案。这可能是因为它是基于每个文档单独回答的。因此,如果信息分布在两个文档之间,它并没有在同一上下文中获取到所有的信息。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#import os\n",
|
||||
"#os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
|
||||
"#os.environ[\"LANGCHAIN_ENDPOINT\"] = \"https://api.langchain.plus\"\n",
|
||||
"#os.environ[\"LANGCHAIN_API_KEY\"] = \"...\" # replace dots with your api key"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"我们可导入上述环境变量,然后探寻 MapReduce 文档链的细节。例如,上述演示中,我们实际上涉及了四个单独的对语言模型的调用。在运行完每个文档后,它们会在最终链式中组合在一起,即Stuffed Documents链,将所有这些回答合并到最终的调用中。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 5.3 基于 Refine 的检索式问答链"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"我们可以类似地设置链式类型为Refine。这是一种新的链式类型。Refine 文档链类似于 MapReduce 链,对于每一个文档,会调用一次 LLM,但有所改进的是,我们每次发送给 LLM 的最终提示是一个序列,这个序列会将先前的响应与新数据结合在一起,并请求得到改进后的响应。因此,这是一种类似于 RNN 的概念,我们增强了上下文,从而解决信息分布在不同文档的问题。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 56,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Based on the additional context provided, probability is assumed to be a prerequisite and not a main topic of the class. The instructor assumes that students are familiar with basic probability and statistics, including random variables, expectation, variance, and basic linear algebra. The class will not be very programming-intensive, but some programming will be done in MATLAB or Octave. The instructor will provide a refresher course on the prerequisites in some of the discussion sections. The class also assumes familiarity with basic linear algebra, including matrices, vectors, matrix multiplication, and matrix inverse. Most undergraduate linear algebra courses, such as Math 51, 103, Math 113, or CS205 at Stanford, are more than enough. The instructor will also review the prerequisites in some of the discussion sections.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 56,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"qa_chain_mr = RetrievalQA.from_chain_type(\n",
|
||||
" llm,\n",
|
||||
" retriever=vectordb.as_retriever(),\n",
|
||||
" chain_type=\"refine\"\n",
|
||||
")\n",
|
||||
"question = \"Is probability a class topic?\"\n",
|
||||
"result = qa_chain_mr({\"query\": question})\n",
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 57,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Based on the additional context provided, the instructor mentions that they will cover statistics and algebra in the discussion sections as a refresher, and will also use the discussion sections to go over extensions of the material taught in the main lectures. However, there is no explicit mention of probability theory being covered in the course. Therefore, the original answer still stands.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 57,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"qa_chain_mr = RetrievalQA.from_chain_type(\n",
|
||||
" llm,\n",
|
||||
" retriever=vectordb.as_retriever(),\n",
|
||||
" chain_type=\"refine\"\n",
|
||||
")\n",
|
||||
"question = \"概率论是其中一节的话题吗\"\n",
|
||||
"result = qa_chain_mr({\"query\": question})\n",
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"你会注意到,这个结果比MapReduce链的结果要好。这是因为使用Refined Chain允许你逐个地组合信息,实际上比MapReduce链鼓励更多的信息传递。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 六、实验:状态记录"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"让我们在这里做一个实验。\n",
|
||||
"\n",
|
||||
"我们将创建一个QA链,使用默认的stuff。让我们问一个问题,概率论是课程的主题吗?它会回答,概率论应该是先决条件。我们将追问,为什么需要这些先决条件?然后我们得到了一个答案。这门课的先决条件是假定具有计算机科学和基本计算机技能和原理的基本知识。这与之前问有关概率的问题毫不相关。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 58,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"qa_chain = RetrievalQA.from_chain_type(\n",
|
||||
" llm,\n",
|
||||
" retriever=vectordb.as_retriever()\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 59,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Yes, probability is a topic in this class. The speaker assumes that students have familiarity with basic probability and statistics, and mentions that most undergraduate statistics classes will be more than enough preparation for this class.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 59,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"question = \"Is probability a class topic?\"\n",
|
||||
"result = qa_chain({\"query\": question})\n",
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 60,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'The prerequisites are needed because in this class, the instructor assumes that all students have a basic knowledge of computer science and knowledge of basic computer skills and principles. This includes understanding of big-O notation and other fundamental concepts. Without this basic knowledge, it may be difficult to understand the material covered in the class.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 60,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"question = \"why are those prerequesites needed?\"\n",
|
||||
"result = qa_chain({\"query\": question})\n",
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 62,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'是的,作者在文中提到了这门课程需要学生具备基本的概率论和统计学知识。'"
|
||||
]
|
||||
},
|
||||
"execution_count": 62,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"question = \"概率论是这节课的一个内容吗\"\n",
|
||||
"result = qa_chain({\"query\": question})\n",
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 63,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'在这段上下文中,作者提到这些知识是这门课程的先决条件,因为这门课程涉及到机器学习的基本概念和算法,需要学生具备计算机科学和基本计算机技能和原理的基本知识。如果学生没有这些基础知识,可能会很难理解和应用机器学习算法。因此,学生需要具备这些知识才能更好地学习和应用机器学习。'"
|
||||
]
|
||||
},
|
||||
"execution_count": 63,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"question = \"为什么需要具备这些知识\"\n",
|
||||
"result = qa_chain({\"query\": question})\n",
|
||||
"result[\"result\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"基本上,我们使用的链式(chain)没有任何状态的概念。它不记得之前的问题或之前的答案。为了实现这一点,我们需要引入内存,这是我们将在下一节中讨论的内容。"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "gpt",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.11"
|
||||
},
|
||||
"orig_nbformat": 4
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
Reference in New Issue
Block a user