fix: revise again according to the review suggestions
1. Add Chinese documents 2. Add Chinese prompts
@ -81,7 +81,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 44,
|
||||
"execution_count": 5,
|
||||
"id": "7f8d2266-4a35-4904-ae9d-c89790c5ae61",
|
||||
"metadata": {
|
||||
"height": 166,
|
||||
@ -105,7 +105,9 @@
|
||||
"id": "460a54b0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"我们刚刚讨论了`Document Loading`(文档加载)和`Splitting`(分割)."
|
||||
"前两节课我们讨论了`Document Loading`(文档加载)和`Splitting`(分割)。\n",
|
||||
"\n",
|
||||
"下面我们将使用前两节课的知识对文档进行加载分割。"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -126,7 +128,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 45,
|
||||
"execution_count": 6,
|
||||
"id": "2437469e",
|
||||
"metadata": {
|
||||
"height": 249,
|
||||
@ -149,6 +151,36 @@
|
||||
" docs.extend(loader.load())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d1ed78d9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"下面文档是datawhale官方开源的matplotlib教程链接 https://datawhalechina.github.io/fantastic-matplotlib/index.html ,可在该网站上下载对应的教程"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 60,
|
||||
"id": "11de2150",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import PyPDFLoader\n",
|
||||
"\n",
|
||||
"# 加载 PDF\n",
|
||||
"loaders_chinese = [\n",
|
||||
" # 故意添加重复文档,使数据混乱\n",
|
||||
" PyPDFLoader(\"docs/matplotlib/第一回:Matplotlib初相识.pdf\"),\n",
|
||||
" PyPDFLoader(\"docs/matplotlib/第一回:Matplotlib初相识.pdf\"),\n",
|
||||
" PyPDFLoader(\"docs/matplotlib/第二回:艺术画笔见乾坤.pdf\"),\n",
|
||||
" PyPDFLoader(\"docs/matplotlib/第三回:布局格式定方圆.pdf\")\n",
|
||||
"]\n",
|
||||
"docs_chinese = []\n",
|
||||
"for loader in loaders_chinese:\n",
|
||||
" docs_chinese.extend(loader.load())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4b497f5c",
|
||||
@ -159,7 +191,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"execution_count": 61,
|
||||
"id": "eb44bf0d",
|
||||
"metadata": {
|
||||
"height": 115,
|
||||
@ -177,7 +209,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"execution_count": 62,
|
||||
"id": "b71e46cc",
|
||||
"metadata": {
|
||||
"height": 30,
|
||||
@ -190,7 +222,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"execution_count": 63,
|
||||
"id": "e061f22d",
|
||||
"metadata": {
|
||||
"height": 30,
|
||||
@ -203,7 +235,7 @@
|
||||
"209"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"execution_count": 63,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@ -212,6 +244,37 @@
|
||||
"len(splits)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 64,
|
||||
"id": "99840d4c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"splits_chinese = text_splitter.split_documents(docs_chinese)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 65,
|
||||
"id": "d6b60ceb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"27"
|
||||
]
|
||||
},
|
||||
"execution_count": 65,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"len(splits_chinese)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "848e26fd",
|
||||
@ -230,7 +293,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"execution_count": 13,
|
||||
"id": "d9dca7a8",
|
||||
"metadata": {
|
||||
"height": 47,
|
||||
@ -254,7 +317,31 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"execution_count": 14,
|
||||
"id": "abae34b5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sentence1 = \"i like dogs\"\n",
|
||||
"sentence2 = \"i like canines\"\n",
|
||||
"sentence3 = \"the weather is ugly outside\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"id": "d1745b19",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"embedding1 = embedding.embed_query(sentence1)\n",
|
||||
"embedding2 = embedding.embed_query(sentence2)\n",
|
||||
"embedding3 = embedding.embed_query(sentence3)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"id": "c4099521",
|
||||
"metadata": {
|
||||
"height": 64,
|
||||
@ -262,14 +349,14 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sentence1 = \"我喜欢狗\"\n",
|
||||
"sentence2 = \"我喜欢犬科动物\"\n",
|
||||
"sentence3 = \"外面的天气很糟糕\""
|
||||
"sentence1_chinese = \"我喜欢狗\"\n",
|
||||
"sentence2_chinese = \"我喜欢犬科动物\"\n",
|
||||
"sentence3_chinese = \"外面的天气很糟糕\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"execution_count": 17,
|
||||
"id": "d553549a",
|
||||
"metadata": {
|
||||
"height": 64,
|
||||
@ -277,9 +364,9 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"embedding1 = embedding.embed_query(sentence1)\n",
|
||||
"embedding2 = embedding.embed_query(sentence2)\n",
|
||||
"embedding3 = embedding.embed_query(sentence3)"
|
||||
"embedding1_chinese = embedding.embed_query(sentence1_chinese)\n",
|
||||
"embedding2_chinese = embedding.embed_query(sentence2_chinese)\n",
|
||||
"embedding3_chinese = embedding.embed_query(sentence3_chinese)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -300,7 +387,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"execution_count": 18,
|
||||
"id": "0cbe9a9e",
|
||||
"metadata": {
|
||||
"height": 30,
|
||||
@ -311,6 +398,81 @@
|
||||
"import numpy as np"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"id": "536db32e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.9631853877103518"
|
||||
]
|
||||
},
|
||||
"execution_count": 19,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"np.dot(embedding1, embedding2)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"id": "29844c64",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.7709997651294672"
|
||||
]
|
||||
},
|
||||
"execution_count": 20,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"np.dot(embedding1, embedding3)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"id": "c1e17a9e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.7596334120325523"
|
||||
]
|
||||
},
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"np.dot(embedding2, embedding3)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bdf08401",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"我们可以看到前两个`embedding`的分数相当高,为0.96。\n",
|
||||
"\n",
|
||||
"如果我们将第一个`embedding`与第三个`embedding`进行比较,我们可以看到它明显较低,约为0.77。\n",
|
||||
"\n",
|
||||
"如果我们将第二个`embedding`和第三个`embedding`进行比较,我们可以看到它的分数大约为0.75。"
|
||||
]
|
||||
},
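The scores above are plain dot products between embedding vectors. A minimal sketch of why that works as a similarity measure, using made-up 3-dimensional vectors rather than real model output (`toy_dogs`, `toy_canines`, and `toy_weather` are hypothetical stand-ins, and the helper functions are written here for illustration): when vectors have unit length, the dot product equals the cosine similarity, so a higher value means a smaller angle between the two vectors.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences (what np.dot computes for 1-D inputs)."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical toy "embeddings": the first two point in similar directions.
toy_dogs = normalize([0.9, 0.1, 0.2])
toy_canines = normalize([0.8, 0.2, 0.2])
toy_weather = normalize([0.1, 0.9, 0.3])

sim_close = dot(toy_dogs, toy_canines)
sim_far = dot(toy_dogs, toy_weather)

# For unit-length vectors the dot product and the cosine similarity agree.
print(sim_close, cosine_similarity(toy_dogs, toy_canines))
print(sim_close > sim_far)
```

If the embeddings in use are not unit-length, the raw dot product also reflects vector magnitude, and `cosine_similarity` is the safer comparison.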
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
@ -332,7 +494,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"np.dot(embedding1, embedding2)"
|
||||
"np.dot(embedding1_chinese, embedding2_chinese)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -356,7 +518,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"np.dot(embedding1, embedding3)"
|
||||
"np.dot(embedding1_chinese, embedding3_chinese)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -380,7 +542,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"np.dot(embedding2, embedding3)"
|
||||
"np.dot(embedding2_chinese, embedding3_chinese)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -403,17 +565,25 @@
|
||||
"## 四、Vectorstores"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "54776973",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 4.1 初始化Chroma"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "20916c7c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Langchain集成有超过30个不同的向量存储。我们选择Chroma是因为它轻量级且内存中,这使得它非常容易启动和开始使用。"
|
||||
"Langchain集成了超过30个不同的向量存储库。我们选择Chroma是因为它轻量级且数据存储在内存中,这使得它非常容易启动和开始使用。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"execution_count": 24,
|
||||
"id": "201e6afa",
|
||||
"metadata": {
|
||||
"height": 30,
|
||||
@ -426,7 +596,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"execution_count": 28,
|
||||
"id": "93960ac5",
|
||||
"metadata": {
|
||||
"height": 30,
|
||||
@ -434,7 +604,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"persist_directory = 'docs/chroma/'"
|
||||
"persist_directory = 'docs/chroma/cs229_lectures/'"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -447,12 +617,12 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!rm -rf './docs/chroma' # 删除旧的数据库文件(如果文件夹中有文件的话)"
|
||||
"!rm -rf './docs/chroma/cs229_lectures' # 删除旧的数据库文件(如果文件夹中有文件的话),window电脑请手动删除"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"execution_count": 29,
|
||||
"id": "690efd0a",
|
||||
"metadata": {
|
||||
"height": 98,
|
||||
@ -469,7 +639,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"execution_count": 30,
|
||||
"id": "f777480c",
|
||||
"metadata": {
|
||||
"height": 30,
|
||||
@ -488,12 +658,64 @@
|
||||
"print(vectordb._collection.count())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 66,
|
||||
"id": "d720e2b7",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"persist_directory_chinese = 'docs/chroma/matplotlib/'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "570a2768",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!rm -rf './docs/chroma/matplotlib' # 删除旧的数据库文件(如果文件夹中有文件的话)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 67,
|
||||
"id": "1e03438d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"vectordb_chinese = Chroma.from_documents(\n",
|
||||
" documents=splits_chinese,\n",
|
||||
" embedding=embedding,\n",
|
||||
" persist_directory=persist_directory_chinese # 允许我们将persist_directory目录保存到磁盘上\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 68,
|
||||
"id": "cf87aba0",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"27\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(vectordb_chinese._collection.count())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9f78f412",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"我们可以看到它的长度也是209,这与我们之前的切分数量是一样的。现在让我们开始使用它。"
|
||||
"我们可以看到英文版的长度也是209、中文版的长度也是30,这与我们之前的切分数量是一样的。现在让我们开始使用它。"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -501,7 +723,7 @@
|
||||
"id": "efca7589",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 相似性搜索(Similarity Search)"
|
||||
"### 4.2 相似性搜索(Similarity Search)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -585,8 +807,84 @@
|
||||
"source": [
|
||||
"如果我们查看第一个文档的内容,我们可以看到它实际上是关于一个电子邮件地址,cs229-qa@cs.stanford.edu。\n",
|
||||
"\n",
|
||||
"这是我们可以向其发送问题的电子邮件,所有的助教都会阅读这些邮件。\n",
|
||||
"\n",
|
||||
"这是我们可以向其发送问题的电子邮件,所有的助教都会阅读这些邮件。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 69,
|
||||
"id": "53bcc061",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"question_chinese = \"Matplotlib是什么?\" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 70,
|
||||
"id": "2d2dc834",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"docs_chinese = vectordb_chinese.similarity_search(question_chinese,k=3)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 71,
|
||||
"id": "eebe77e7",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"3"
|
||||
]
|
||||
},
|
||||
"execution_count": 71,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"len(docs_chinese)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 72,
|
||||
"id": "860af154",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'第⼀回:Matplotlib 初相识\\n⼀、认识matplotlib\\nMatplotlib 是⼀个 Python 2D 绘图库,能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形,⽤来绘制各种静态,动态,\\n交互式的图表。\\nMatplotlib 可⽤于 Python 脚本, Python 和 IPython Shell 、 Jupyter notebook , Web 应⽤程序服务器和各种图形⽤户界⾯⼯具包等。\\nMatplotlib 是 Python 数据可视化库中的泰⽃,它已经成为 python 中公认的数据可视化⼯具,我们所熟知的 pandas 和 seaborn 的绘图接⼝\\n其实也是基于 matplotlib 所作的⾼级封装。\\n为了对matplotlib 有更好的理解,让我们从⼀些最基本的概念开始认识它,再逐渐过渡到⼀些⾼级技巧中。\\n⼆、⼀个最简单的绘图例⼦\\nMatplotlib 的图像是画在 figure (如 windows , jupyter 窗体)上的,每⼀个 figure ⼜包含了⼀个或多个 axes (⼀个可以指定坐标系的⼦区\\n域)。最简单的创建 figure 以及 axes 的⽅式是通过 pyplot.subplots命令,创建 axes 以后,可以使⽤ Axes.plot绘制最简易的折线图。\\nimport matplotlib.pyplot as plt\\nimport matplotlib as mpl\\nimport numpy as np\\nfig, ax = plt.subplots() # 创建⼀个包含⼀个 axes 的 figure\\nax.plot([1, 2, 3, 4], [1, 4, 2, 3]); # 绘制图像\\nTrick: 在jupyter notebook 中使⽤ matplotlib 时会发现,代码运⾏后⾃动打印出类似 <matplotlib.lines.Line2D at 0x23155916dc0>\\n这样⼀段话,这是因为 matplotlib 的绘图代码默认打印出最后⼀个对象。如果不想显示这句话,有以下三种⽅法,在本章节的代码示例\\n中你能找到这三种⽅法的使⽤。\\n\\x00. 在代码块最后加⼀个分号 ;\\n\\x00. 在代码块最后加⼀句 plt.show()\\n\\x00. 在绘图时将绘图对象显式赋值给⼀个变量,如将 plt.plot([1, 2, 3, 4]) 改成 line =plt.plot([1, 2, 3, 4])\\n和MATLAB 命令类似,你还可以通过⼀种更简单的⽅式绘制图像, matplotlib.pyplot⽅法能够直接在当前 axes 上绘制图像,如果⽤户\\n未指定axes , matplotlib 会帮你⾃动创建⼀个。所以上⾯的例⼦也可以简化为以下这⼀⾏代码。\\nline =plt.plot([1, 2, 3, 4], [1, 4, 2, 3]) \\n三、Figure 的组成\\n现在我们来深⼊看⼀下 figure 的组成。通过⼀张 figure 解剖图,我们可以看到⼀个完整的 matplotlib 图像通常会包括以下四个层级,这些\\n层级也被称为容器( container ),下⼀节会详细介绍。在 matplotlib 的世界中,我们将通过各种命令⽅法来操纵图像中的每⼀个部分,\\n从⽽达到数据可视化的最终效果,⼀副完整的图像实际上是各类⼦元素的集合。\\nFigure:顶层级,⽤来容纳所有绘图元素'"
|
||||
]
|
||||
},
|
||||
"execution_count": 72,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"docs_chinese[0].page_content"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "60a720a6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"如果我们查看第一个文档的内容,我们可以看到它实际上是关于Matplotlib的介绍"
|
||||
]
|
||||
},
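Conceptually, a similarity search embeds the query, scores it against every stored vector, and returns the `k` best matches. The sketch below illustrates only that idea with hand-written toy vectors. It is not how Chroma is implemented, and `toy_store`, `cosine`, and `top_k` are hypothetical names introduced for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    # Highest similarity first; only the score is used as the sort key.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical store of (text, vector) pairs; real embeddings have far more dimensions.
toy_store = [
    ("intro to matplotlib", [0.9, 0.1, 0.0]),
    ("figure containers", [0.6, 0.4, 0.1]),
    ("layout and subplots", [0.3, 0.7, 0.2]),
    ("unrelated weather note", [0.0, 0.1, 0.9]),
]

results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(results)
```

Asking for more results than the store holds simply returns everything, ranked by similarity.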
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fba57a1d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"在此之后,我们要确保通过运行vectordb.persist来持久化向量数据库,以便我们在未来的课程中使用。\n",
|
||||
"\n",
|
||||
"让我们保存它,以便以后使用!"
|
||||
@ -594,7 +892,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 35,
|
||||
"execution_count": 44,
|
||||
"id": "ea657123",
|
||||
"metadata": {
|
||||
"height": 30,
|
||||
@ -605,6 +903,16 @@
|
||||
"vectordb.persist()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 73,
|
||||
"id": "9065f5b1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"vectordb_chinese.persist()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cefe9f6a",
|
||||
@ -651,16 +959,36 @@
|
||||
"docs = vectordb.similarity_search(question,k=5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 51,
|
||||
"id": "34359d7a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"question_chinese = \"Matplotlib是什么?\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 52,
|
||||
"id": "c0546006",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"docs_chinese = vectordb_chinese.similarity_search(question_chinese,k=5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2a9f579e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"请注意,我们得到了重复的块(因为索引中有重复的 `MachineLearning-Lecture01.pdf`)。\n",
|
||||
"请注意,我们得到了重复的块(因为索引中有重复的 `MachineLearning-Lecture01.pdf`、`第一回:Matplotlib初相识.pdf`)。\n",
|
||||
"\n",
|
||||
"语义搜索获取所有相似的文档,但不强制多样性。\n",
|
||||
"\n",
|
||||
"`docs[0]` 和 `docs[1]` 是完全相同的。"
|
||||
"`docs[0]` 和 `docs[1]` 是完全相同的,以及`docs_chinese[0]` 和 `docs_chinese[1]` 是完全相同的。"
|
||||
]
|
||||
},
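Because the same PDF was loaded twice, identical chunks come back from the search. One simple way to see the effect, and to filter it after retrieval, is to drop results whose text has already been seen. This is only an illustrative post-filter under stated assumptions, not a claim about how the course resolves the problem: `ToyDocument` is a hypothetical stand-in for a retrieved document with `page_content` and `metadata` attributes, and `drop_duplicate_chunks` is a name made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_duplicate_chunks(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Simulated search result in which the top hit appears twice.
retrieved = [
    ToyDocument("chunk A", {"page": 0}),
    ToyDocument("chunk A", {"page": 0}),
    ToyDocument("chunk B", {"page": 9}),
]

unique_docs = drop_duplicate_chunks(retrieved)
print([doc.page_content for doc in unique_docs])
```

Filtering after retrieval shrinks the result list, so fewer than `k` documents may remain; avoiding the duplicate at indexing time sidesteps that.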
|
||||
{
|
||||
@ -711,6 +1039,48 @@
|
||||
"docs[1]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 53,
|
||||
"id": "092fd2f5",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Document(page_content='第⼀回:Matplotlib 初相识\\n⼀、认识matplotlib\\nMatplotlib 是⼀个 Python 2D 绘图库,能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形,⽤来绘制各种静态,动态,\\n交互式的图表。\\nMatplotlib 可⽤于 Python 脚本, Python 和 IPython Shell 、 Jupyter notebook , Web 应⽤程序服务器和各种图形⽤户界⾯⼯具包等。\\nMatplotlib 是 Python 数据可视化库中的泰⽃,它已经成为 python 中公认的数据可视化⼯具,我们所熟知的 pandas 和 seaborn 的绘图接⼝\\n其实也是基于 matplotlib 所作的⾼级封装。\\n为了对matplotlib 有更好的理解,让我们从⼀些最基本的概念开始认识它,再逐渐过渡到⼀些⾼级技巧中。\\n⼆、⼀个最简单的绘图例⼦\\nMatplotlib 的图像是画在 figure (如 windows , jupyter 窗体)上的,每⼀个 figure ⼜包含了⼀个或多个 axes (⼀个可以指定坐标系的⼦区\\n域)。最简单的创建 figure 以及 axes 的⽅式是通过 pyplot.subplots命令,创建 axes 以后,可以使⽤ Axes.plot绘制最简易的折线图。\\nimport matplotlib.pyplot as plt\\nimport matplotlib as mpl\\nimport numpy as np\\nfig, ax = plt.subplots() # 创建⼀个包含⼀个 axes 的 figure\\nax.plot([1, 2, 3, 4], [1, 4, 2, 3]); # 绘制图像\\nTrick: 在jupyter notebook 中使⽤ matplotlib 时会发现,代码运⾏后⾃动打印出类似 <matplotlib.lines.Line2D at 0x23155916dc0>\\n这样⼀段话,这是因为 matplotlib 的绘图代码默认打印出最后⼀个对象。如果不想显示这句话,有以下三种⽅法,在本章节的代码示例\\n中你能找到这三种⽅法的使⽤。\\n\\x00. 在代码块最后加⼀个分号 ;\\n\\x00. 在代码块最后加⼀句 plt.show()\\n\\x00. 在绘图时将绘图对象显式赋值给⼀个变量,如将 plt.plot([1, 2, 3, 4]) 改成 line =plt.plot([1, 2, 3, 4])\\n和MATLAB 命令类似,你还可以通过⼀种更简单的⽅式绘制图像, matplotlib.pyplot⽅法能够直接在当前 axes 上绘制图像,如果⽤户\\n未指定axes , matplotlib 会帮你⾃动创建⼀个。所以上⾯的例⼦也可以简化为以下这⼀⾏代码。\\nline =plt.plot([1, 2, 3, 4], [1, 4, 2, 3]) \\n三、Figure 的组成\\n现在我们来深⼊看⼀下 figure 的组成。通过⼀张 figure 解剖图,我们可以看到⼀个完整的 matplotlib 图像通常会包括以下四个层级,这些\\n层级也被称为容器( container ),下⼀节会详细介绍。在 matplotlib 的世界中,我们将通过各种命令⽅法来操纵图像中的每⼀个部分,\\n从⽽达到数据可视化的最终效果,⼀副完整的图像实际上是各类⼦元素的集合。\\nFigure:顶层级,⽤来容纳所有绘图元素\\uf03a Contents \\n⼀、认识matplotlib\\n⼆、⼀个最简单的绘图例⼦\\n三、Figure 的组成', metadata={'source': 'docs/matplotlib/第一回:Matplotlib初相识.pdf', 'page': 0})"
|
||||
]
|
||||
},
|
||||
"execution_count": 53,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"docs_chinese[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 54,
|
||||
"id": "270f0a10",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Document(page_content='第⼀回:Matplotlib 初相识\\n⼀、认识matplotlib\\nMatplotlib 是⼀个 Python 2D 绘图库,能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形,⽤来绘制各种静态,动态,\\n交互式的图表。\\nMatplotlib 可⽤于 Python 脚本, Python 和 IPython Shell 、 Jupyter notebook , Web 应⽤程序服务器和各种图形⽤户界⾯⼯具包等。\\nMatplotlib 是 Python 数据可视化库中的泰⽃,它已经成为 python 中公认的数据可视化⼯具,我们所熟知的 pandas 和 seaborn 的绘图接⼝\\n其实也是基于 matplotlib 所作的⾼级封装。\\n为了对matplotlib 有更好的理解,让我们从⼀些最基本的概念开始认识它,再逐渐过渡到⼀些⾼级技巧中。\\n⼆、⼀个最简单的绘图例⼦\\nMatplotlib 的图像是画在 figure (如 windows , jupyter 窗体)上的,每⼀个 figure ⼜包含了⼀个或多个 axes (⼀个可以指定坐标系的⼦区\\n域)。最简单的创建 figure 以及 axes 的⽅式是通过 pyplot.subplots命令,创建 axes 以后,可以使⽤ Axes.plot绘制最简易的折线图。\\nimport matplotlib.pyplot as plt\\nimport matplotlib as mpl\\nimport numpy as np\\nfig, ax = plt.subplots() # 创建⼀个包含⼀个 axes 的 figure\\nax.plot([1, 2, 3, 4], [1, 4, 2, 3]); # 绘制图像\\nTrick: 在jupyter notebook 中使⽤ matplotlib 时会发现,代码运⾏后⾃动打印出类似 <matplotlib.lines.Line2D at 0x23155916dc0>\\n这样⼀段话,这是因为 matplotlib 的绘图代码默认打印出最后⼀个对象。如果不想显示这句话,有以下三种⽅法,在本章节的代码示例\\n中你能找到这三种⽅法的使⽤。\\n\\x00. 在代码块最后加⼀个分号 ;\\n\\x00. 在代码块最后加⼀句 plt.show()\\n\\x00. 在绘图时将绘图对象显式赋值给⼀个变量,如将 plt.plot([1, 2, 3, 4]) 改成 line =plt.plot([1, 2, 3, 4])\\n和MATLAB 命令类似,你还可以通过⼀种更简单的⽅式绘制图像, matplotlib.pyplot⽅法能够直接在当前 axes 上绘制图像,如果⽤户\\n未指定axes , matplotlib 会帮你⾃动创建⼀个。所以上⾯的例⼦也可以简化为以下这⼀⾏代码。\\nline =plt.plot([1, 2, 3, 4], [1, 4, 2, 3]) \\n三、Figure 的组成\\n现在我们来深⼊看⼀下 figure 的组成。通过⼀张 figure 解剖图,我们可以看到⼀个完整的 matplotlib 图像通常会包括以下四个层级,这些\\n层级也被称为容器( container ),下⼀节会详细介绍。在 matplotlib 的世界中,我们将通过各种命令⽅法来操纵图像中的每⼀个部分,\\n从⽽达到数据可视化的最终效果,⼀副完整的图像实际上是各类⼦元素的集合。\\nFigure:顶层级,⽤来容纳所有绘图元素\\uf03a Contents \\n⼀、认识matplotlib\\n⼆、⼀个最简单的绘图例⼦\\n三、Figure 的组成', metadata={'source': 'docs/matplotlib/第一回:Matplotlib初相识.pdf', 'page': 0})"
|
||||
]
|
||||
},
|
||||
"execution_count": 54,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"docs_chinese[1]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3a3a915d",
|
||||
@ -810,6 +1180,99 @@
|
||||
"print(docs[4].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 74,
|
||||
"id": "e0a7e9c8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"question_chinese = \"他们在第二讲中对Figure说了些什么?\" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 75,
|
||||
"id": "62c0bfc6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"docs_chinese = vectordb_chinese.similarity_search(question_chinese,k=5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 76,
|
||||
"id": "26bd586d",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"{'source': 'docs/matplotlib/第一回:Matplotlib初相识.pdf', 'page': 0}\n",
|
||||
"{'source': 'docs/matplotlib/第一回:Matplotlib初相识.pdf', 'page': 0}\n",
|
||||
"{'source': 'docs/matplotlib/第二回:艺术画笔见乾坤.pdf', 'page': 9}\n",
|
||||
"{'source': 'docs/matplotlib/第二回:艺术画笔见乾坤.pdf', 'page': 10}\n",
|
||||
"{'source': 'docs/matplotlib/第一回:Matplotlib初相识.pdf', 'page': 1}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for doc_chinese in docs_chinese:\n",
|
||||
" print(doc_chinese.metadata)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 77,
|
||||
"id": "71b6bbf3",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"三、对象容器 - Object container\n",
|
||||
"容器会包含⼀些 primitives,并且容器还有它⾃身的属性。\n",
|
||||
"⽐如Axes Artist,它是⼀种容器,它包含了很多 primitives,⽐如Line2D,Text;同时,它也有⾃身的属性,⽐如 xscal,⽤来控制\n",
|
||||
"X轴是linear还是log的。\n",
|
||||
"1. Figure容器\n",
|
||||
"matplotlib.figure.Figure是Artist最顶层的 container对象容器,它包含了图表中的所有元素。⼀张图表的背景就是在\n",
|
||||
"Figure.patch的⼀个矩形 Rectangle。\n",
|
||||
"当我们向图表添加 Figure.add_subplot()或者Figure.add_axes()元素时,这些都会被添加到 Figure.axes列表中。\n",
|
||||
"fig = plt.figure()\n",
|
||||
"ax1 = fig.add_subplot(211) # 作⼀幅2*1 的图,选择第 1 个⼦图\n",
|
||||
"ax2 = fig.add_axes([0.1, 0.1, 0.7, 0.3]) # 位置参数,四个数分别代表了\n",
|
||||
"(left,bottom,width,height)\n",
|
||||
"print(ax1) \n",
|
||||
"print(fig.axes) # fig.axes 中包含了 subplot 和 axes 两个实例 , 刚刚添加的\n",
|
||||
"AxesSubplot(0.125,0.536818;0.775x0.343182)\n",
|
||||
"[<AxesSubplot:>, <Axes:>]\n",
|
||||
"由于Figure维持了current axes,因此你不应该⼿动的从 Figure.axes列表中添加删除元素,⽽是要通过 Figure.add_subplot()、\n",
|
||||
"Figure.add_axes()来添加元素,通过 Figure.delaxes()来删除元素。但是你可以迭代或者访问 Figure.axes中的Axes,然后修改这个\n",
|
||||
"Axes的属性。\n",
|
||||
"⽐如下⾯的遍历 axes ⾥的内容,并且添加⽹格线:\n",
|
||||
"fig = plt.figure()\n",
|
||||
"ax1 = fig.add_subplot(211)\n",
|
||||
"for ax in fig.axes:\n",
|
||||
" ax.grid(True)\n",
|
||||
"Figure也有它⾃⼰的 text、line 、 patch 、 image。你可以直接通过 add primitive语句直接添加。但是注意 Figure默认的坐标系是以像\n",
|
||||
"素为单位,你可能需要转换成 figure 坐标系: (0,0) 表示左下点, (1,1) 表示右上点。\n",
|
||||
"Figure容器的常⻅属性:\n",
|
||||
"Figure.patch属性:Figure 的背景矩形\n",
|
||||
"Figure.axes属性:⼀个 Axes 实例的列表(包括 Subplot)\n",
|
||||
"Figure.images属性:⼀个 FigureImages patch 列表\n",
|
||||
"Figure.lines属性:⼀个 Line2D 实例的列表(很少使⽤)\n",
|
||||
"Figure.legends属性:⼀个 Figure Legend 实例列表(不同于 Axes.legends)\n",
|
||||
"Figure.texts属性:⼀个 Figure Text 实例列表\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(docs_chinese[2].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c3dbca56",
|
||||
@ -817,16 +1280,6 @@
|
||||
"source": [
|
||||
"在下一讲中讨论的方法可以用来解决这两个问题!"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "8cdc67ae-59e0-4a84-9fe6-e3a0391114d0",
|
||||
"metadata": {
|
||||
"height": 30
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
||||
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.