prompt-engineering-for-deve…/content/LangChain Chat with Data/2.文档加载 Document Loading.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cc2eb3ad-8c1c-406a-b7aa-a3f61b754ac5",
   "metadata": {},
   "source": [
    "# Chapter 1: Document Loading\n",
    "Document loaders can handle many kinds of data, both structured and unstructured."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "582125c3-2afb-4cca-b651-c1810a5e5c22",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q langchain --upgrade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bb73a77f-e17c-45a2-b456-e3ad2bf0fb5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import openai\n",
    "import sys\n",
    "sys.path.append('../..')\n",
    "\n",
    "from dotenv import load_dotenv, find_dotenv\n",
    "_ = load_dotenv(find_dotenv()) \n",
    "\n",
    "openai.api_key  = os.environ['OPENAI_API_KEY']"
   ]
  },
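  {
   "cell_type": "markdown",
   "id": "63558db2-5279-4c1b-9bec-355ab04731e6",
   "metadata": {},
   "source": [
    "## 1. PDF Documents\n",
    "\n",
    "First, we load a [PDF document](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf). This document is the transcript of Professor Andrew Ng's 2009 machine learning course. Because the transcript was generated automatically, the sentences may not always read smoothly."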
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd5fe85c-6aae-4739-9b47-68e791afc9ac",
   "metadata": {},
   "source": [
    "### 1.1 Install the Required Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c527f944-35dc-44a2-9cf9-9887cf315f3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q pypdf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8dcb2102-0414-4130-952b-3b6fa33b61bb",
   "metadata": {},
   "source": [
    "### 1.2 Load the PDF Document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "52d9891f-a8cc-47c4-8c09-81794647a720",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import PyPDFLoader\n",
    "\n",
    "# Create a PyPDFLoader instance; the argument is the path of the PDF to load\n",
    "loader = PyPDFLoader(\"docs/cs229_lectures/MachineLearning-Lecture01.pdf\")\n",
    "\n",
    "# Call load() to read the PDF; it returns one Document per page\n",
    "pages = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68d40600-49ab-42a3-97d0-b9a2c4ab8139",
   "metadata": {},
   "source": [
    "### 1.3 Explore the Loaded Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "feca9f1e-1596-49f2-a6d9-6eeaeffbd90b",
   "metadata": {},
   "source": [
    "After loading, the documents are stored in the `pages` variable:\n",
    "- `pages` is a `list`.\n",
    "- Printing the length of `pages` shows how many pages the PDF contains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9463b982-c71c-4241-b3a3-b040170eef2e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'list'>\n"
     ]
    }
   ],
   "source": [
    "print(type(pages))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "67a2b815-586f-43a5-96a4-cfe46001a766",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "22\n"
     ]
    }
   ],
   "source": [
    "print(len(pages))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2cde6b9d-71c8-4851-a8f6-a3f0e76f6dab",
   "metadata": {},
   "source": [
    "Each element of `pages` is a document of type `langchain.schema.document.Document`, which has two attributes:\n",
    "- `page_content` holds the content of the document.\n",
    "- `metadata` holds descriptive data about the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "921827ae-3a5b-4f29-b015-a5dde3be1410",
   "metadata": {},
   "outputs": [],
   "source": [
    "page = pages[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "aadaa840-0f30-4ae3-b06b-7fe8f468d146",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'langchain.schema.document.Document'>\n"
     ]
    }
   ],
   "source": [
    "print(type(page))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "85777ce2-42c7-4e11-b1ba-06fd6a0d8502",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MachineLearning-Lecture01  \n",
      "Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \n",
      "learning class. So what I wanna do today is ju st spend a little time going over the logistics \n",
      "of the class, and then we'll start to  talk a bit about machine learning.  \n",
      "By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so \n",
      "I personally work in machine learning, and I' ve worked on it for about 15 years now, and \n",
      "I actually think that machine learning i\n"
     ]
    }
   ],
   "source": [
    "print(page.page_content[0:500])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "8a1f8acd-f8c7-46af-a29f-df172067deba",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}\n"
     ]
    }
   ],
   "source": [
    "print(page.metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e9cead7-a967-4a8f-8d3d-0f94f2ff129e",
   "metadata": {},
   "source": [
    "## 2. YouTube Audio\n",
    "\n",
    "In Part 1 we learned how to load a PDF document. In this part, given a YouTube video link, we learn:\n",
    "- how to use a LangChain loader to download the video's audio to a local directory\n",
    "- how to use the `OpenAIWhisperParser` parser to convert that audio to text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4720268-ddab-4c18-9072-10aab8f0ac7c",
   "metadata": {},
   "source": [
    "### 2.1 Install the Required Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "37dbeb50-d6c5-4db0-88da-1ef9d3e47417",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip -q install yt_dlp\n",
    "!pip -q install pydub"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a243b258-eae3-46b0-803f-cd897b31cf78",
   "metadata": {},
   "source": [
    "### 2.2 Load the YouTube Audio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f25593f9-a6d2-4137-94cb-881141ca99fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders.generic import GenericLoader\n",
    "from langchain.document_loaders.parsers import OpenAIWhisperParser\n",
    "from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5ca7b99a-ba4d-4989-aed6-be76acb405c0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I\n",
      "[youtube] jGwO_UgTS7I: Downloading webpage\n",
      "[youtube] jGwO_UgTS7I: Downloading ios player API JSON\n",
      "[youtube] jGwO_UgTS7I: Downloading android player API JSON\n",
      "[youtube] jGwO_UgTS7I: Downloading m3u8 information\n",
      "[info] jGwO_UgTS7I: Downloading 1 format(s): 140\n",
      "[download] docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded\n",
      "[download] 100% of   69.76MiB\n",
      "[ExtractAudio] Not converting audio docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a\n",
      "Transcribing part 1!\n",
      "Transcribing part 2!\n",
      "Transcribing part 3!\n",
      "Transcribing part 4!\n"
     ]
    }
   ],
   "source": [
    "url = \"https://www.youtube.com/watch?v=jGwO_UgTS7I\"\n",
    "save_dir = \"docs/youtube/\"\n",
    "\n",
    "# Create a GenericLoader instance\n",
    "loader = GenericLoader(\n",
    "    # Download the audio of the YouTube video at url and save it under save_dir\n",
    "    YoutubeAudioLoader([url], save_dir),\n",
    "\n",
    "    # Use OpenAIWhisperParser to transcribe the audio into text\n",
    "    OpenAIWhisperParser()\n",
    ")\n",
    "\n",
    "# Call load() to download and transcribe the audio\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffb4db5d-39b5-4cd7-82d9-824ed71fc116",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 2.3 Explore the Loaded Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fd91c34-ac19-4a09-8ca0-99262011d9ba",
   "metadata": {},
   "source": [
    "After loading, the documents are stored in the `docs` variable:\n",
    "- `docs` is a `list`.\n",
    "- Printing the length of `docs` shows how many documents were loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "ddb89cee-32bd-4c5f-91f1-c46d1f0300da",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'list'>\n"
     ]
    }
   ],
   "source": [
    "print(type(docs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "2cea4ff3-8548-4158-9e55-a574de0fd29e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    }
   ],
   "source": [
    "print(len(docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24952655-128e-4a7b-b8c0-e93156acbe1b",
   "metadata": {},
   "source": [
    "Each element of `docs` is a document of type `langchain.schema.document.Document`, which has two attributes:\n",
    "- `page_content` holds the content of the document.\n",
    "- `metadata` holds descriptive data about the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "d89bf91d-d39b-4682-9c56-0cde449d6051",
   "metadata": {},
   "outputs": [],
   "source": [
    "doc = docs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "e6df253f-ad9e-42d5-b6e1-47b7c0d2d564",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'langchain.schema.document.Document'>\n"
     ]
    }
   ],
   "source": [
    "print(type(doc))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "80e4b798-875e-4f0e-ba16-5277f8ec1f62",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Welcome to CS229 Machine Learning. Uh, some of you know that this is a class that's taught at Stanford for a long time. And this is often the class that, um, I most look forward to teaching each year because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, got- built many of their products and services and startups that I'm sure, many of you or probably all of you are using, uh, uh, today. Um, so what I want to do today was spend s\n"
     ]
    }
   ],
   "source": [
    "print(doc.page_content[0:500])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "7f363e33-6a4d-4b78-aa7d-1b8cf6b59567",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'source': 'docs/youtube/Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a', 'chunk': 0}\n"
     ]
    }
   ],
   "source": [
    "print(doc.metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b7ddc7d-2d40-4811-8cb3-5e73344ebe24",
   "metadata": {},
   "source": [
    "## 3. Web Pages\n",
    "\n",
    "In Part 2, given a YouTube video link (URL), we used a LangChain loader to download the video's audio locally and then used the `OpenAIWhisperParser` parser to convert the audio to text.\n",
    "\n",
    "In this part, we learn how to load a web document from a given link (URL). Here we load a page hosted on GitHub; the underlying document is written in Markdown."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b28abf4d-4907-47f6-b54d-6d322a5794e6",
   "metadata": {},
   "source": [
    "### 3.1 Load the Web Page"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "1a68375f-44ae-4905-bf9c-1f01ec800481",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import WebBaseLoader\n",
    "\n",
    "\n",
    "# Create a WebBaseLoader instance with the page URL and custom request headers\n",
    "url = \"https://github.com/basecamp/handbook/blob/master/37signals-is-you.md\"\n",
    "header = {'User-Agent': 'python-requests/2.27.1', \n",
    "          'Accept-Encoding': 'gzip, deflate, br', \n",
    "          'Accept': '*/*',\n",
    "          'Connection': 'keep-alive'}\n",
    "loader = WebBaseLoader(web_path=url,header_template=header)\n",
    "\n",
    "# Call load() to fetch and parse the page\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc24f44a-01f5-49a3-9529-2f05c1053b2c",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 3.2 Explore the Loaded Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2f108b9-713b-4b98-b44d-4dfc3dcbcde2",
   "metadata": {},
   "source": [
    "After loading, the documents are stored in the `docs` variable:\n",
    "- `docs` is a `list`.\n",
    "- Printing the length of `docs` shows how many documents were loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "8c8670fa-203b-4c35-9266-f976f50f0f5d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'list'>\n"
     ]
    }
   ],
   "source": [
    "print(type(docs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "0e85a526-55e7-4186-8697-d16a891bcabc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    }
   ],
   "source": [
    "print(len(docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4397791-4d3c-4609-86be-3cda90a3f2fc",
   "metadata": {},
   "source": [
    "Each element of `docs` is a document of type `langchain.schema.document.Document`, which has two attributes:\n",
    "- `page_content` holds the content of the document.\n",
    "- `metadata` holds descriptive data about the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "26423c51-6503-478c-b17d-bbe57049a04c",
   "metadata": {},
   "outputs": [],
   "source": [
    "doc = docs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "a243638f-0c23-46b8-8854-13f1f1de6f0a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'langchain.schema.document.Document'>\n"
     ]
    }
   ],
   "source": [
    "print(type(doc))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "9f237541-79b7-4ae4-9e0e-27a28af99b7a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"payload\":{\"allShortcutsEnabled\":false,\"fileTree\":{\"\":{\"items\":[{\"name\":\"37signals-is-you.md\",\"path\":\"37signals-is-you.md\",\"contentType\":\"file\"},{\"name\":\"LICENSE.md\",\"path\":\"LICENSE.md\",\"contentType\":\"file\"},{\"name\":\"README.md\",\"path\":\"README.md\",\"contentType\":\"file\"},{\"name\":\"benefits-and-perks.md\",\"path\":\"benefits-and-perks.md\",\"contentType\":\"file\"},{\"name\":\"code-of-conduct.md\",\"path\":\"code-of-conduct.md\",\"contentType\":\"file\"},{\"name\":\"faq.md\",\"path\":\"faq.md\",\"contentType\":\"file\"},{\"name\":\"getting-started.md\",\"path\":\"getting-started.md\",\"contentType\":\"file\"},{\"name\":\"how-we-work.md\",\"path\":\"how-we-work.md\",\"contentType\":\"file\"},{\"name\":\"international-travel-guide.md\",\"path\":\"international-travel-guide.md\",\"contentType\":\"file\"},{\"name\":\"making-a-career.md\",\"path\":\"making-a-career.md\",\"contentType\":\"file\"},{\"name\":\"managing-work-devices.md\",\"path\":\"managing-work-devices.md\",\"contentType\":\"file\"},{\"name\":\"moonlighting.md\",\"path\":\"moonlighting.md\",\"contentType\":\"file\"},{\"name\":\"our-internal-systems.md\",\"path\":\"our-internal-systems.md\",\"contentType\":\"file\"},{\"name\":\"our-rituals.md\",\"path\":\"our-rituals.md\",\"contentType\":\"file\"},{\"name\":\"performance-plans.md\",\"path\":\"performance-plans.md\",\"contentType\":\"file\"},{\"name\":\"product-histories.md\",\"path\":\"product-histories.md\",\"contentType\":\"file\"},{\"name\":\"stateFMLA.md\",\"path\":\"stateFMLA.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-data.md\",\"path\":\"titles-for-data.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-designers.md\",\"path\":\"titles-for-designers.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-ops.md\",\"path\":\"titles-for-ops.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-programmers.md\",\"path\":\"titles-for-programmers.md\",\"contentType\":\"file\"},{\"name\":\"title
s-for-support.md\",\"path\":\"titles-for-support.md\",\"contentType\":\"file\"},{\"name\":\"vocabulary.md\",\"path\":\"vocabulary.md\",\"contentType\":\"file\"},{\"name\":\"what-influenced-us.md\",\"path\":\"what-influenced-us.md\",\"contentType\":\"file\"},{\"name\":\"what-we-stand-for.md\",\"path\":\"what-we-stand-for.md\",\"contentType\":\"file\"},{\"name\":\"where-we-work.md\",\"path\":\"where-we-work.md\",\"contentType\":\"file\"}],\"totalCount\":26}},\"fileTreeProcessingTime\":3.936437,\"foldersToFetch\":[],\"reducedMotionEnabled\":null,\"repo\":{\"id\":90042196,\"defaultBranch\":\"master\",\"name\":\"handbook\",\"ownerLogin\":\"basecamp\",\"currentUserCanPush\":false,\"isFork\":false,\"isEmpty\":false,\"createdAt\":\"2017-05-02T14:23:23.000Z\",\"ownerAvatar\":\"https://avatars.githubusercontent.com/u/13131?v=4\",\"public\":true,\"private\":false,\"isOrgOwned\":true},\"refInfo\":{\"name\":\"master\",\"listCacheKey\":\"v0:1682672280.0\",\"canEdit\":false,\"refType\":\"branch\",\"currentOid\":\"1577f27c63aa8df61996924824afb8df6f1bf20e\"},\"path\":\"37signals-is-you.md\",\"currentUser\":null,\"blob\":{\"rawBlob\":null,\"colorizedLines\":null,\"stylingDirectives\":null,\"csv\":null,\"csvError\":null,\"dependabotInfo\":{\"showConfigurationBanner\":false,\"configFilePath\":null,\"networkDependabotPath\":\"/basecamp/handbook/network/updates\",\"dismissConfigurationNoticePath\":\"/settings/dismiss-notice/dependabot_configuration_notice\",\"configurationNoticeDismissed\":null,\"repoAlertsPath\":\"/basecamp/handbook/security/dependabot\",\"repoSecurityAndAnalysisPath\":\"/basecamp/handbook/settings/security_analysis\",\"repoOwnerIsOrg\":true,\"currentUserCanAdminRepo\":false},\"displayName\":\"37signals-is-you.md\",\"displayUrl\":\"https://github.com/basecamp/handbook/blob/master/37signals-is-you.md?raw=true\",\"headerInfo\":{\"blobSize\":\"2.19 KB\",\"deleteInfo\":{\"deletePath\":null,\"deleteTooltip\":\"You must be signed in to make or propose 
changes\"},\"editInfo\":{\"editTooltip\":\"You must be signed in to make or propose changes\"},\"ghDesktopPath\":\"https://desktop.github.com\",\"gitLfsPath\":null,\"onBranch\":true,\"shortPath\":\"e5ca0f0\",\"siteNavLoginPath\":\"/login?return_to=https%3A%2F%2Fgithub.com%2Fbasecamp%2Fhandbook%2Fblob%2Fmaster%2F37signals-is-you.md\",\"isCSV\":false,\"isRichtext\":true,\"toc\":[{\"level\":1,\"text\":\"37signals Is You\",\"anchor\":\"37signals-is-you\",\"htmlText\":\"37signals Is You\"}],\"lineInfo\":{\"truncatedLoc\":\"11\",\"truncatedSloc\":\"6\"},\"mode\":\"file\"},\"image\":false,\"isCodeownersFile\":null,\"isValidLegacyIssueTemplate\":false,\"issueTemplateHelpUrl\":\"https://docs.github.com/articles/about-issue-and-pull-request-templates\",\"issueTemplate\":null,\"discussionTemplate\":null,\"language\":\"Markdown\",\"large\":false,\"loggedIn\":false,\"newDiscussionPath\":\"/basecamp/handbook/discussions/new\",\"newIssuePath\":\"/basecamp/handbook/issues/new\",\"planSupportInfo\":{\"repoIsFork\":null,\"repoOwnedByCurrentUser\":null,\"requestFullPath\":\"/basecamp/handbook/blob/master/37signals-is-you.md\",\"showFreeOrgGatedFeatureMessage\":null,\"showPlanSupportBanner\":null,\"upgradeDataAttributes\":null,\"upgradePath\":null},\"publishBannersInfo\":{\"dismissActionNoticePath\":\"/settings/dismiss-notice/publish_action_from_dockerfile\",\"dismissStackNoticePath\":\"/settings/dismiss-notice/publish_stack_from_file\",\"releasePath\":\"/basecamp/handbook/releases/new?marketplace=true\",\"showPublishActionBanner\":false,\"showPublishStackBanner\":false},\"renderImageOrRaw\":false,\"richText\":\"37signals Is You\\nEveryone working at 37signals represents 37signals. When a customer gets a response from Merissa on support, Merissa is 37signals. When a customer reads a tweet by Eron that our systems are down, Eron is 37signals. In those situations, all the other stuff we do to cultivate our best image is secondary. 
What’s right in front of someone in a time of need is what they’ll remember.\\nThat’s what we mean when we say marketing is everyone’s responsibility, and that it pays to spend the time to recognize that. This means avoiding the bullshit of outage language and bending our policies, not just lending your ears. It means taking the time to get the writing right and consider how you’d feel if you were on the other side of the interaction.\\nThe vast majority of our customers come from word of mouth and much of that word comes from people in our audience. This is an audience we’ve been educating and entertaining for 20 years and counting, and your voice is part of us now, whether you like it or not! Tell us and our audience what you have to say!\\nThis goes for tools and techniques as much as it goes for prose. 37signals not only tries to out-teach the competition, but also out-share and out-collaborate. We’re prolific open source contributors through Ruby on Rails, Trix, Turbolinks, Stimulus, and many other projects. Extracting the common infrastructure that others could use as well is satisfying, important work, and we should continue to do that.\\nIt’s also worth mentioning that joining 37signals can be all-consuming. We’ve seen it happen. You dig 37signals, so you feel pressure to contribute, maybe overwhelmingly so. The people who work here are some of the best and brightest in our industry, so the self-imposed burden to be exceptional is real. But here’s the thing: stop it. Settle in. We’re glad you love this job because we all do too, but at the end of the day it’s a job. Do your best work, collaborate with your team, write, read, learn, and then turn off your computer and play with your dog. 
We’ll all be better for it.\\n\",\"renderedFileInfo\":null,\"tabSize\":8,\"topBannersInfo\":{\"overridingGlobalFundingFile\":false,\"globalPreferredFundingPath\":null,\"repoOwner\":\"basecamp\",\"repoName\":\"handbook\",\"showInvalidCitationWarning\":false,\"citationHelpUrl\":\"https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/about-citation-files\",\"showDependabotConfigurationBanner\":false,\"actionsOnboardingTip\":null},\"truncated\":false,\"viewable\":true,\"workflowRedirectUrl\":null,\"symbols\":{\"timedOut\":false,\"notAnalyzed\":true,\"symbols\":[]}},\"csrf_tokens\":{\"/basecamp/handbook/branches\":{\"post\":\"o3HTNEDyuKtINffBkguVz-P3KUwBN04ZM_vvyoNKymcy66lDUtXVvEi7EvsbgFoz2d3qgU_earsuIftbbtKlcg\"}}},\"title\":\"handbook/37signals-is-you.md at master · basecamp/handbook\",\"locale\":\"en\"}\n"
     ]
    }
   ],
   "source": [
    "print(doc.page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52025103-205e-4137-a116-89f37fcfece1",
   "metadata": {},
   "source": [
    "The content printed above contains a lot of redundant information. Data like this usually needs further post-processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "a7c5281f-aeed-4ee7-849b-bbf9fd3e35c7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "37signals Is You\n",
      "Everyone working at 37signals represents 37signals. When a customer gets a response from Merissa on support, Merissa is 37signals. When a customer reads a tweet by Eron that our systems are down, Eron is 37signals. In those situations, all the other stuff we do to cultivate our best image is secondary. What’s right in front of someone in a time of need is what they’ll remember.\n",
      "That’s what we mean when we say marketing is everyone’s responsibility, and that it pays to spend the time to recognize that. This means avoiding the bullshit of outage language and bending our policies, not just lending your ears. It means taking the time to get the writing right and consider how you’d feel if you were on the other side of the interaction.\n",
      "The vast majority of our customers come from word of mouth and much of that word comes from people in our audience. This is an audience we’ve been educating and entertaining for 20 years and counting, and your voice is part of us now, whether you like it or not! Tell us and our audience what you have to say!\n",
      "This goes for tools and techniques as much as it goes for prose. 37signals not only tries to out-teach the competition, but also out-share and out-collaborate. We’re prolific open source contributors through Ruby on Rails, Trix, Turbolinks, Stimulus, and many other projects. Extracting the common infrastructure that others could use as well is satisfying, important work, and we should continue to do that.\n",
      "It’s also worth mentioning that joining 37signals can be all-consuming. We’ve seen it happen. You dig 37signals, so you feel pressure to contribute, maybe overwhelmingly so. The people who work here are some of the best and brightest in our industry, so the self-imposed burden to be exceptional is real. But here’s the thing: stop it. Settle in. We’re glad you love this job because we all do too, but at the end of the day it’s a job. Do your best work, collaborate with your team, write, read, learn, and then turn off your computer and play with your dog. We’ll all be better for it.\n",
      "\n"
     ]
    }
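   ],
   "source": [
    "import json\n",
    "\n",
    "# The page content is a JSON string; parse it into a Python dict\n",
    "page_data = json.loads(doc.page_content)\n",
    "# The text of the Markdown file is stored under payload -> blob -> richText\n",
    "extracted_markdown = page_data['payload']['blob']['richText']\n",
    "print(extracted_markdown)"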
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d35e99b4-dd67-4940-bbeb-b2a59bf8cd3d",
   "metadata": {},
   "source": [
    "## 4. Notion Documents\n",
    "\n",
    "- Open the [Notion example document](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f) and click the Duplicate button in the upper right to copy the document into your own Notion workspace.\n",
    "- Click the `⋯` button in the upper right and choose to export as Markdown & CSV. The export is a zip archive.\n",
    "- Unzip it and save the Markdown files to the local path `docs/Notion_DB/`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8cf2778-288c-4964-81e7-0ed881e31652",
   "metadata": {},
   "source": [
    "### 4.1 Load the Notion Markdown Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "081f5ee4-6b5d-45bf-a7e6-079abc560729",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import NotionDirectoryLoader\n",
    "loader = NotionDirectoryLoader(\"docs/Notion_DB\")\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88d5d094-a490-4c64-ab93-5c5cec0853aa",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 4.2 Explore the Loaded Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3ffe318-d22a-4687-b613-f679ad9ad616",
   "metadata": {},
   "source": [
    "After loading, the documents are stored in the `docs` variable:\n",
    "- `docs` is a `list`.\n",
    "- Printing the length of `docs` shows how many documents were loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "106323dd-0d24-40d4-8302-ed2a35d13347",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'list'>\n"
     ]
    }
   ],
   "source": [
    "print(type(docs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "fa3ee0fe-7daa-4193-9ee1-4ee89cb3b843",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    }
   ],
   "source": [
    "print(len(docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3cd3feb-d2c1-46b6-8b16-d4b2101c5632",
   "metadata": {},
   "source": [
    "Each element of `docs` is a document of type `langchain.schema.document.Document`, which has two attributes:\n",
    "- `page_content` holds the content of the document.\n",
    "- `metadata` holds descriptive data about the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "5b220df6-fd0b-4b62-9da4-29962926fe87",
   "metadata": {},
   "outputs": [],
   "source": [
    "doc = docs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "98ef8fbf-e820-41de-919d-c462a910f4f1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'langchain.schema.document.Document'>\n"
     ]
    }
   ],
   "source": [
    "print(type(doc))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "933da0f3-4d5f-4363-9142-050ecf226c1f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# Blendle's Employee Handbook\n",
      "\n",
      "This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change. \n",
      "\n",
      "**Everything related to working at Blendle and the people of Blendle, made public.**\n",
      "\n",
      "These are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pv\n"
     ]
    }
   ],
   "source": [
    "print(doc.page_content[0:500])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}