prompt-engineering-for-deve…/content/LangChain Chat with Your Data/2.文档加载 Document Loading.ipynb
2023-07-15 22:45:48 +08:00

{"cells": [{"cell_type": "markdown", "id": "cc2eb3ad-8c1c-406a-b7aa-a3f61b754ac5", "metadata": {}, "source": ["# Chapter 2: Document Loading\n", "\n", "- I. PDF Documents\n", "    - 1.1 Install the Required Packages\n", "    - 1.2 Load the PDF Document\n", "    - 1.3 Explore the Loaded Data\n", "- II. YouTube Audio\n", "    - 2.1 Install the Required Packages\n", "    - 2.2 Load the YouTube Audio Document\n", "    - 2.3 Explore the Loaded Data\n", "- III. Web Documents\n", "    - 3.1 Load the Web Document\n", "    - 3.2 Explore the Loaded Data\n", "- IV. Notion Documents\n", "    - 4.1 Load the Notion Markdown Documents\n", "    - 4.2 Explore the Loaded Data\n"]}, {"cell_type": "markdown", "id": "89e612a5", "metadata": {}, "source": ["As in the earlier courses, we first need to configure the environment so that we can access the OpenAI API. The setup is shown below; see the other courses for a detailed walkthrough."]}, {"cell_type": "code", "execution_count": 1, "id": "582125c3-2afb-4cca-b651-c1810a5e5c22", "metadata": {}, "outputs": [], "source": ["# Install the latest version of LangChain\n", "!pip install -q 
langchain --upgrade"]}, {"cell_type": "code", "execution_count": 2, "id": "bb73a77f-e17c-45a2-b456-e3ad2bf0fb5c", "metadata": {}, "outputs": [], "source": ["import os\n", "import openai\n", "import sys\n", "sys.path.append('../..')\n", "\n", "from dotenv import load_dotenv, find_dotenv\n", "_ = load_dotenv(find_dotenv()) \n", "\n", "openai.api_key = os.environ['OPENAI_API_KEY']"]}, {"cell_type": "markdown", "id": "63558db2-5279-4c1b-9bec-355ab04731e6", "metadata": {}, "source": ["## I. PDF Documents\n", "\n", "First, let's load a [PDF document](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf). It is the transcript of Professor Andrew Ng's 2009 machine learning course. Because the transcript was generated automatically, the text may not always be coherent or fluent."]}, {"cell_type": "markdown", "id": "dd5fe85c-6aae-4739-9b47-68e791afc9ac", "metadata": {}, "source": ["### 1.1 Install the Required Packages"]}, {"cell_type": "code", "execution_count": 3, "id": "c527f944-35dc-44a2-9cf9-9887cf315f3a", "metadata": {}, "outputs": [], "source": ["!pip install -q pypdf"]}, {"cell_type": "markdown", "id": "8dcb2102-0414-4130-952b-3b6fa33b61bb", "metadata": {}, "source": ["### 1.2 Load the PDF Document"]}, {"cell_type": "code", "execution_count": 4, "id": "52d9891f-a8cc-47c4-8c09-81794647a720", "metadata": {}, "outputs": [], "source": ["from langchain.document_loaders import PyPDFLoader\n", "\n", "# Create a PyPDFLoader instance; its argument is the path of the PDF to load\n", "loader = PyPDFLoader(\"docs/cs229_lectures/MachineLearning-Lecture01.pdf\")\n", "\n", "# Call the load method of PyPDFLoader to load the PDF file\n", "pages = 
loader.load()"]}, {"cell_type": "markdown", "id": "68d40600-49ab-42a3-97d0-b9a2c4ab8139", "metadata": {}, "source": ["### 1.3 Explore the Loaded Data"]}, {"cell_type": "markdown", "id": "feca9f1e-1596-49f2-a6d9-6eeaeffbd90b", "metadata": {}, "source": ["After loading, the documents are stored in the `pages` variable:\n", "- `pages` is of type `list`\n", "- Printing the length of `pages` shows how many pages the PDF contains"]}, {"cell_type": "code", "execution_count": 5, "id": "9463b982-c71c-4241-b3a3-b040170eef2e", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["<class 'list'>\n"]}], "source": ["print(type(pages))"]}, {"cell_type": "code", "execution_count": 6, "id": "67a2b815-586f-43a5-96a4-cfe46001a766", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["22\n"]}], "source": ["print(len(pages))"]}, {"cell_type": "markdown", "id": "2cde6b9d-71c8-4851-a8f6-a3f0e76f6dab", "metadata": {}, "source": ["Each element of `pages` is a document of type `langchain.schema.Document`, which has two attributes:\n", "- `page_content` holds the content of the document.\n", "- `metadata` holds descriptive data about the document."]}, {"cell_type": "code", "execution_count": 7, "id": "921827ae-3a5b-4f29-b015-a5dde3be1410", "metadata": {}, "outputs": [], "source": ["page = pages[0]"]}, {"cell_type": "code", "execution_count": 8, "id": "aadaa840-0f30-4ae3-b06b-7fe8f468d146", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["<class 'langchain.schema.document.Document'>\n"]}], "source": ["print(type(page))"]}, {"cell_type": "code", "execution_count": 9, "id": "85777ce2-42c7-4e11-b1ba-06fd6a0d8502", "metadata": {}, "outputs": [{"name": "stdout", 
"output_type": "stream", "text": ["MachineLearning-Lecture01 \n", "Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine \n", "learning class. So what I wanna do today is ju st spend a little time going over the logistics \n", "of the class, and then we'll start to talk a bit about machine learning. \n", "By way of introduction, my name's Andrew Ng and I'll be instru ctor for this class. And so \n", "I personally work in machine learning, and I' ve worked on it for about 15 years now, and \n", "I actually think that machine learning i\n"]}], "source": ["print(page.page_content[0:500])"]}, {"cell_type": "code", "execution_count": 10, "id": "8a1f8acd-f8c7-46af-a29f-df172067deba", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}\n"]}], "source": ["print(page.metadata)"]}, {"cell_type": "markdown", "id": "1e9cead7-a967-4a8f-8d3d-0f94f2ff129e", "metadata": {}, "source": ["## II. YouTube Audio\n", "\n", "In Part I, we learned how to load a PDF document. In this part, given a YouTube video URL, we learn:\n", "- How to use a LangChain loader to download the video's audio to local storage\n", "- How to then use the OpenAIWhisperParser parser to transcribe the audio into text"]}, {"cell_type": "markdown", "id": "b4720268-ddab-4c18-9072-10aab8f0ac7c", "metadata": {}, "source": ["### 2.1 Install the Required Packages"]}, {"cell_type": "code", "execution_count": 11, "id": "37dbeb50-d6c5-4db0-88da-1ef9d3e47417", "metadata": {}, "outputs": [], "source": ["!pip -q install yt_dlp\n", "!pip -q install pydub"]}, {"cell_type": "markdown", "id": "a243b258-eae3-46b0-803f-cd897b31cf78", "metadata": {}, "source": ["### 
2.2 Load the YouTube Audio Document"]}, {"cell_type": "code", "execution_count": 12, "id": "f25593f9-a6d2-4137-94cb-881141ca99fd", "metadata": {}, "outputs": [], "source": ["from langchain.document_loaders.generic import GenericLoader\n", "from langchain.document_loaders.parsers import OpenAIWhisperParser\n", "from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader"]}, {"cell_type": "code", "execution_count": 2, "id": "5ca7b99a-ba4d-4989-aed6-be76acb405c0", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I\n", "[youtube] jGwO_UgTS7I: Downloading webpage\n", "[youtube] jGwO_UgTS7I: Downloading ios player API JSON\n", "[youtube] jGwO_UgTS7I: Downloading android player API JSON\n", "[youtube] jGwO_UgTS7I: Downloading m3u8 information\n", "[info] jGwO_UgTS7I: Downloading 1 format(s): 140\n", "[download] docs/youtube//Stanford CS229\uff1a Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded\n", "[download] 100% of 69.76MiB\n", "[ExtractAudio] Not converting audio docs/youtube//Stanford CS229\uff1a Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a\n", "Transcribing part 1!\n", "Transcribing part 2!\n", "Transcribing part 3!\n", "Transcribing part 4!\n"]}], "source": ["url=\"https://www.youtube.com/watch?v=jGwO_UgTS7I\"\n", "save_dir=\"docs/youtube/\"\n", "\n", "# Create a GenericLoader instance\n", "loader = GenericLoader(\n", "    # Download the audio of the YouTube video at url and save it under the local path save_dir\n", "    YoutubeAudioLoader([url], save_dir),\n", "\n", "    # Use the OpenAIWhisperParser parser to transcribe the audio into text\n", "    OpenAIWhisperParser()\n", ")\n", "\n", "# Call the load method of GenericLoader to load the video's audio\n", "docs = loader.load()"]}, {"cell_type": "markdown", "id": "ffb4db5d-39b5-4cd7-82d9-824ed71fc116", "metadata": {"tags": []}, "source": ["### 2.3 Explore the Loaded Data"]}, {"cell_type": "markdown", "id": "0fd91c34-ac19-4a09-8ca0-99262011d9ba", "metadata": {}, "source": ["After loading, the documents are stored in the `docs` variable:\n", "- `docs` is of type `list`\n", "- Printing the length of `docs` shows how many documents it contains"]}, {"cell_type": "code", "execution_count": 15, "id": "ddb89cee-32bd-4c5f-91f1-c46d1f0300da", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["<class 'list'>\n"]}], "source": ["print(type(docs))"]}, {"cell_type": "code", "execution_count": 16, "id": "2cea4ff3-8548-4158-9e55-a574de0fd29e", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1\n"]}], "source": ["print(len(docs))"]}, {"cell_type": "markdown", "id": "24952655-128e-4a7b-b8c0-e93156acbe1b", "metadata": {}, "source": ["Each element of `docs` is a document of type `langchain.schema.document.Document`, which has two attributes:\n", "- `page_content` holds the content of the document.\n", "- `metadata` holds descriptive data about the document."]}, {"cell_type": "code", "execution_count": 19, "id": "d89bf91d-d39b-4682-9c56-0cde449d6051", "metadata": {}, "outputs": [], "source": ["doc = docs[0]"]}, {"cell_type": "code", "execution_count": 20, "id": "e6df253f-ad9e-42d5-b6e1-47b7c0d2d564", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["<class 'langchain.schema.document.Document'>\n"]}], "source": ["print(type(doc))"]}, {"cell_type": 
"code", "execution_count": 13, "id": "80e4b798-875e-4f0e-ba16-5277f8ec1f62", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Welcome to CS229 Machine Learning. Uh, some of you know that this is a class that's taught at Stanford for a long time. And this is often the class that, um, I most look forward to teaching each year because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, got- built many of their products and services and startups that I'm sure, many of you or probably all of you are using, uh, uh, today. Um, so what I want to do today was spend s\n"]}], "source": ["print(doc.page_content[0:500])"]}, {"cell_type": "code", "execution_count": 14, "id": "7f363e33-6a4d-4b78-aa7d-1b8cf6b59567", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'source': 'docs/youtube/Stanford CS229\uff1a Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a', 'chunk': 0}\n"]}], "source": ["print(doc.metadata)"]}, {"cell_type": "markdown", "id": "5b7ddc7d-2d40-4811-8cb3-5e73344ebe24", "metadata": {}, "source": ["## III. Web Documents\n", "\n", "In Part II, given a YouTube video URL, we used a LangChain loader to download the video's audio to local storage and then used the OpenAIWhisperParser parser to transcribe the audio into text.\n", "\n", "In this part, we learn how to load a web document from a given URL. Here we load a web document hosted on GitHub, which is written in Markdown."]}, {"cell_type": "markdown", "id": "b28abf4d-4907-47f6-b54d-6d322a5794e6", "metadata": {}, "source": 
["### 3.1 Load the Web Document"]}, {"cell_type": "code", "execution_count": 21, "id": "1a68375f-44ae-4905-bf9c-1f01ec800481", "metadata": {}, "outputs": [], "source": ["from langchain.document_loaders import WebBaseLoader\n", "\n", "\n", "# Create a WebBaseLoader instance\n", "url = \"https://github.com/basecamp/handbook/blob/master/37signals-is-you.md\"\n", "header = {'User-Agent': 'python-requests/2.27.1', \n", "          'Accept-Encoding': 'gzip, deflate, br', \n", "          'Accept': '*/*',\n", "          'Connection': 'keep-alive'}\n", "loader = WebBaseLoader(web_path=url,header_template=header)\n", "\n", "# Call the load method of WebBaseLoader to load the page\n", "docs = loader.load()"]}, {"cell_type": "markdown", "id": "fc24f44a-01f5-49a3-9529-2f05c1053b2c", "metadata": {"tags": []}, "source": ["### 3.2 Explore the Loaded Data"]}, {"cell_type": "markdown", "id": "f2f108b9-713b-4b98-b44d-4dfc3dcbcde2", "metadata": {}, "source": ["After loading, the documents are stored in the `docs` variable:\n", "- `docs` is of type `list`\n", "- Printing the length of `docs` shows how many documents it contains"]}, {"cell_type": "code", "execution_count": 22, "id": "8c8670fa-203b-4c35-9266-f976f50f0f5d", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["<class 'list'>\n"]}], "source": ["print(type(docs))"]}, {"cell_type": "code", "execution_count": 23, "id": "0e85a526-55e7-4186-8697-d16a891bcabc", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1\n"]}], "source": ["print(len(docs))"]}, {"cell_type": "markdown", "id": "b4397791-4d3c-4609-86be-3cda90a3f2fc", "metadata": {}, "source": ["Each element of `docs` is a document of type `langchain.schema.document.Document`, which has two attributes:\n", "- `page_content` holds the content of the document.\n", "- `metadata` holds descriptive data about the document."]}, {"cell_type": "code", "execution_count": 24, "id": "26423c51-6503-478c-b17d-bbe57049a04c", "metadata": {}, "outputs": [], "source": ["doc = docs[0]"]}, {"cell_type": "code", "execution_count": 25, "id": "a243638f-0c23-46b8-8854-13f1f1de6f0a", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["<class 'langchain.schema.document.Document'>\n"]}], "source": ["print(type(doc))"]}, {"cell_type": "code", "execution_count": 26, "id": "9f237541-79b7-4ae4-9e0e-27a28af99b7a", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{\"payload\":{\"allShortcutsEnabled\":false,\"fileTree\":{\"\":{\"items\":[{\"name\":\"37signals-is-you.md\",\"path\":\"37signals-is-you.md\",\"contentType\":\"file\"},{\"name\":\"LICENSE.md\",\"path\":\"LICENSE.md\",\"contentType\":\"file\"},{\"name\":\"README.md\",\"path\":\"README.md\",\"contentType\":\"file\"},{\"name\":\"benefits-and-perks.md\",\"path\":\"benefits-and-perks.md\",\"contentType\":\"file\"},{\"name\":\"code-of-conduct.md\",\"path\":\"code-of-conduct.md\",\"contentType\":\"file\"},{\"name\":\"faq.md\",\"path\":\"faq.md\",\"contentType\":\"file\"},{\"name\":\"getting-started.md\",\"path\":\"getting-started.md\",\"contentType\":\"file\"},{\"name\":\"how-we-work.md\",\"path\":\"how-we-work.md\",\"contentType\":\"file\"},{\"name\":\"international-travel-guide.md\",\"path\":\"international-travel-guide.md\",\"contentType\":\"file\"},{\"name\":\"making-a-career.md\",\"path\":\"making-a-career.md\",\"contentType\":\"file\"},{\"name\":\"managing-work-devices.md\",\"path\":\"managing-work-devices.md\",\"contentType\":\"file\"},{\"name\":\"moonlighting.md\",\"path\":\"moonlighting.md\",\"contentType\":\"file\"},{\"name\":\"our-internal-systems.m
d\",\"path\":\"our-internal-systems.md\",\"contentType\":\"file\"},{\"name\":\"our-rituals.md\",\"path\":\"our-rituals.md\",\"contentType\":\"file\"},{\"name\":\"performance-plans.md\",\"path\":\"performance-plans.md\",\"contentType\":\"file\"},{\"name\":\"product-histories.md\",\"path\":\"product-histories.md\",\"contentType\":\"file\"},{\"name\":\"stateFMLA.md\",\"path\":\"stateFMLA.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-data.md\",\"path\":\"titles-for-data.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-designers.md\",\"path\":\"titles-for-designers.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-ops.md\",\"path\":\"titles-for-ops.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-programmers.md\",\"path\":\"titles-for-programmers.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-support.md\",\"path\":\"titles-for-support.md\",\"contentType\":\"file\"},{\"name\":\"vocabulary.md\",\"path\":\"vocabulary.md\",\"contentType\":\"file\"},{\"name\":\"what-influenced-us.md\",\"path\":\"what-influenced-us.md\",\"contentType\":\"file\"},{\"name\":\"what-we-stand-for.md\",\"path\":\"what-we-stand-for.md\",\"contentType\":\"file\"},{\"name\":\"where-we-work.md\",\"path\":\"where-we-work.md\",\"contentType\":\"file\"}],\"totalCount\":26}},\"fileTreeProcessingTime\":3.936437,\"foldersToFetch\":[],\"reducedMotionEnabled\":null,\"repo\":{\"id\":90042196,\"defaultBranch\":\"master\",\"name\":\"handbook\",\"ownerLogin\":\"basecamp\",\"currentUserCanPush\":false,\"isFork\":false,\"isEmpty\":false,\"createdAt\":\"2017-05-02T14:23:23.000Z\",\"ownerAvatar\":\"https://avatars.githubusercontent.com/u/13131?v=4\",\"public\":true,\"private\":false,\"isOrgOwned\":true},\"refInfo\":{\"name\":\"master\",\"listCacheKey\":\"v0:1682672280.0\",\"canEdit\":false,\"refType\":\"branch\",\"currentOid\":\"1577f27c63aa8df61996924824afb8df6f1bf20e\"},\"path\":\"37signals-is-you.md\",\"currentUser\":null,\"blob\":{\"rawBlob\":null,\"colorizedLines\":null,\"stylingDirect
ives\":null,\"csv\":null,\"csvError\":null,\"dependabotInfo\":{\"showConfigurationBanner\":false,\"configFilePath\":null,\"networkDependabotPath\":\"/basecamp/handbook/network/updates\",\"dismissConfigurationNoticePath\":\"/settings/dismiss-notice/dependabot_configuration_notice\",\"configurationNoticeDismissed\":null,\"repoAlertsPath\":\"/basecamp/handbook/security/dependabot\",\"repoSecurityAndAnalysisPath\":\"/basecamp/handbook/settings/security_analysis\",\"repoOwnerIsOrg\":true,\"currentUserCanAdminRepo\":false},\"displayName\":\"37signals-is-you.md\",\"displayUrl\":\"https://github.com/basecamp/handbook/blob/master/37signals-is-you.md?raw=true\",\"headerInfo\":{\"blobSize\":\"2.19 KB\",\"deleteInfo\":{\"deletePath\":null,\"deleteTooltip\":\"You must be signed in to make or propose changes\"},\"editInfo\":{\"editTooltip\":\"You must be signed in to make or propose changes\"},\"ghDesktopPath\":\"https://desktop.github.com\",\"gitLfsPath\":null,\"onBranch\":true,\"shortPath\":\"e5ca0f0\",\"siteNavLoginPath\":\"/login?return_to=https%3A%2F%2Fgithub.com%2Fbasecamp%2Fhandbook%2Fblob%2Fmaster%2F37signals-is-you.md\",\"isCSV\":false,\"isRichtext\":true,\"toc\":[{\"level\":1,\"text\":\"37signals Is You\",\"anchor\":\"37signals-is-you\",\"htmlText\":\"37signals Is 
You\"}],\"lineInfo\":{\"truncatedLoc\":\"11\",\"truncatedSloc\":\"6\"},\"mode\":\"file\"},\"image\":false,\"isCodeownersFile\":null,\"isValidLegacyIssueTemplate\":false,\"issueTemplateHelpUrl\":\"https://docs.github.com/articles/about-issue-and-pull-request-templates\",\"issueTemplate\":null,\"discussionTemplate\":null,\"language\":\"Markdown\",\"large\":false,\"loggedIn\":false,\"newDiscussionPath\":\"/basecamp/handbook/discussions/new\",\"newIssuePath\":\"/basecamp/handbook/issues/new\",\"planSupportInfo\":{\"repoIsFork\":null,\"repoOwnedByCurrentUser\":null,\"requestFullPath\":\"/basecamp/handbook/blob/master/37signals-is-you.md\",\"showFreeOrgGatedFeatureMessage\":null,\"showPlanSupportBanner\":null,\"upgradeDataAttributes\":null,\"upgradePath\":null},\"publishBannersInfo\":{\"dismissActionNoticePath\":\"/settings/dismiss-notice/publish_action_from_dockerfile\",\"dismissStackNoticePath\":\"/settings/dismiss-notice/publish_stack_from_file\",\"releasePath\":\"/basecamp/handbook/releases/new?marketplace=true\",\"showPublishActionBanner\":false,\"showPublishStackBanner\":false},\"renderImageOrRaw\":false,\"richText\":\"37signals Is You\\nEveryone working at 37signals represents 37signals. When a customer gets a response from Merissa on support, Merissa is 37signals. When a customer reads a tweet by Eron that our systems are down, Eron is 37signals. In those situations, all the other stuff we do to cultivate our best image is secondary. What\u2019s right in front of someone in a time of need is what they\u2019ll remember.\\nThat\u2019s what we mean when we say marketing is everyone\u2019s responsibility, and that it pays to spend the time to recognize that. This means avoiding the bullshit of outage language and bending our policies, not just lending your ears. 
It means taking the time to get the writing right and consider how you\u2019d feel if you were on the other side of the interaction.\\nThe vast majority of our customers come from word of mouth and much of that word comes from people in our audience. This is an audience we\u2019ve been educating and entertaining for 20 years and counting, and your voice is part of us now, whether you like it or not! Tell us and our audience what you have to say!\\nThis goes for tools and techniques as much as it goes for prose. 37signals not only tries to out-teach the competition, but also out-share and out-collaborate. We\u2019re prolific open source contributors through Ruby on Rails, Trix, Turbolinks, Stimulus, and many other projects. Extracting the common infrastructure that others could use as well is satisfying, important work, and we should continue to do that.\\nIt\u2019s also worth mentioning that joining 37signals can be all-consuming. We\u2019ve seen it happen. You dig 37signals, so you feel pressure to contribute, maybe overwhelmingly so. The people who work here are some of the best and brightest in our industry, so the self-imposed burden to be exceptional is real. But here\u2019s the thing: stop it. Settle in. We\u2019re glad you love this job because we all do too, but at the end of the day it\u2019s a job. Do your best work, collaborate with your team, write, read, learn, and then turn off your computer and play with your dog. 
We\u2019ll all be better for it.\\n\",\"renderedFileInfo\":null,\"tabSize\":8,\"topBannersInfo\":{\"overridingGlobalFundingFile\":false,\"globalPreferredFundingPath\":null,\"repoOwner\":\"basecamp\",\"repoName\":\"handbook\",\"showInvalidCitationWarning\":false,\"citationHelpUrl\":\"https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/about-citation-files\",\"showDependabotConfigurationBanner\":false,\"actionsOnboardingTip\":null},\"truncated\":false,\"viewable\":true,\"workflowRedirectUrl\":null,\"symbols\":{\"timedOut\":false,\"notAnalyzed\":true,\"symbols\":[]}},\"csrf_tokens\":{\"/basecamp/handbook/branches\":{\"post\":\"o3HTNEDyuKtINffBkguVz-P3KUwBN04ZM_vvyoNKymcy66lDUtXVvEi7EvsbgFoz2d3qgU_earsuIftbbtKlcg\"}}},\"title\":\"handbook/37signals-is-you.md at master \u00b7 basecamp/handbook\",\"locale\":\"en\"}\n"]}], "source": ["print(doc.page_content)"]}, {"cell_type": "markdown", "id": "26538237-1fc2-4915-944a-1f68f3ae3759", "metadata": {}, "source": [" "]}, {"cell_type": "markdown", "id": "52025103-205e-4137-a116-89f37fcfece1", "metadata": {}, "source": ["As the output above shows, the loaded document contains a lot of redundant information. Data like this usually needs further processing (post-processing)."]}, {"cell_type": "code", "execution_count": 27, "id": "a7c5281f-aeed-4ee7-849b-bbf9fd3e35c7", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["37signals Is You\n", "Everyone working at 37signals represents 37signals. When a customer gets a response from Merissa on support, Merissa is 37signals. When a customer reads a tweet by Eron that our systems are down, Eron is 37signals. In those situations, all the other stuff we do to cultivate our best image is secondary. 
What\u2019s right in front of someone in a time of need is what they\u2019ll remember.\n", "That\u2019s what we mean when we say marketing is everyone\u2019s responsibility, and that it pays to spend the time to recognize that. This means avoiding the bullshit of outage language and bending our policies, not just lending your ears. It means taking the time to get the writing right and consider how you\u2019d feel if you were on the other side of the interaction.\n", "The vast majority of our customers come from word of mouth and much of that word comes from people in our audience. This is an audience we\u2019ve been educating and entertaining for 20 years and counting, and your voice is part of us now, whether you like it or not! Tell us and our audience what you have to say!\n", "This goes for tools and techniques as much as it goes for prose. 37signals not only tries to out-teach the competition, but also out-share and out-collaborate. We\u2019re prolific open source contributors through Ruby on Rails, Trix, Turbolinks, Stimulus, and many other projects. Extracting the common infrastructure that others could use as well is satisfying, important work, and we should continue to do that.\n", "It\u2019s also worth mentioning that joining 37signals can be all-consuming. We\u2019ve seen it happen. You dig 37signals, so you feel pressure to contribute, maybe overwhelmingly so. The people who work here are some of the best and brightest in our industry, so the self-imposed burden to be exceptional is real. But here\u2019s the thing: stop it. Settle in. We\u2019re glad you love this job because we all do too, but at the end of the day it\u2019s a job. Do your best work, collaborate with your team, write, read, learn, and then turn off your computer and play with your dog. 
We\u2019ll all be better for it.\n", "\n"]}], "source": ["import json\n", "convert_to_json = json.loads(doc.page_content)\n", "extracted_markdown = convert_to_json['payload']['blob']['richText']\n", "print(extracted_markdown)"]}, {"cell_type": "markdown", "id": "d35e99b4-dd67-4940-bbeb-b2a59bf8cd3d", "metadata": {}, "source": ["## IV. Notion Documents\n", "\n", "- Click the Duplicate button at the top right of the [Notion sample document](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f) to copy the document into your own Notion workspace\n", "- Click the `\u22ef` button at the top right and choose to export as Markdown & CSV. The export is a zip archive\n", "- Unzip it and save the Markdown documents to the local path `docs/Notion_DB/`"]}, {"cell_type": "markdown", "id": "f8cf2778-288c-4964-81e7-0ed881e31652", "metadata": {}, "source": ["### 4.1 Load the Notion Markdown Documents"]}, {"cell_type": "code", "execution_count": 28, "id": "081f5ee4-6b5d-45bf-a7e6-079abc560729", "metadata": {}, "outputs": [], "source": ["from langchain.document_loaders import NotionDirectoryLoader\n", "loader = NotionDirectoryLoader(\"docs/Notion_DB\")\n", "docs = loader.load()"]}, {"cell_type": "markdown", "id": "88d5d094-a490-4c64-ab93-5c5cec0853aa", "metadata": {"tags": []}, "source": ["### 4.2 Explore the Loaded Data"]}, {"cell_type": "markdown", "id": "a3ffe318-d22a-4687-b613-f679ad9ad616", "metadata": {}, "source": ["After loading, the documents are stored in the `docs` variable:\n", "- `docs` is of type `list`\n", "- Printing the length of `docs` shows how many documents it contains"]}, {"cell_type": "code", "execution_count": 29, "id": "106323dd-0d24-40d4-8302-ed2a35d13347", "metadata": {}, "outputs": [{"name": "stdout", 
"output_type": "stream", "text": ["<class 'list'>\n"]}], "source": ["print(type(docs))"]}, {"cell_type": "code", "execution_count": 30, "id": "fa3ee0fe-7daa-4193-9ee1-4ee89cb3b843", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1\n"]}], "source": ["print(len(docs))"]}, {"cell_type": "markdown", "id": "b3cd3feb-d2c1-46b6-8b16-d4b2101c5632", "metadata": {}, "source": ["Each element of `docs` is a document of type `langchain.schema.document.Document`, which has two attributes:\n", "- `page_content` holds the content of the document.\n", "- `metadata` holds descriptive data about the document."]}, {"cell_type": "code", "execution_count": 31, "id": "5b220df6-fd0b-4b62-9da4-29962926fe87", "metadata": {}, "outputs": [], "source": ["doc = docs[0]"]}, {"cell_type": "code", "execution_count": 32, "id": "98ef8fbf-e820-41de-919d-c462a910f4f1", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["<class 'langchain.schema.document.Document'>\n"]}], "source": ["print(type(doc))"]}, {"cell_type": "code", "execution_count": 33, "id": "933da0f3-4d5f-4363-9142-050ecf226c1f", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["# Blendle's Employee Handbook\n", "\n", "This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change. \n", "\n", "**Everything related to working at Blendle and the people of Blendle, made public.**\n", "\n", "These are the lessons from three years of working with the people of Blendle. 
It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pv\n"]}], "source": ["print(doc.page_content[0:500])"]}], "metadata": {"kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12"}}, "nbformat": 4, "nbformat_minor": 5}