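diff --git a/content/LangChain Chat with Data/1.简介 Introduction.md b/content/LangChain Chat with Data/1.简介 Introduction.md new file mode 100644 index 0000000..0c4dffa --- /dev/null +++ b/content/LangChain Chat with Data/1.简介 Introduction.md @@ -0,0 +1,34 @@ +# Chapter 1: Introduction + +This course was developed by Harrison Chase (creator of LangChain) in collaboration with DeepLearning.AI. It shows how to use LangChain to chat with your own data. + + +## 1. Background +Large language models (LLMs) such as ChatGPT can answer many kinds of questions. However, an LLM's knowledge comes only from its training data: it has no access to user-specific information (such as personal data or a company's proprietary data), nor to information about recent events (articles or news published after the model was trained). This limits the answers it can give. + +If an LLM can draw on both what it already knows and the information in our own data when answering questions, we can get better answers. + + +## 2. Course Overview + +In this course, we learn how to use LangChain to chat with our own data. + +LangChain is an open-source framework for building LLM applications, available as both a Python package and a JavaScript package. LangChain is built around modular composition: it has many individual components that can be used together or on their own. These components include: + +- Prompts: how we get a model to perform a task +- Models: large language models, chat models, and text embedding models, with integrations for many different models +- Indexes: ways of ingesting data so that it can be used together with a model +- Chains: end-to-end implementations of a piece of functionality +- Agents: use a model as a reasoning engine + +LangChain also comes with many use cases that show how these modular components can be chained together into more complete end-to-end applications. + +If you want to learn the basics of LangChain, see the course LangChain for LLM Application Development. + +## 3. Acknowledgments + +Finally, special thanks to those who contributed to this course: +- Ankush Gola (LangChain) +- Lance Martin (LangChain) +- Geoff Ladwig (DeepLearning.AI) +- Diala Ezzedine (DeepLearning.AI) diff --git a/content/LangChain Chat with Data/2.文档加载 Document Loading.ipynb b/content/LangChain Chat with Data/2.文档加载 Document Loading.ipynb new file mode 100644 index 0000000..e94972e --- /dev/null +++ b/content/LangChain Chat with Data/2.文档加载 Document Loading.ipynb @@ -0,0 +1,819 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cc2eb3ad-8c1c-406a-b7aa-a3f61b754ac5", + "metadata": {}, + "source": [ + "# Chapter 2: Document Loading\n", + "Document loaders handle many different types of data, which may be structured or unstructured." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "582125c3-2afb-4cca-b651-c1810a5e5c22", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -q langchain --upgrade" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": 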
"bb73a77f-e17c-45a2-b456-e3ad2bf0fb5c", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import openai\n", + "import sys\n", + "sys.path.append('../..')\n", + "\n", + "from dotenv import load_dotenv, find_dotenv\n", + "_ = load_dotenv(find_dotenv()) \n", + "\n", + "openai.api_key = os.environ['OPENAI_API_KEY']" + ] + }, + { + "cell_type": "markdown", + "id": "63558db2-5279-4c1b-9bec-355ab04731e6", + "metadata": {}, + "source": [ + "## 一、PDF文档\n", + "\n", + "首先,我们来加载一个[PDF文档](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf)。该文档为吴恩达教授的2009年机器学习课程的字幕文件。因为这些字幕为自动生成,所以词句直接可能不太连贯和通畅。" + ] + }, + { + "cell_type": "markdown", + "id": "dd5fe85c-6aae-4739-9b47-68e791afc9ac", + "metadata": {}, + "source": [ + "### 1.1 安装相关包 " + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "c527f944-35dc-44a2-9cf9-9887cf315f3a", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -q pypdf" + ] + }, + { + "cell_type": "markdown", + "id": "8dcb2102-0414-4130-952b-3b6fa33b61bb", + "metadata": {}, + "source": [ + "### 1.2 加载PDF文档" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "52d9891f-a8cc-47c4-8c09-81794647a720", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.document_loaders import PyPDFLoader\n", + "\n", + "# 创建一个 PyPDFLoader Class 实例,输入为待加载的pdf文档路径\n", + "loader = PyPDFLoader(\"docs/cs229_lectures/MachineLearning-Lecture01.pdf\")\n", + "\n", + "# 调用 PyPDFLoader Class 的函数 load对pdf文件进行加载\n", + "pages = loader.load()" + ] + }, + { + "cell_type": "markdown", + "id": "68d40600-49ab-42a3-97d0-b9a2c4ab8139", + "metadata": {}, + "source": [ + "### 1.3 探索加载的数据" + ] + }, + { + "cell_type": "markdown", + "id": "feca9f1e-1596-49f2-a6d9-6eeaeffbd90b", + "metadata": {}, + "source": [ + "文档加载后储存在`pages`变量中:\n", + "- `page`的变量类型为`List`\n", + "- 打印 `pages` 的长度可以看到pdf一共包含多少页" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": 
"9463b982-c71c-4241-b3a3-b040170eef2e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "print(type(pages))" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "67a2b815-586f-43a5-96a4-cfe46001a766", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "22\n" + ] + } + ], + "source": [ + "print(len(pages))" + ] + }, + { + "cell_type": "markdown", + "id": "2cde6b9d-71c8-4851-a8f6-a3f0e76f6dab", + "metadata": {}, + "source": [ + "`page`中的每一元素为一个文档,变量类型为`langchain.schema.Document`, 文档变量类型包含两个属性\n", + "- `page_content` 包含该文档的内容。\n", + "- `meta_data` 为文档相关的描述性数据。" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "921827ae-3a5b-4f29-b015-a5dde3be1410", + "metadata": {}, + "outputs": [], + "source": [ + "page = pages[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "aadaa840-0f30-4ae3-b06b-7fe8f468d146", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "print(type(page))" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "85777ce2-42c7-4e11-b1ba-06fd6a0d8502", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "MachineLearning-Lecture01 \n", + "Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine \n", + "learning class. So what I wanna do today is ju st spend a little time going over the logistics \n", + "of the class, and then we'll start to talk a bit about machine learning. \n", + "By way of introduction, my name's Andrew Ng and I'll be instru ctor for this class. 
And so \n", + "I personally work in machine learning, and I' ve worked on it for about 15 years now, and \n", + "I actually think that machine learning i\n" + ] + } + ], + "source": [ + "print(page.page_content[0:500])" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "8a1f8acd-f8c7-46af-a29f-df172067deba", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}\n" + ] + } + ], + "source": [ + "print(page.metadata)" + ] + }, + { + "cell_type": "markdown", + "id": "1e9cead7-a967-4a8f-8d3d-0f94f2ff129e", + "metadata": {}, + "source": [ + "## 2. YouTube Audio\n", + "\n", + "In Part 1 we learned how to load a PDF document. In this part, given a YouTube video URL, we learn how to:\n", + "- use a LangChain loader to download the video's audio to a local directory\n", + "- then use the OpenAIWhisperParser parser to transcribe the audio to text" + ] + }, + { + "cell_type": "markdown", + "id": "b4720268-ddab-4c18-9072-10aab8f0ac7c", + "metadata": {}, + "source": [ + "### 2.1 Install the required packages" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "37dbeb50-d6c5-4db0-88da-1ef9d3e47417", + "metadata": {}, + "outputs": [], + "source": [ + "!pip -q install yt_dlp\n", + "!pip -q install pydub" + ] + }, + { + "cell_type": "markdown", + "id": "a243b258-eae3-46b0-803f-cd897b31cf78", + "metadata": {}, + "source": [ + "### 2.2 Load the YouTube audio" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "f25593f9-a6d2-4137-94cb-881141ca99fd", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.document_loaders.generic import GenericLoader\n", + "from langchain.document_loaders.parsers import OpenAIWhisperParser\n", + "from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "5ca7b99a-ba4d-4989-aed6-be76acb405c0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[youtube] Extracting URL: 
https://www.youtube.com/watch?v=jGwO_UgTS7I\n", + "[youtube] jGwO_UgTS7I: Downloading webpage\n", + "[youtube] jGwO_UgTS7I: Downloading ios player API JSON\n", + "[youtube] jGwO_UgTS7I: Downloading android player API JSON\n", + "[youtube] jGwO_UgTS7I: Downloading m3u8 information\n", + "[info] jGwO_UgTS7I: Downloading 1 format(s): 140\n", + "[download] docs/youtube//Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded\n", + "[download] 100% of 69.76MiB\n", + "[ExtractAudio] Not converting audio docs/youtube//Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a\n", + "Transcribing part 1!\n", + "Transcribing part 2!\n", + "Transcribing part 3!\n", + "Transcribing part 4!\n" + ] + } + ], + "source": [ + "url = \"https://www.youtube.com/watch?v=jGwO_UgTS7I\"\n", + "save_dir = \"docs/youtube/\"\n", + "\n", + "# Create a GenericLoader instance\n", + "loader = GenericLoader(\n", + " # Download the audio of the YouTube video at url and store it under the local path save_dir\n", + " YoutubeAudioLoader([url], save_dir), \n", + " \n", + " # Use the OpenAIWhisperParser parser to transcribe the audio to text\n", + " OpenAIWhisperParser()\n", + ")\n", + "\n", + "# Call the loader's load() method to load the video's audio file\n", + "docs = loader.load()" + ] + }, + { + "cell_type": "markdown", + "id": "ffb4db5d-39b5-4cd7-82d9-824ed71fc116", + "metadata": { + "tags": [] + }, + "source": [ + "### 2.3 Explore the loaded data" + ] + }, + { + "cell_type": "markdown", + "id": "0fd91c34-ac19-4a09-8ca0-99262011d9ba", + "metadata": {}, + "source": [ + "After loading, the result is stored in the `docs` variable:\n", + "- `docs` is a `list`\n", + "- Printing the length of `docs` shows how many documents it contains" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "ddb89cee-32bd-4c5f-91f1-c46d1f0300da", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "print(type(docs))" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "2cea4ff3-8548-4158-9e55-a574de0fd29e", 
+ "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1\n" + ] + } + ], + "source": [ + "print(len(docs))" + ] + }, + { + "cell_type": "markdown", + "id": "24952655-128e-4a7b-b8c0-e93156acbe1b", + "metadata": {}, + "source": [ + "`docs`中的每一元素为一个文档,变量类型为`langchain.schema.document.Document`, 文档变量类型包含两个属性\n", + "- `page_content` 包含该文档的内容。\n", + "- `meta_data` 为文档相关的描述性数据。" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "d89bf91d-d39b-4682-9c56-0cde449d6051", + "metadata": {}, + "outputs": [], + "source": [ + "doc = docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "e6df253f-ad9e-42d5-b6e1-47b7c0d2d564", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "print(type(doc))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "80e4b798-875e-4f0e-ba16-5277f8ec1f62", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Welcome to CS229 Machine Learning. Uh, some of you know that this is a class that's taught at Stanford for a long time. And this is often the class that, um, I most look forward to teaching each year because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, got- built many of their products and services and startups that I'm sure, many of you or probably all of you are using, uh, uh, today. 
Um, so what I want to do today was spend s\n" + ] + } + ], + "source": [ + "print(doc.page_content[0:500])" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "7f363e33-6a4d-4b78-aa7d-1b8cf6b59567", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'source': 'docs/youtube/Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a', 'chunk': 0}\n" + ] + } + ], + "source": [ + "print(doc.metadata)" + ] + }, + { + "cell_type": "markdown", + "id": "5b7ddc7d-2d40-4811-8cb3-5e73344ebe24", + "metadata": {}, + "source": [ + "## 3. Web Pages\n", + "\n", + "In Part 2, given a YouTube video URL, we used a LangChain loader to download the video's audio to a local directory and then used the OpenAIWhisperParser parser to transcribe the audio to text.\n", + "\n", + "In this part, we learn how to load a web page from a given URL. Here we load a page hosted on GitHub whose underlying document is in Markdown format." + ] + }, + { + "cell_type": "markdown", + "id": "b28abf4d-4907-47f6-b54d-6d322a5794e6", + "metadata": {}, + "source": [ + "### 3.1 Load the web page" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "1a68375f-44ae-4905-bf9c-1f01ec800481", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.document_loaders import WebBaseLoader\n", + "\n", + "\n", + "# Create a WebBaseLoader instance\n", + "url = \"https://github.com/basecamp/handbook/blob/master/37signals-is-you.md\"\n", + "header = {'User-Agent': 'python-requests/2.27.1', \n", + " 'Accept-Encoding': 'gzip, deflate, br', \n", + " 'Accept': '*/*',\n", + " 'Connection': 'keep-alive'}\n", + "loader = WebBaseLoader(web_path=url,header_template=header)\n", + "\n", + "# Call the loader's load() method to load the page\n", + "docs = loader.load()" + ] + }, + { + "cell_type": "markdown", + "id": "fc24f44a-01f5-49a3-9529-2f05c1053b2c", + "metadata": { + "tags": [] + }, + "source": [ + "### 3.2 Explore the loaded data" + ] + }, + { + "cell_type": "markdown", + "id": "f2f108b9-713b-4b98-b44d-4dfc3dcbcde2", + "metadata": {}, + "source": [ + "After loading, the result is stored in the `docs` variable:\n", + "- `docs` is a `list`\n", + "- Printing the length of `docs` shows how many documents it contains" + 
] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "8c8670fa-203b-4c35-9266-f976f50f0f5d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "print(type(docs))" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "0e85a526-55e7-4186-8697-d16a891bcabc", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1\n" + ] + } + ], + "source": [ + "print(len(docs))" + ] + }, + { + "cell_type": "markdown", + "id": "b4397791-4d3c-4609-86be-3cda90a3f2fc", + "metadata": {}, + "source": [ + "Each element of `docs` is a document of type `langchain.schema.document.Document`, which has two attributes:\n", + "- `page_content` holds the content of the document.\n", + "- `metadata` holds descriptive data about the document." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "26423c51-6503-478c-b17d-bbe57049a04c", + "metadata": {}, + "outputs": [], + "source": [ + "doc = docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "a243638f-0c23-46b8-8854-13f1f1de6f0a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "print(type(doc))" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "9f237541-79b7-4ae4-9e0e-27a28af99b7a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"{\"payload\":{\"allShortcutsEnabled\":false,\"fileTree\":{\"\":{\"items\":[{\"name\":\"37signals-is-you.md\",\"path\":\"37signals-is-you.md\",\"contentType\":\"file\"},{\"name\":\"LICENSE.md\",\"path\":\"LICENSE.md\",\"contentType\":\"file\"},{\"name\":\"README.md\",\"path\":\"README.md\",\"contentType\":\"file\"},{\"name\":\"benefits-and-perks.md\",\"path\":\"benefits-and-perks.md\",\"contentType\":\"file\"},{\"name\":\"code-of-conduct.md\",\"path\":\"code-of-conduct.md\",\"contentType\":\"file\"},{\"name\":\"faq.md\",\"path\":\"faq.md\",\"contentType\":\"file\"},{\"name\":\"getting-started.md\",\"path\":\"getting-started.md\",\"contentType\":\"file\"},{\"name\":\"how-we-work.md\",\"path\":\"how-we-work.md\",\"contentType\":\"file\"},{\"name\":\"international-travel-guide.md\",\"path\":\"international-travel-guide.md\",\"contentType\":\"file\"},{\"name\":\"making-a-career.md\",\"path\":\"making-a-career.md\",\"contentType\":\"file\"},{\"name\":\"managing-work-devices.md\",\"path\":\"managing-work-devices.md\",\"contentType\":\"file\"},{\"name\":\"moonlighting.md\",\"path\":\"moonlighting.md\",\"contentType\":\"file\"},{\"name\":\"our-internal-systems.md\",\"path\":\"our-internal-systems.md\",\"contentType\":\"file\"},{\"name\":\"our-rituals.md\",\"path\":\"our-rituals.md\",\"contentType\":\"file\"},{\"name\":\"performance-plans.md\",\"path\":\"performance-plans.md\",\"contentType\":\"file\"},{\"name\":\"product-histories.md\",\"path\":\"product-histories.md\",\"contentType\":\"file\"},{\"name\":\"stateFMLA.md\",\"path\":\"stateFMLA.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-data.md\",\"path\":\"titles-for-data.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-designers.md\",\"path\":\"titles-for-designers.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-ops.md\",\"path\":\"titles-for-ops.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-programmers.md\",\"path\":\"titles-for-programmers.md\",\"contentType\":\"file\"},{\"name\":\"titles-for-
support.md\",\"path\":\"titles-for-support.md\",\"contentType\":\"file\"},{\"name\":\"vocabulary.md\",\"path\":\"vocabulary.md\",\"contentType\":\"file\"},{\"name\":\"what-influenced-us.md\",\"path\":\"what-influenced-us.md\",\"contentType\":\"file\"},{\"name\":\"what-we-stand-for.md\",\"path\":\"what-we-stand-for.md\",\"contentType\":\"file\"},{\"name\":\"where-we-work.md\",\"path\":\"where-we-work.md\",\"contentType\":\"file\"}],\"totalCount\":26}},\"fileTreeProcessingTime\":3.936437,\"foldersToFetch\":[],\"reducedMotionEnabled\":null,\"repo\":{\"id\":90042196,\"defaultBranch\":\"master\",\"name\":\"handbook\",\"ownerLogin\":\"basecamp\",\"currentUserCanPush\":false,\"isFork\":false,\"isEmpty\":false,\"createdAt\":\"2017-05-02T14:23:23.000Z\",\"ownerAvatar\":\"https://avatars.githubusercontent.com/u/13131?v=4\",\"public\":true,\"private\":false,\"isOrgOwned\":true},\"refInfo\":{\"name\":\"master\",\"listCacheKey\":\"v0:1682672280.0\",\"canEdit\":false,\"refType\":\"branch\",\"currentOid\":\"1577f27c63aa8df61996924824afb8df6f1bf20e\"},\"path\":\"37signals-is-you.md\",\"currentUser\":null,\"blob\":{\"rawBlob\":null,\"colorizedLines\":null,\"stylingDirectives\":null,\"csv\":null,\"csvError\":null,\"dependabotInfo\":{\"showConfigurationBanner\":false,\"configFilePath\":null,\"networkDependabotPath\":\"/basecamp/handbook/network/updates\",\"dismissConfigurationNoticePath\":\"/settings/dismiss-notice/dependabot_configuration_notice\",\"configurationNoticeDismissed\":null,\"repoAlertsPath\":\"/basecamp/handbook/security/dependabot\",\"repoSecurityAndAnalysisPath\":\"/basecamp/handbook/settings/security_analysis\",\"repoOwnerIsOrg\":true,\"currentUserCanAdminRepo\":false},\"displayName\":\"37signals-is-you.md\",\"displayUrl\":\"https://github.com/basecamp/handbook/blob/master/37signals-is-you.md?raw=true\",\"headerInfo\":{\"blobSize\":\"2.19 KB\",\"deleteInfo\":{\"deletePath\":null,\"deleteTooltip\":\"You must be signed in to make or propose 
changes\"},\"editInfo\":{\"editTooltip\":\"You must be signed in to make or propose changes\"},\"ghDesktopPath\":\"https://desktop.github.com\",\"gitLfsPath\":null,\"onBranch\":true,\"shortPath\":\"e5ca0f0\",\"siteNavLoginPath\":\"/login?return_to=https%3A%2F%2Fgithub.com%2Fbasecamp%2Fhandbook%2Fblob%2Fmaster%2F37signals-is-you.md\",\"isCSV\":false,\"isRichtext\":true,\"toc\":[{\"level\":1,\"text\":\"37signals Is You\",\"anchor\":\"37signals-is-you\",\"htmlText\":\"37signals Is You\"}],\"lineInfo\":{\"truncatedLoc\":\"11\",\"truncatedSloc\":\"6\"},\"mode\":\"file\"},\"image\":false,\"isCodeownersFile\":null,\"isValidLegacyIssueTemplate\":false,\"issueTemplateHelpUrl\":\"https://docs.github.com/articles/about-issue-and-pull-request-templates\",\"issueTemplate\":null,\"discussionTemplate\":null,\"language\":\"Markdown\",\"large\":false,\"loggedIn\":false,\"newDiscussionPath\":\"/basecamp/handbook/discussions/new\",\"newIssuePath\":\"/basecamp/handbook/issues/new\",\"planSupportInfo\":{\"repoIsFork\":null,\"repoOwnedByCurrentUser\":null,\"requestFullPath\":\"/basecamp/handbook/blob/master/37signals-is-you.md\",\"showFreeOrgGatedFeatureMessage\":null,\"showPlanSupportBanner\":null,\"upgradeDataAttributes\":null,\"upgradePath\":null},\"publishBannersInfo\":{\"dismissActionNoticePath\":\"/settings/dismiss-notice/publish_action_from_dockerfile\",\"dismissStackNoticePath\":\"/settings/dismiss-notice/publish_stack_from_file\",\"releasePath\":\"/basecamp/handbook/releases/new?marketplace=true\",\"showPublishActionBanner\":false,\"showPublishStackBanner\":false},\"renderImageOrRaw\":false,\"richText\":\"37signals Is You\\nEveryone working at 37signals represents 37signals. When a customer gets a response from Merissa on support, Merissa is 37signals. When a customer reads a tweet by Eron that our systems are down, Eron is 37signals. In those situations, all the other stuff we do to cultivate our best image is secondary. 
What’s right in front of someone in a time of need is what they’ll remember.\\nThat’s what we mean when we say marketing is everyone’s responsibility, and that it pays to spend the time to recognize that. This means avoiding the bullshit of outage language and bending our policies, not just lending your ears. It means taking the time to get the writing right and consider how you’d feel if you were on the other side of the interaction.\\nThe vast majority of our customers come from word of mouth and much of that word comes from people in our audience. This is an audience we’ve been educating and entertaining for 20 years and counting, and your voice is part of us now, whether you like it or not! Tell us and our audience what you have to say!\\nThis goes for tools and techniques as much as it goes for prose. 37signals not only tries to out-teach the competition, but also out-share and out-collaborate. We’re prolific open source contributors through Ruby on Rails, Trix, Turbolinks, Stimulus, and many other projects. Extracting the common infrastructure that others could use as well is satisfying, important work, and we should continue to do that.\\nIt’s also worth mentioning that joining 37signals can be all-consuming. We’ve seen it happen. You dig 37signals, so you feel pressure to contribute, maybe overwhelmingly so. The people who work here are some of the best and brightest in our industry, so the self-imposed burden to be exceptional is real. But here’s the thing: stop it. Settle in. We’re glad you love this job because we all do too, but at the end of the day it’s a job. Do your best work, collaborate with your team, write, read, learn, and then turn off your computer and play with your dog. 
We’ll all be better for it.\\n\",\"renderedFileInfo\":null,\"tabSize\":8,\"topBannersInfo\":{\"overridingGlobalFundingFile\":false,\"globalPreferredFundingPath\":null,\"repoOwner\":\"basecamp\",\"repoName\":\"handbook\",\"showInvalidCitationWarning\":false,\"citationHelpUrl\":\"https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/about-citation-files\",\"showDependabotConfigurationBanner\":false,\"actionsOnboardingTip\":null},\"truncated\":false,\"viewable\":true,\"workflowRedirectUrl\":null,\"symbols\":{\"timedOut\":false,\"notAnalyzed\":true,\"symbols\":[]}},\"csrf_tokens\":{\"/basecamp/handbook/branches\":{\"post\":\"o3HTNEDyuKtINffBkguVz-P3KUwBN04ZM_vvyoNKymcy66lDUtXVvEi7EvsbgFoz2d3qgU_earsuIftbbtKlcg\"}}},\"title\":\"handbook/37signals-is-you.md at master · basecamp/handbook\",\"locale\":\"en\"}\n" + ] + } + ], + "source": [ + "print(doc.page_content)" + ] + }, + { + "cell_type": "markdown", + "id": "26538237-1fc2-4915-944a-1f68f3ae3759", + "metadata": {}, + "source": [ + " " + ] + }, + { + "cell_type": "markdown", + "id": "52025103-205e-4137-a116-89f37fcfece1", + "metadata": {}, + "source": [ + "As we can see, the content above contains a lot of redundant information. Data like this usually needs further post-processing." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "a7c5281f-aeed-4ee7-849b-bbf9fd3e35c7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "37signals Is You\n", + "Everyone working at 37signals represents 37signals. When a customer gets a response from Merissa on support, Merissa is 37signals. When a customer reads a tweet by Eron that our systems are down, Eron is 37signals. In those situations, all the other stuff we do to cultivate our best image is secondary. What’s right in front of someone in a time of need is what they’ll remember.\n", + "That’s what we mean when we say marketing is everyone’s responsibility, and that it pays to spend the time to recognize that. 
This means avoiding the bullshit of outage language and bending our policies, not just lending your ears. It means taking the time to get the writing right and consider how you’d feel if you were on the other side of the interaction.\n", + "The vast majority of our customers come from word of mouth and much of that word comes from people in our audience. This is an audience we’ve been educating and entertaining for 20 years and counting, and your voice is part of us now, whether you like it or not! Tell us and our audience what you have to say!\n", + "This goes for tools and techniques as much as it goes for prose. 37signals not only tries to out-teach the competition, but also out-share and out-collaborate. We’re prolific open source contributors through Ruby on Rails, Trix, Turbolinks, Stimulus, and many other projects. Extracting the common infrastructure that others could use as well is satisfying, important work, and we should continue to do that.\n", + "It’s also worth mentioning that joining 37signals can be all-consuming. We’ve seen it happen. You dig 37signals, so you feel pressure to contribute, maybe overwhelmingly so. The people who work here are some of the best and brightest in our industry, so the self-imposed burden to be exceptional is real. But here’s the thing: stop it. Settle in. We’re glad you love this job because we all do too, but at the end of the day it’s a job. Do your best work, collaborate with your team, write, read, learn, and then turn off your computer and play with your dog. 
We’ll all be better for it.\n", + "\n" + ] + } + ], + "source": [ + "import json\n", + "convert_to_json = json.loads(doc.page_content)\n", + "extracted_markdown = convert_to_json['payload']['blob']['richText']\n", + "print(extracted_markdown)" + ] + }, + { + "cell_type": "markdown", + "id": "d35e99b4-dd67-4940-bbeb-b2a59bf8cd3d", + "metadata": {}, + "source": [ + "## 4. Notion Documents\n", + "\n", + "- Open the [sample Notion document](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f) and click the Duplicate button in the top-right corner to copy it into your own Notion workspace\n", + "- Click the `⋯` button in the top-right corner and choose to export as Markdown & CSV. The export is a zip archive\n", + "- Unzip it and save the Markdown file to the local path `docs/Notion_DB/`" + ] + }, + { + "cell_type": "markdown", + "id": "f8cf2778-288c-4964-81e7-0ed881e31652", + "metadata": {}, + "source": [ + "### 4.1 Load the Notion Markdown document" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "081f5ee4-6b5d-45bf-a7e6-079abc560729", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.document_loaders import NotionDirectoryLoader\n", + "loader = NotionDirectoryLoader(\"docs/Notion_DB\")\n", + "docs = loader.load()" + ] + }, + { + "cell_type": "markdown", + "id": "88d5d094-a490-4c64-ab93-5c5cec0853aa", + "metadata": { + "tags": [] + }, + "source": [ + "### 4.2 Explore the loaded data" + ] + }, + { + "cell_type": "markdown", + "id": "a3ffe318-d22a-4687-b613-f679ad9ad616", + "metadata": {}, + "source": [ + "After loading, the result is stored in the `docs` variable:\n", + "- `docs` is a `list`\n", + "- Printing the length of `docs` shows how many documents it contains" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "106323dd-0d24-40d4-8302-ed2a35d13347", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "print(type(docs))" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "fa3ee0fe-7daa-4193-9ee1-4ee89cb3b843", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1\n" + ] + } + ], + "source": [ + "print(len(docs))" + ] + }, + { + 
"cell_type": "markdown", + "id": "b3cd3feb-d2c1-46b6-8b16-d4b2101c5632", + "metadata": {}, + "source": [ + "`docs`中的每一元素为一个文档,变量类型为`langchain.schema.document.Document`, 文档变量类型包含两个属性\n", + "- `page_content` 包含该文档的内容。\n", + "- `meta_data` 为文档相关的描述性数据。" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "5b220df6-fd0b-4b62-9da4-29962926fe87", + "metadata": {}, + "outputs": [], + "source": [ + "doc = docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "98ef8fbf-e820-41de-919d-c462a910f4f1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "print(type(doc))" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "933da0f3-4d5f-4363-9142-050ecf226c1f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "# Blendle's Employee Handbook\n", + "\n", + "This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change. \n", + "\n", + "**Everything related to working at Blendle and the people of Blendle, made public.**\n", + "\n", + "These are the lessons from three years of working with the people of Blendle. 
It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pv\n" + ] + } + ], + "source": [ + "print(doc.page_content[0:500])" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/content/LangChain Chat with Data/docs/Notion_DB/Blendle's Employee Handbook d331da39bd0341ed8d5ee2942fecf17a.md b/content/LangChain Chat with Data/docs/Notion_DB/Blendle's Employee Handbook d331da39bd0341ed8d5ee2942fecf17a.md new file mode 100644 index 0000000..a39a3c2 --- /dev/null +++ b/content/LangChain Chat with Data/docs/Notion_DB/Blendle's Employee Handbook d331da39bd0341ed8d5ee2942fecf17a.md @@ -0,0 +1,119 @@ +# Blendle's Employee Handbook + +This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change. + +**Everything related to working at Blendle and the people of Blendle, made public.** + +These are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more. 
+ +We've made this document public because we want to learn from you. We're very much interested in your feedback (including weeding out typo's and Dunglish ;)). Email us at hr@blendle.com. If you're starting your own company or if you're curious as to how we do things at Blendle, we hope that our employee handbook inspires you. + +If you want to work at Blendle you can check our [job ads here](https://blendle.homerun.co/). If you want to be kept in the loop about Blendle, you can sign up for [our behind the scenes newsletter](https://blendle.homerun.co/yes-keep-me-posted/tr/apply?token=8092d4128c306003d97dd3821bad06f2). + +## Blendle general + +*Information gap closing in 3... 2... 1...* + +--- + +[To Do/Read in your first week](https://www.notion.so/To-Do-Read-in-your-first-week-9ef69b65b63a4ec7b8394ec703856c32?pvs=21) + +[History](https://www.notion.so/History-29b2b8fd36dd48db80dc682119aaefef?pvs=21) + +[DNA & culture](https://www.notion.so/DNA-culture-7723839e26124ed2ba3adafe8de0a080?pvs=21) + +[General & practical ](https://www.notion.so/General-practical-87085be150824011b79891eb30ca9530?pvs=21) + +## People operations + +*You can tell a company's DNA by looking at how they deal with the practical stuff.* + +--- + +[Office](https://www.notion.so/Office-b014d3d2c62240308865d11bba495322?pvs=21) + +[Time off: holidays and national holidays](https://www.notion.so/Time-off-holidays-and-national-holidays-bd94b931280a45a6b8eb3f29c2c4b42a?pvs=21) + +[Calling in sick/better](https://www.notion.so/Calling-in-sick-better-b82ec184fd544a8e9aa926ac37bb1ab1?pvs=21) + +[Perks and benefits](https://www.notion.so/Perks-and-benefits-820593b38ebc44209fe35ae553100de6?pvs=21) + +[Travel costs and reimbursements](https://www.notion.so/Travel-costs-and-reimbursements-e76623c6e0664863a769aeed028954e2?pvs=21) + +[Parenthood](https://www.notion.so/Parenthood-a6d62b65a9d84489a75586a3c542b3f1?pvs=21) + +## People topics + +*Themes we care about.* + +--- + +[Blendle Social 
Code](https://www.notion.so/Blendle-Social-Code-685a79c8df154ee09f35b35cc147af6b?pvs=21) + +[Diversity and inclusion](https://www.notion.so/Diversity-and-inclusion-d7f9d3e6b6ef4a1ab8f2c0a7b3ea3eec?pvs=21) + +[#letstalkaboutstress](https://www.notion.so/letstalkaboutstress-d46961f6ac98432ab07b5d5afc52c2d0?pvs=21) + +## Feedback and development + +*The number 1 reason for people to work at Blendle is growth and learning from smart people.* + +--- + +[Your 1st month ](https://www.notion.so/Your-1st-month-85909edc55a34f349bbed522c5245a65?pvs=21) + +[Goals](https://www.notion.so/Goals-122bff69bd634c519cd3c6dc01dbc282?pvs=21) + +[Feedback cycle](https://www.notion.so/Feedback-cycle-5f32358dba874c39be5ca5aa464c310e?pvs=21) + +[The Matrix™ (job profiles)](https://www.notion.so/The-Matrix-job-profiles-da91736ff35545458559eceb0075ed66?pvs=21) + +[Blendle library](https://www.notion.so/Blendle-library-f34188e536234c9a8976c9d4602b0be3?pvs=21) + +## **Hiring** + +*The coolest and most impactful thing when done right.* + +--- + +[Rating systems](https://www.notion.so/Rating-systems-2ba332377459427194acc798e5f8869c?pvs=21) + +[Getting people in (branding&sourcing)](https://www.notion.so/Getting-people-in-branding-sourcing-a3277fef078041a881f56556e24f0d8a?pvs=21) + +[Highly Skilled Migrants and relocation](https://www.notion.so/Highly-Skilled-Migrants-and-relocation-84a6576fb27d4a8fae2f73e4eae57d21?pvs=21) + +## How to lead at Blendle + +*Here are some tips and tools to help you become a great leader.* + +--- + +[How to lead at Blendle ](https://www.notion.so/How-to-lead-at-Blendle-f8c6b1d989d841bb87510fc2ab1ba970?pvs=21) + +[Your check-list](https://www.notion.so/Your-check-list-aaca857a846848688da3a37f28682c15?pvs=21) + +[Leading Feedback ](https://www.notion.so/Leading-Feedback-a1970c9f7b70443d881ca92d4e98be25?pvs=21) + +[Salary talks](https://www.notion.so/Salary-talks-35681ab732c048a9bbdf8c50babe64b5?pvs=21) + +[Hiring 
](https://www.notion.so/Hiring-0bdf54d3d25f4c59bfdf3712a5104bbc?pvs=21) + +[Firing](https://www.notion.so/Firing-e0da1de62b304751bbd95a681908c7ad?pvs=21) + +[Party and study budget](https://www.notion.so/Party-and-study-budget-4e31001531c24d0fa447bbfcd6ccfd3f?pvs=21) + +[Holidays](https://www.notion.so/Holidays-1529506bb8884f0aa11cc799ced11ed0?pvs=21) + +[Sickness absence](https://www.notion.so/Sickness-absence-79a495f601df4004801475ea79b3d198?pvs=21) + +[Personal User Guide](https://www.notion.so/Personal-User-Guide-be2238ccb597412e8a517d40cda7e7d5?pvs=21) + +[Soft shizzle](https://www.notion.so/Soft-shizzle-41255d79fbe84492b153121cd7a2e3e8?pvs=21) + +## About this document + +--- + +*Lessons from three years of HR* + +[About this document and the author](https://www.notion.so/About-this-document-and-the-author-ee1faab1bcae4456b8c62043a8a194cd?pvs=21) \ No newline at end of file diff --git a/content/LangChain Chat with Data/docs/cs229_lectures/MachineLearning-Lecture01.pdf b/content/LangChain Chat with Data/docs/cs229_lectures/MachineLearning-Lecture01.pdf new file mode 100644 index 0000000..34da5de Binary files /dev/null and b/content/LangChain Chat with Data/docs/cs229_lectures/MachineLearning-Lecture01.pdf differ