{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 100 Days of Machine Learning, Day 1: Data Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To set up the Anaconda environment, see https://zhuanlan.zhihu.com/p/33358809\n", "\n", "## Step 1: Import the required libraries\n", "We import these two libraries every time. NumPy provides mathematical functions, and Pandas is used to import and manage datasets." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Import the dataset\n", "Datasets are usually in .csv format. A CSV file stores tabular data as plain text, and each line of the file is one data record. We use the Pandas read_csv function to read a local CSV file into a DataFrame, then build the matrix of independent variables and the vector of the dependent variable from that DataFrame." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Step 2: Importing dataset\nX\n[['France' 44.0 72000.0]\n ['Spain' 27.0 48000.0]\n ['Germany' 30.0 54000.0]\n ['Spain' 38.0 61000.0]\n ['Germany' 40.0 nan]\n ['France' 35.0 58000.0]\n ['Spain' nan 52000.0]\n ['France' 48.0 79000.0]\n ['Germany' 50.0 83000.0]\n ['France' 37.0 67000.0]]\nY\n['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " Country Age Salary Purchased\n", "0 France 44.0 72000.0 No\n", "1 Spain 27.0 48000.0 Yes\n", "2 Germany 30.0 54000.0 No\n", "3 Spain 38.0 61000.0 No\n", "4 Germany 40.0 NaN Yes\n", "5 France 35.0 58000.0 Yes\n", "6 Spain NaN 52000.0 No\n", "7 France 48.0 79000.0 Yes\n", "8 Germany 50.0 83000.0 No\n", "9 France 37.0 67000.0 Yes" ], "text/html": "
|   | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |