100-Days-Of-ML-Code/Code/Day 1_Data_Preprocessing.ipynb
yx-xyc 9b231e4166 For better comprehension
All the Operations I did to code for my own understanding.
2021-01-13 17:31:36 +08:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 100 Days of ML Code, Day 1: Data Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To set up an Anaconda environment, see https://zhuanlan.zhihu.com/p/33358809\n",
"\n",
"## Step 1: Import the required libraries\n",
"We import these two libraries every time: NumPy provides mathematical functions, and Pandas imports and manages datasets."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Import the dataset\n",
"Datasets usually come in .csv format. A CSV file stores tabular data as plain text, one record per line. We use Pandas' read_csv method to read the local CSV file into a DataFrame, then build the matrix of features (independent variables) and the vector of the dependent variable from it."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Step 2: Importing dataset\nX\n[['France' 44.0 72000.0]\n ['Spain' 27.0 48000.0]\n ['Germany' 30.0 54000.0]\n ['Spain' 38.0 61000.0]\n ['Germany' 40.0 nan]\n ['France' 35.0 58000.0]\n ['Spain' nan 52000.0]\n ['France' 48.0 79000.0]\n ['Germany' 50.0 83000.0]\n ['France' 37.0 67000.0]]\nY\n['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']\n"
]
}
],
"source": [
"dataset = pd.read_csv('../datasets/Data.csv')\n",
"# All columns except the last one\n",
"X = dataset.iloc[ : , :-1].values\n",
"# Only the last column\n",
"Y = dataset.iloc[ : , 3].values\n",
"print(\"Step 2: Importing dataset\")\n",
"print(\"X\")\n",
"print(X)\n",
"print(\"Y\")\n",
"print(Y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Handle the missing data\n",
"The data we get is rarely complete. Values can be missing for many reasons, and they must be handled so they do not degrade the model's performance. A common approach is to replace missing values with the mean or median of the whole column. Newer versions of scikit-learn provide the SimpleImputer class in sklearn.impute for this (the old Imputer class in sklearn.preprocessing has been removed)."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# The old Imputer class (sklearn.preprocessing) was removed in newer\n",
"# scikit-learn; use SimpleImputer from sklearn.impute instead.\n",
"from sklearn.impute import SimpleImputer\n",
"# strategy=\"mean\" replaces each missing value with its column's mean\n",
"imputer = SimpleImputer(missing_values=np.nan, strategy=\"mean\")\n",
"# Fit on the two numeric columns (age, salary), then fill their NaNs.\n",
"# Note: fit on all rows, not a slice, so the means use the full column.\n",
"imputer = imputer.fit(X[ : , 1:3])\n",
"X[ : , 1:3] = imputer.transform(X[ : , 1:3])\n",
"print(\"---------------------\")\n",
"print(\"Step 3: Handling the missing data\")\n",
"print(\"X\")\n",
"print(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Encode categorical data\n",
"Categorical variables contain label values rather than numbers, and their possible values are usually drawn from a fixed set. Labels such as \"Yes\" and \"No\" cannot be used directly in the model's mathematical computations, so they must be encoded as numbers. We use the LabelEncoder class from sklearn.preprocessing for the target, plus one-hot encoding for the country column."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"---------------------\n",
"Step 4: Encoding categorical data\n",
"X\n",
"[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01\n",
" 7.20000000e+04]\n",
" [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01\n",
" 4.80000000e+04]\n",
" [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01\n",
" 5.40000000e+04]\n",
" [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01\n",
" 6.10000000e+04]\n",
" [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01\n",
" 6.37777778e+04]\n",
" [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01\n",
" 5.80000000e+04]\n",
" [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01\n",
" 5.20000000e+04]\n",
" [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01\n",
" 7.90000000e+04]\n",
" [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01\n",
" 8.30000000e+04]\n",
" [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01\n",
" 6.70000000e+04]]\n",
"Y\n",
"[0 1 0 0 1 1 0 1 0 1]\n"
]
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
"from sklearn.compose import ColumnTransformer\n",
"# Create dummy variables for the country column (column 0).\n",
"# OneHotEncoder's categorical_features argument was removed in newer\n",
"# scikit-learn; ColumnTransformer selects the column instead.\n",
"ct = ColumnTransformer([(\"country\", OneHotEncoder(), [0])], remainder=\"passthrough\")\n",
"X = ct.fit_transform(X)\n",
"# Encode the \"Yes\"/\"No\" labels as 1/0\n",
"labelencoder_Y = LabelEncoder()\n",
"Y = labelencoder_Y.fit_transform(Y)\n",
"print(\"---------------------\")\n",
"print(\"Step 4: Encoding categorical data\")\n",
"print(\"X\")\n",
"print(X)\n",
"print(\"Y\")\n",
"print(Y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Split the dataset into a training set and a test set\n",
"We split the dataset in two: a training set used to train the model and a test set used to validate it, usually in an 80:20 ratio. We use the train_test_split() function from sklearn.model_selection."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"---------------------\n",
"Step 5: Splitting the datasets into training sets and Test sets\n",
"X_train\n",
"[[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01\n",
" 6.37777778e+04]\n",
" [ 1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01\n",
" 6.70000000e+04]\n",
" [ 0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01\n",
" 4.80000000e+04]\n",
" [ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01\n",
" 5.20000000e+04]\n",
" [ 1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01\n",
" 7.90000000e+04]\n",
" [ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01\n",
" 6.10000000e+04]\n",
" [ 1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01\n",
" 7.20000000e+04]\n",
" [ 1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01\n",
" 5.80000000e+04]]\n",
"X_test\n",
"[[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01\n",
" 5.40000000e+04]\n",
" [ 0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01\n",
" 8.30000000e+04]]\n",
"Y_train\n",
"[1 1 1 0 1 0 0 1]\n",
"Y_test\n",
"[0 0]\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)\n",
"print(\"---------------------\")\n",
"print(\"Step 5: Splitting the datasets into training sets and Test sets\")\n",
"print(\"X_train\")\n",
"print(X_train)\n",
"print(\"X_test\")\n",
"print(X_test)\n",
"print(\"Y_train\")\n",
"print(Y_train)\n",
"print(\"Y_test\")\n",
"print(Y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Feature scaling\n",
"Most ML algorithms use the Euclidean distance between data points in their computations, but raw features vary widely in magnitude, units, and range, so high-magnitude features outweigh low-magnitude ones in the distance calculation. Standardization (Z-score normalization) solves this. We use the StandardScaler class from sklearn.preprocessing."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"---------------------\n",
"Step 6: Feature Scaling\n",
"X_train\n",
"[[-1. 2.64575131 -0.77459667 0.26306757 0.12381479]\n",
" [ 1. -0.37796447 -0.77459667 -0.25350148 0.46175632]\n",
" [-1. -0.37796447 1.29099445 -1.97539832 -1.53093341]\n",
" [-1. -0.37796447 1.29099445 0.05261351 -1.11141978]\n",
" [ 1. -0.37796447 -0.77459667 1.64058505 1.7202972 ]\n",
" [-1. -0.37796447 1.29099445 -0.0813118 -0.16751412]\n",
" [ 1. -0.37796447 -0.77459667 0.95182631 0.98614835]\n",
" [ 1. -0.37796447 -0.77459667 -0.59788085 -0.48214934]]\n",
"X_test\n",
"[[ 0. 0. 0. -1. -1.]\n",
" [ 0. 0. 0. 1. 1.]]\n"
]
}
],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"sc_X = StandardScaler()\n",
"X_train = sc_X.fit_transform(X_train)\n",
"X_test = sc_X.transform(X_test)\n",
"print(\"---------------------\")\n",
"print(\"Step 6: Feature Scaling\")\n",
"print(\"X_train\")\n",
"print(X_train)\n",
"print(\"X_test\")\n",
"print(X_test)"
]
},
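{
"cell_type": "markdown",
"metadata": {},
"source": [
"The imputation, encoding, and scaling steps above can also be chained into a single scikit-learn pipeline, which fits every transformer on the training split only and so avoids test-set leakage. This is a minimal sketch added for illustration, not part of the original tutorial; it assumes the same Data.csv layout (country, age, salary, purchased) with the raw, unencoded columns as input."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
"\n",
"# Numeric columns: fill NaNs with the column mean, then standardize.\n",
"numeric = Pipeline([\n",
"    (\"impute\", SimpleImputer(strategy=\"mean\")),\n",
"    (\"scale\", StandardScaler()),\n",
"])\n",
"# Column 0 is the country label; columns 1-2 are age and salary.\n",
"preprocess = ColumnTransformer([\n",
"    (\"onehot\", OneHotEncoder(), [0]),\n",
"    (\"numeric\", numeric, [1, 2]),\n",
"])\n",
"# Fit on the training split only, then apply to both splits\n",
"# (assumes X_train/X_test still hold the raw, unencoded columns):\n",
"# X_train_p = preprocess.fit_transform(X_train)\n",
"# X_test_p = preprocess.transform(X_test)"
]
},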
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>For the complete project, see the GitHub repository <a href=\"https://github.com/MachineLearning100/100-Days-Of-ML-Code\">100-Days-Of-ML-Code</a>. Any suggestions or comments are welcome in the issues!</b>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.3 64-bit (conda)",
"metadata": {
"interpreter": {
"hash": "1b78ff499ec469310b6a6795c4effbbfc85eb20a6ba0cf828a15721670711b2c"
}
}
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3-final"
}
},
"nbformat": 4,
"nbformat_minor": 2
}