{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 机器学习100天——第1天数据预处理Data Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"搭建anaconda环境参考 https://zhuanlan.zhihu.com/p/33358809\n",
"\n",
"## 第一步:导入需要的库\n",
"这两个是我们每次都需要导入的库。NumPy包含数学计算函数。Pandas用于导入和管理数据集。"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[[ 7. 2. 3. ]\n [ 4. 3.5 6. ]\n [10. 3.5 9. ]]\nSklearn verion is 0.23.1\n"
]
}
],
"source": [
"import sklearn\n",
"from sklearn.impute import SimpleImputer\n",
"#This block is an example used to learn SimpleImputer\n",
"imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')\n",
"imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])\n",
"X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]\n",
"print(imp_mean.transform(X))\n",
"print(\"Sklearn verion is {}\".format(sklearn.__version__))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A small example to learn how OneHotEncoder works (expected results shown as comments)\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"enc = OneHotEncoder(handle_unknown='ignore')\n",
"X = [['Male', 1], ['Female', 3], ['Female', 2]]\n",
"enc.fit(X)\n",
"print(enc.categories_)\n",
"# [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]\n",
"print(enc.transform([['Female', 1], ['Male', 4]]).toarray())\n",
"# [[1. 0. 1. 0. 0.]\n",
"#  [0. 1. 0. 0. 0.]]\n",
"print(enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]]))\n",
"# [['Male' 1]\n",
"#  [None 2]]\n",
"print(enc.get_feature_names(['gender', 'group']))\n",
"# ['gender_Female' 'gender_Male' 'group_1' 'group_2' 'group_3']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 第二步:导入数据集\n",
"数据集通常是.csv格式。CSV文件以文本形式保存表格数据。文件的每一行是一条数据记录。我们使用Pandas的read_csv方法读取本地csv文件为一个数据帧。然后从数据帧中制作自变量和因变量的矩阵和向量。"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Step 2: Importing dataset\nX\n[['France' 44.0 72000.0]\n ['Spain' 27.0 48000.0]\n ['Germany' 30.0 54000.0]\n ['Spain' 38.0 61000.0]\n ['Germany' 40.0 nan]\n ['France' 35.0 58000.0]\n ['Spain' nan 52000.0]\n ['France' 48.0 79000.0]\n ['Germany' 50.0 83000.0]\n ['France' 37.0 67000.0]]\nY\n['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']\n[[44.0 72000.0]\n [27.0 48000.0]\n [30.0 54000.0]\n [38.0 61000.0]\n [40.0 nan]\n [35.0 58000.0]\n [nan 52000.0]\n [48.0 79000.0]\n [50.0 83000.0]\n [37.0 67000.0]]\n"
]
}
],
"source": [
"dataset = pd.read_csv('../datasets/Data.csv')\n",
"# 不包括最后一列的所有列\n",
"X = dataset.iloc[ : , :-1].values\n",
"#取最后一列\n",
"Y = dataset.iloc[ : , 3].values\n",
"print(\"Step 2: Importing dataset\")\n",
"print(\"X\")\n",
"print(X)\n",
"print(\"Y\")\n",
"print(Y)\n",
"print(X[ : , 1:3])"
]
},
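{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on to missing values, it helps to see how many there are and where. The cell below is an added sketch, not part of the original notebook; it only uses the `dataset` DataFrame loaded above, and the column names it prints depend on the header of Data.csv."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: count NaN entries per column of the DataFrame loaded in Step 2\n",
"print(dataset.isnull().sum())\n",
"# Summary statistics of the numeric columns, for a quick sanity check\n",
"print(dataset.describe())"
]
},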
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 第三步:处理丢失数据\n",
"我们得到的数据很少是完整的。数据可能因为各种原因丢失为了不降低机器学习模型的性能需要处理数据。我们可以用整列的平均值或中间值替换丢失的数据。我们用sklearn.preprocessing库中的Imputer类完成这项任务。"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"---------------------\nStep 3: Handling the missing data\nstep2\nX\n[['France' 44.0 72000.0]\n ['Spain' 27.0 48000.0]\n ['Germany' 30.0 54000.0]\n ['Spain' 38.0 61000.0]\n ['Germany' 40.0 63777.77777777778]\n ['France' 35.0 58000.0]\n ['Spain' 38.77777777777778 52000.0]\n ['France' 48.0 79000.0]\n ['Germany' 50.0 83000.0]\n ['France' 37.0 67000.0]]\n"
]
}
],
"source": [
"# If you use the newest version of sklearn, use the lines of code commented out\n",
"from sklearn.impute import SimpleImputer\n",
"imputer = SimpleImputer(missing_values=np.nan, strategy=\"mean\")\n",
"#from sklearn.preprocessing import Imputer\n",
"# axis=0表示按列进行\n",
"#imputer = Imputer(missing_values = \"NaN\", strategy = \"mean\", axis = 0)\n",
"#print(imputer)\n",
"#\n",
"# print(X[ : , 1:3])\n",
"imputer = imputer.fit(X[ : , 1:3]) #put the data we want to process in to this imputer\n",
"X[ : , 1:3] = imputer.transform(X[ : , 1:3]) #replace the np.nan with mean\n",
"#print(X[ : , 1:3])\n",
"print(\"---------------------\")\n",
"print(\"Step 3: Handling the missing data\")\n",
"print(\"step2\")\n",
"print(\"X\")\n",
"print(X)"
]
},
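{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above fills missing values with the column mean. As noted in the explanation, the median is another option and is less affected by outliers. The following is an added sketch of the same idea with `strategy=\"median\"`; it re-reads Data.csv so the already-imputed `X` above is left untouched."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: impute the numeric columns with the median instead of the mean\n",
"from sklearn.impute import SimpleImputer\n",
"X_median = pd.read_csv('../datasets/Data.csv').iloc[ : , :-1].values\n",
"imputer_median = SimpleImputer(missing_values=np.nan, strategy=\"median\")\n",
"X_median[ : , 1:3] = imputer_median.fit_transform(X_median[ : , 1:3])\n",
"print(X_median)"
]
},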
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 第四步:解析分类数据\n",
"分类数据指的是含有标签值而不是数字值的变量。取值范围通常是固定的。例如\"Yes\"和\"No\"不能用于模型的数学计算所以需要解析成数字。为实现这一功能我们从sklearn.preprocessing库导入LabelEncoder类。"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"---------------------\nStep 4: Encoding categorical data\nX\n[[1.0 0.0 0.0 44.0 72000.0]\n [0.0 0.0 1.0 27.0 48000.0]\n [0.0 1.0 0.0 30.0 54000.0]\n [0.0 0.0 1.0 38.0 61000.0]\n [0.0 1.0 0.0 40.0 63777.77777777778]\n [1.0 0.0 0.0 35.0 58000.0]\n [0.0 0.0 1.0 38.77777777777778 52000.0]\n [1.0 0.0 0.0 48.0 79000.0]\n [0.0 1.0 0.0 50.0 83000.0]\n [1.0 0.0 0.0 37.0 67000.0]]\nY\n[0 1 0 0 1 1 0 1 0 1]\n"
]
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
"from sklearn.compose import ColumnTransformer \n",
"#labelencoder_X = LabelEncoder()\n",
"#X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0])\n",
"#Creating a dummy variable\n",
"#print(X)\n",
"ct = ColumnTransformer([(\"\", OneHotEncoder(), [0])], remainder = 'passthrough')\n",
"X = ct.fit_transform(X)\n",
"#onehotencoder = OneHotEncoder(categorical_features = [0])\n",
"#X = onehotencoder.fit_transform(X).toarray()\n",
"labelencoder_Y = LabelEncoder()\n",
"Y = labelencoder_Y.fit_transform(Y)\n",
"print(\"---------------------\")\n",
"print(\"Step 4: Encoding categorical data\")\n",
"print(\"X\")\n",
"print(X)\n",
"print(\"Y\")\n",
"print(Y)"
]
},
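{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, pandas can produce the same kind of dummy variables directly from the DataFrame. The cell below is an added sketch; the column name 'Country' is an assumption about the header of Data.csv and may need to be adjusted."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: one-hot encode the categorical column directly with pandas\n",
"# 'Country' is an assumed column name; replace it with the actual header of Data.csv\n",
"dummies = pd.get_dummies(dataset, columns=['Country'])\n",
"print(dummies.head())"
]
},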
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 第五步:拆分数据集为测试集合和训练集合\n",
"把数据集拆分成两个一个是用来训练模型的训练集合另一个是用来验证模型的测试集合。两者比例一般是80:20。我们导入sklearn.model_selection库中的train_test_split()方法。"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"---------------------\nStep 5: Splitting the datasets into training sets and Test sets\nX_train\n[[0.0 1.0 0.0 40.0 63777.77777777778]\n [1.0 0.0 0.0 37.0 67000.0]\n [0.0 0.0 1.0 27.0 48000.0]\n [0.0 0.0 1.0 38.77777777777778 52000.0]\n [1.0 0.0 0.0 48.0 79000.0]\n [0.0 0.0 1.0 38.0 61000.0]\n [1.0 0.0 0.0 44.0 72000.0]\n [1.0 0.0 0.0 35.0 58000.0]]\nX_test\n[[0.0 1.0 0.0 30.0 54000.0]\n [0.0 1.0 0.0 50.0 83000.0]]\nY_train\n[1 1 1 0 1 0 0 1]\nY_test\n[0 0]\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)\n",
"print(\"---------------------\")\n",
"print(\"Step 5: Splitting the datasets into training sets and Test sets\")\n",
"print(\"X_train\")\n",
"print(X_train)\n",
"print(\"X_test\")\n",
"print(X_test)\n",
"print(\"Y_train\")\n",
"print(Y_train)\n",
"print(\"Y_test\")\n",
"print(Y_test)"
]
},
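{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a classification target such as Y, it is often useful to keep the Yes/No ratio the same in the training and test sets. The cell below is an added sketch using the `stratify` argument of train_test_split; the `_s` variables are new names introduced here so the split used above stays unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: a stratified split keeps the class proportions of Y in both subsets\n",
"X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(\n",
"    X, Y, test_size = 0.2, random_state = 0, stratify = Y)\n",
"print(Y_train_s)\n",
"print(Y_test_s)"
]
},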
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 第六步:特征量化\n",
"大部分模型算法使用两点间的欧氏距离表示但此特征在幅度、单位和范围姿态问题上变化很大。在距离计算中高幅度的特征比低幅度特征权重更大。可用特征标准化或Z值归一化解决。导入sklearn.preprocessing库的StandardScalar类。"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"---------------------\nStep 6: Feature Scaling\nX_train\n[[-1. 2.64575131 -0.77459667 0.26306757 0.12381479]\n [ 1. -0.37796447 -0.77459667 -0.25350148 0.46175632]\n [-1. -0.37796447 1.29099445 -1.97539832 -1.53093341]\n [-1. -0.37796447 1.29099445 0.05261351 -1.11141978]\n [ 1. -0.37796447 -0.77459667 1.64058505 1.7202972 ]\n [-1. -0.37796447 1.29099445 -0.0813118 -0.16751412]\n [ 1. -0.37796447 -0.77459667 0.95182631 0.98614835]\n [ 1. -0.37796447 -0.77459667 -0.59788085 -0.48214934]]\nX_test\n[[-1. 2.64575131 -0.77459667 -1.45882927 -0.90166297]\n [-1. 2.64575131 -0.77459667 1.98496442 2.13981082]]\n"
]
}
],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"sc_X = StandardScaler()\n",
"X_train = sc_X.fit_transform(X_train)\n",
"X_test = sc_X.transform(X_test) #we should not use fit_transfer cause the u and z is determined from x_train\n",
"print(\"---------------------\")\n",
"print(\"Step 6: Feature Scaling\")\n",
"print(\"X_train\")\n",
"print(X_train)\n",
"print(\"X_test\")\n",
"print(X_test)\n"
]
},
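{
"cell_type": "markdown",
"metadata": {},
"source": [
"Standardization is not the only way to scale features; when values need to lie in a fixed range such as [0, 1], min-max scaling is a common alternative. The cell below is an added sketch with MinMaxScaler. Note that X_train and X_test were already standardized above, so this is purely illustrative; in practice you would pick one scaler and fit it on the training data only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: min-max scaling into [0, 1] as an alternative to standardization\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"mm_X = MinMaxScaler()\n",
"X_train_mm = mm_X.fit_transform(X_train)  # fit the min/max on the training data only\n",
"X_test_mm = mm_X.transform(X_test)        # reuse the training min/max for the test data\n",
"print(X_train_mm)"
]
},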
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>完整的项目请前往Github项目<a href=\"https://github.com/MachineLearning100/100-Days-Of-ML-Code\">100-Days-Of-ML-Code</a>查看。有任何的建议或者意见欢迎在issue中提出~</b>"
]
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.3 64-bit (conda)",
"metadata": {
"interpreter": {
"hash": "1b78ff499ec469310b6a6795c4effbbfc85eb20a6ba0cf828a15721670711b2c"
}
}
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3-final"
}
},
"nbformat": 4,
"nbformat_minor": 2
}