{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 100 Days of ML Code - Day 1: Data Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To set up an Anaconda environment, see https://zhuanlan.zhihu.com/p/33358809\n",
"\n",
"## Step 1: Importing the required libraries\n",
"These are the two libraries we need to import every time. NumPy provides mathematical functions, and Pandas is used to import and manage datasets."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[[ 7. 2. 3. ]\n [ 4. 3.5 6. ]\n [10. 3.5 9. ]]\nSklearn version is 0.23.1\n"
]
}
],
"source": [
"import numpy as np\n",
"import sklearn\n",
"from sklearn.impute import SimpleImputer\n",
"# This block is an example used to learn SimpleImputer\n",
"imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')\n",
"imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])\n",
"X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]\n",
"# Each np.nan is replaced by the mean of the corresponding column seen during fit\n",
"print(imp_mean.transform(X))\n",
"print(\"Sklearn version is {}\".format(sklearn.__version__))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"# This block is an example used to learn OneHotEncoder\n",
"enc = OneHotEncoder(handle_unknown='ignore')\n",
"X = [['Male', 1], ['Female', 3], ['Female', 2]]\n",
"enc.fit(X)\n",
"# Categories learned for each column:\n",
"print(enc.categories_)\n",
"# [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]\n",
"print(enc.transform([['Female', 1], ['Male', 4]]).toarray())\n",
"# [[1. 0. 1. 0. 0.]\n",
"#  [0. 1. 0. 0. 0.]]  (the unknown value 4 is encoded as all zeros)\n",
"print(enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]]))\n",
"# [['Male' 1]\n",
"#  [None 2]]\n",
"print(enc.get_feature_names(['gender', 'group']))\n",
"# ['gender_Female' 'gender_Male' 'group_1' 'group_2' 'group_3']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Importing the dataset\n",
"Datasets usually come in .csv format. A CSV file stores tabular data as plain text, with one record per line. We use Pandas' read_csv method to read a local csv file into a data frame, and then build the matrix of independent variables and the vector of the dependent variable from the data frame."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Step 2: Importing dataset\nX\n[['France' 44.0 72000.0]\n ['Spain' 27.0 48000.0]\n ['Germany' 30.0 54000.0]\n ['Spain' 38.0 61000.0]\n ['Germany' 40.0 nan]\n ['France' 35.0 58000.0]\n ['Spain' nan 52000.0]\n ['France' 48.0 79000.0]\n ['Germany' 50.0 83000.0]\n ['France' 37.0 67000.0]]\nY\n['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']\n[[44.0 72000.0]\n [27.0 48000.0]\n [30.0 54000.0]\n [38.0 61000.0]\n [40.0 nan]\n [35.0 58000.0]\n [nan 52000.0]\n [48.0 79000.0]\n [50.0 83000.0]\n [37.0 67000.0]]\n"
]
}
],
"source": [
"dataset = pd.read_csv('../datasets/Data.csv')\n",
"# All columns except the last one\n",
"X = dataset.iloc[ : , :-1].values\n",
"# Take the last column\n",
"Y = dataset.iloc[ : , 3].values\n",
"print(\"Step 2: Importing dataset\")\n",
"print(\"X\")\n",
"print(X)\n",
"print(\"Y\")\n",
"print(Y)\n",
"print(X[ : , 1:3])"
]
},
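{
"cell_type": "markdown",
"metadata": {},
"source": [
"The iloc calls above select columns by position. Assuming the csv header names the columns Country, Age, Salary, and Purchased (the names are not shown in this notebook, so this is an assumption; check dataset.columns), the same matrices can be built by name, which is often easier to read. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column names below are assumed from the csv header; verify with dataset.columns\n",
"print(dataset.columns)\n",
"X_by_name = dataset[['Country', 'Age', 'Salary']].values\n",
"Y_by_name = dataset['Purchased'].values\n",
"print(X_by_name)\n",
"print(Y_by_name)"
]
},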
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Handling missing data\n",
"The data we get is rarely complete. Values can be missing for all sorts of reasons, and they need to be handled so that they do not degrade the performance of the machine learning model. We can replace missing values with the mean or the median of the whole column. We use the SimpleImputer class from sklearn.impute for this task (the older Imputer class from sklearn.preprocessing was removed in scikit-learn 0.22)."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"---------------------\nStep 3: Handling the missing data\nX\n[['France' 44.0 72000.0]\n ['Spain' 27.0 48000.0]\n ['Germany' 30.0 54000.0]\n ['Spain' 38.0 61000.0]\n ['Germany' 40.0 63777.77777777778]\n ['France' 35.0 58000.0]\n ['Spain' 38.77777777777778 52000.0]\n ['France' 48.0 79000.0]\n ['Germany' 50.0 83000.0]\n ['France' 37.0 67000.0]]\n"
]
}
],
"source": [
"# If you use an older version of sklearn (before 0.22), use the commented-out Imputer lines instead\n",
"from sklearn.impute import SimpleImputer\n",
"imputer = SimpleImputer(missing_values=np.nan, strategy=\"mean\")\n",
"#from sklearn.preprocessing import Imputer\n",
"# axis=0 means the imputation is performed column-wise\n",
"#imputer = Imputer(missing_values = \"NaN\", strategy = \"mean\", axis = 0)\n",
"imputer = imputer.fit(X[ : , 1:3]) # fit the imputer on the columns we want to process\n",
"X[ : , 1:3] = imputer.transform(X[ : , 1:3]) # replace each np.nan with the column mean\n",
"print(\"---------------------\")\n",
"print(\"Step 3: Handling the missing data\")\n",
"print(\"X\")\n",
"print(X)"
]
},
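{
"cell_type": "markdown",
"metadata": {},
"source": [
"The text above mentions that the median can be used instead of the mean. A minimal sketch of the alternative strategies, not part of the original flow (it rebuilds the numeric columns from the data frame, so X is left untouched):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.impute import SimpleImputer\n",
"\n",
"# Sketch: impute the numeric Age/Salary columns with the median instead of the mean\n",
"num = dataset.iloc[ : , 1:3].values\n",
"imp_median = SimpleImputer(missing_values=np.nan, strategy=\"median\")\n",
"print(imp_median.fit_transform(num))\n",
"# Other built-in strategies: \"most_frequent\" and \"constant\" (with fill_value=...)"
]
},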
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Encoding categorical data\n",
"Categorical data are variables that contain label values rather than numeric values, and the set of possible values is usually fixed. Labels such as \"Yes\" and \"No\" cannot be used in a model's mathematical equations, so they need to be encoded as numbers. To achieve this we import the LabelEncoder class from the sklearn.preprocessing library."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"['France' 'Spain' 'Germany' 'Spain' 'Germany' 'France' 'Spain' 'France'\n 'Germany' 'France']\n[0 2 1 2 1 0 2 0 1 0]\n[[0 44.0 72000.0]\n [2 27.0 48000.0]\n [1 30.0 54000.0]\n [2 38.0 61000.0]\n [1 40.0 63777.77777777778]\n [0 35.0 58000.0]\n [2 38.77777777777778 52000.0]\n [0 48.0 79000.0]\n [1 50.0 83000.0]\n [0 37.0 67000.0]]\n---------------------\nStep 4: Encoding categorical data\nX\n[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]\n [0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n [0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]\n [0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]\n [0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]\n [1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]\n [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]\n [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]\n [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]\n [1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]\nY\n[0 1 0 0 1 1 0 1 0 1]\n"
]
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
"labelencoder_X = LabelEncoder()\n",
"print(X[ : , 0])\n",
"X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0])\n",
"print(X[ : , 0])\n",
"# Creating dummy variables\n",
"# Note: without a column selector, OneHotEncoder encodes every column, including\n",
"# the numeric Age and Salary columns; that is why X below ends up with 23 dummy\n",
"# columns. Encoding only the country column is sketched after this cell.\n",
"onehotencoder = OneHotEncoder()\n",
"print(X)\n",
"X = onehotencoder.fit_transform(X).toarray()\n",
"labelencoder_Y = LabelEncoder()\n",
"Y = labelencoder_Y.fit_transform(Y)\n",
"print(\"---------------------\")\n",
"print(\"Step 4: Encoding categorical data\")\n",
"print(\"X\")\n",
"print(X)\n",
"print(\"Y\")\n",
"print(Y)"
]
},
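{
"cell_type": "markdown",
"metadata": {},
"source": [
"A cleaner way to create the dummy variables, sketched here under the assumption that only column 0 (the country) is categorical, is ColumnTransformer from sklearn.compose: it one-hot encodes the selected column and passes Age and Salary through unchanged. This cell rebuilds the features from the data frame and is not wired into the 23-column X used below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"# Rebuild the raw features; the missing Age/Salary values would still need Step 3's imputation\n",
"X_raw = dataset.iloc[ : , :-1].values\n",
"# One-hot encode only column 0 (country); keep the remaining columns as they are\n",
"ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')\n",
"print(ct.fit_transform(X_raw))"
]
},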
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Splitting the dataset into training and test sets\n",
"We split the dataset into two parts: a training set used to train the model and a test set used to validate it. The split ratio is usually 80:20. We import the train_test_split() method from the sklearn.model_selection library."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"---------------------\nStep 5: Splitting the datasets into training sets and Test sets\nX_train\n[[0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]\n [1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]\n [0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]\n [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]\n [0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]\n [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]\n [1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]\nX_test\n[[0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]\n [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]\nY_train\n[1 1 1 0 1 0 0 1]\nY_test\n[0 0]\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)\n",
"print(\"---------------------\")\n",
"print(\"Step 5: Splitting the datasets into training sets and Test sets\")\n",
"print(\"X_train\")\n",
"print(X_train)\n",
"print(\"X_test\")\n",
"print(X_test)\n",
"print(\"Y_train\")\n",
"print(Y_train)\n",
"print(\"Y_test\")\n",
"print(Y_test)"
]
},
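{
"cell_type": "markdown",
"metadata": {},
"source": [
"random_state=0 makes the split reproducible. For a classification target such as Y it can also be worth stratifying the split so that the Yes/No proportions are preserved in both sets. A minimal sketch of this option (not part of the original flow):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Sketch: stratify=Y keeps the class ratio of Y in both subsets\n",
"X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0, stratify=Y)\n",
"print(Y_tr)\n",
"print(Y_te)"
]
},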
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Feature Scaling\n",
"Most model algorithms rely on the Euclidean distance between two data points, but raw features vary widely in magnitude, unit, and range. Features with large magnitudes carry far more weight in the distance computation than features with small magnitudes. This is solved by feature standardization (Z-score normalization). We import the StandardScaler class from the sklearn.preprocessing library."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"---------------------\nStep 6: Feature Scaling\nX_train\n[[-1. 2.64575131 -0.77459667 -0.37796447 0. -0.37796447\n -0.37796447 -0.37796447 -0.37796447 2.64575131 -0.37796447 -0.37796447\n 0. -0.37796447 -0.37796447 0. -0.37796447 -0.37796447\n 2.64575131 -0.37796447 -0.37796447 -0.37796447 0. ]\n [ 1. -0.37796447 -0.77459667 -0.37796447 0. -0.37796447\n 2.64575131 -0.37796447 -0.37796447 -0.37796447 -0.37796447 -0.37796447\n 0. -0.37796447 -0.37796447 0. -0.37796447 -0.37796447\n -0.37796447 2.64575131 -0.37796447 -0.37796447 0. ]\n [-1. -0.37796447 1.29099445 2.64575131 0. -0.37796447\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 -0.37796447 -0.37796447\n 0. 2.64575131 -0.37796447 0. -0.37796447 -0.37796447\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 0. ]\n [-1. -0.37796447 1.29099445 -0.37796447 0. -0.37796447\n -0.37796447 -0.37796447 2.64575131 -0.37796447 -0.37796447 -0.37796447\n 0. -0.37796447 2.64575131 0. -0.37796447 -0.37796447\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 0. ]\n [ 1. -0.37796447 -0.77459667 -0.37796447 0. -0.37796447\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 -0.37796447 2.64575131\n 0. -0.37796447 -0.37796447 0. -0.37796447 -0.37796447\n -0.37796447 -0.37796447 -0.37796447 2.64575131 0. ]\n [-1. -0.37796447 1.29099445 -0.37796447 0. -0.37796447\n -0.37796447 2.64575131 -0.37796447 -0.37796447 -0.37796447 -0.37796447\n 0. -0.37796447 -0.37796447 0. -0.37796447 2.64575131\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 0. ]\n [ 1. -0.37796447 -0.77459667 -0.37796447 0. -0.37796447\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 2.64575131 -0.37796447\n 0. -0.37796447 -0.37796447 0. -0.37796447 -0.37796447\n -0.37796447 -0.37796447 2.64575131 -0.37796447 0. ]\n [ 1. -0.37796447 -0.77459667 -0.37796447 0. 2.64575131\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 -0.37796447 -0.37796447\n 0. -0.37796447 -0.37796447 0. 2.64575131 -0.37796447\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 0. ]]\nX_test\n[[-1. 2.64575131 -0.77459667 -0.37796447 1. -0.37796447\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 -0.37796447 -0.37796447\n 0. -0.37796447 -0.37796447 1. -0.37796447 -0.37796447\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 0. ]\n [-1. 2.64575131 -0.77459667 -0.37796447 0. -0.37796447\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 -0.37796447 -0.37796447\n 1. -0.37796447 -0.37796447 0. -0.37796447 -0.37796447\n -0.37796447 -0.37796447 -0.37796447 -0.37796447 1. ]]\n"
]
}
],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"sc_X = StandardScaler()\n",
"# Fit the scaler on the training set only, then apply the same transformation to the test set\n",
"X_train = sc_X.fit_transform(X_train)\n",
"X_test = sc_X.transform(X_test)\n",
"print(\"---------------------\")\n",
"print(\"Step 6: Feature Scaling\")\n",
"print(\"X_train\")\n",
"print(X_train)\n",
"print(\"X_test\")\n",
"print(X_test)"
]
},
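{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the scaling (a sketch, not in the original notebook): after fit_transform, every non-constant column of X_train should have mean close to 0 and standard deviation close to 1, while the all-zero dummy columns simply stay at 0. X_test, which only went through transform using the training statistics, generally will not be exactly standardized:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column-wise statistics of the standardized training data\n",
"print(X_train.mean(axis=0).round(6))\n",
"print(X_train.std(axis=0).round(6))"
]
},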
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>For the complete project, see the GitHub repository <a href=\"https://github.com/MachineLearning100/100-Days-Of-ML-Code\">100-Days-Of-ML-Code</a>. Suggestions and comments are welcome in the issues!</b>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.3 64-bit (conda)",
"metadata": {
"interpreter": {
"hash": "1b78ff499ec469310b6a6795c4effbbfc85eb20a6ba0cf828a15721670711b2c"
}
}
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3-final"
}
},
"nbformat": 4,
"nbformat_minor": 2
}