diff --git a/Code/Day 1_Data Preprocessing.md b/Code/Day 1_Data Preprocessing.md
new file mode 100644
index 0000000..3508943
--- /dev/null
+++ b/Code/Day 1_Data Preprocessing.md
@@ -0,0 +1,52 @@
+# Data Preprocessing
+
+<!-- Day 1 infographic: the six-step data preprocessing flow -->
+
+As shown in the infographic, data preprocessing is done in six steps.
+This example uses the following [dataset](https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/datasets/Data.csv).
+
+## Step 1: Importing the libraries
+```python
+import numpy as np
+import pandas as pd
+```
+## Step 2: Importing the dataset
+```python
+dataset = pd.read_csv('Data.csv')
+X = dataset.iloc[:, :-1].values   # features: every column except the last
+Y = dataset.iloc[:, 3].values     # label: the fourth column
+```
+## Step 3: Handling missing data
+```python
+from sklearn.preprocessing import Imputer
+# Replace missing values (NaN) in the numeric columns with the column mean
+imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
+imputer = imputer.fit(X[:, 1:3])
+X[:, 1:3] = imputer.transform(X[:, 1:3])
+```
+## Step 4: Encoding categorical data
+```python
+from sklearn.preprocessing import LabelEncoder, OneHotEncoder
+labelencoder_X = LabelEncoder()
+X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
+```
+### Creating dummy variables
+```python
+onehotencoder = OneHotEncoder(categorical_features=[0])
+X = onehotencoder.fit_transform(X).toarray()
+labelencoder_Y = LabelEncoder()
+Y = labelencoder_Y.fit_transform(Y)
+```
+## Step 5: Splitting the dataset into the training set and test set
+```python
+from sklearn.cross_validation import train_test_split
+X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
+```
+## Step 6: Feature scaling
+```python
+from sklearn.preprocessing import StandardScaler
+sc_X = StandardScaler()
+X_train = sc_X.fit_transform(X_train)
+X_test = sc_X.transform(X_test)   # reuse the scaler fitted on the training set only
+```
\ No newline at end of file
diff --git a/README.md b/README.md
index a4dedb3..23a16f8 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 For the original English version, please see [Avik-Jain](https://github.com/Avik-Jain/100-Days-Of-ML-Code).
-## Data Preprocessing | Day 1
+## Data Preprocessing | [Day 1]()
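The Day 1 snippets above follow the scikit-learn API the upstream tutorial was written against: `Imputer`, `sklearn.cross_validation`, and `OneHotEncoder(categorical_features=...)` were all removed in later scikit-learn releases. Below is a minimal sketch of the same six steps against the current API, assuming the `Data.csv` layout implied by the snippets (one categorical feature column, two numeric columns with missing values, and a categorical label); `SimpleImputer`, `ColumnTransformer`, and `sklearn.model_selection.train_test_split` are the replacements used here, and the column indices are carried over unchanged as an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values   # assumed feature layout: categorical, numeric, numeric
Y = dataset.iloc[:, -1].values    # assumed label: last column

# Step 3: fill missing values in the numeric columns (indices 1-2) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

# Step 4: one-hot encode the categorical column 0 and pass the numeric columns through;
# ColumnTransformer replaces the removed categorical_features argument, and
# sparse_threshold=0 keeps the output dense so StandardScaler can center it later
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0
)
X = ct.fit_transform(X)
Y = LabelEncoder().fit_transform(Y)

# Step 5: train_test_split now lives in sklearn.model_selection
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Step 6: fit the scaler on the training split only, then apply it to the test split
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
```

As in Step 6 of the file above, the scaler is fitted on the training split and only applied to the test split, so no test-set statistics leak into preprocessing.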