First published in the column 快乐机器学习 (Happy Machine Learning)

Data Mining (1): EDA, Exploratory Data Analysis

毛小伟

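EDA (Exploratory Data Analysis) means getting to know a dataset: understanding how the variables relate to each other and to the prediction target, so that later feature engineering and modeling go more smoothly. It is a very important step in data mining.

Tools needed: data science libraries (pandas, numpy, scipy) and visualization libraries (matplotlib, seaborn)

The analysis roughly follows the numbered steps below.

Unfamiliar terms are covered in the references at the end of the article.

Below I use the Tianchi used-car price prediction competition as an example and walk through these steps.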

# package imports
import gc
import pandas as pd
import numpy as np 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn import metrics

from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.model_selection import train_test_split
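# Global settings
pd.set_option('display.max_columns', None)  # show every column when displaying a DataFrame
sns.set()  # apply seaborn's default plot theme
# The competition files are space-separated
data_train = pd.read_csv('used_car_train_20200313.csv', sep=' ')
data_testA = pd.read_csv('used_car_testA_20200313.csv', sep=' ')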

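1. Get an overview of the dataset

Use pandas methods such as .head() to see what the rows look like, .info() to check the dtypes and non-null counts, and .describe() to get summary statistics such as the min, max, mean, and standard deviation. These statistics also show whether the features sit on very different scales. In regression-type problems the scales should be kept consistent, which is usually done by normalizing the variables.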

pd.concat([data_train.head(), data_train.tail()])  # DataFrame.append was removed in pandas 2.0
data_train.describe()
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
data_testA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           48587 non-null  float64
 6   fuelType           47107 non-null  float64
 7   gearbox            48090 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  50000 non-null  object 
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
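
The normalization mentioned at the start of this section can be sketched as follows. This is a minimal illustration on a small made-up DataFrame. The values and the helper names min_max_scale and standardize are assumptions for illustration, not part of the competition code. In a real pipeline the statistics would normally be computed on the training set and reused on the test set, which is why each helper takes a separate reference frame.

```python
import pandas as pd

# Toy data standing in for two features on very different scales (made-up values)
toy = pd.DataFrame({
    "power": [60.0, 110.0, 163.0, 193.0],
    "kilometer": [12.5, 15.0, 12.5, 15.0],
})

def min_max_scale(df, ref):
    """Rescale each column of df to [0, 1] using the min and max of ref."""
    return (df - ref.min()) / (ref.max() - ref.min())

def standardize(df, ref):
    """Shift and scale each column of df using the mean and std of ref."""
    return (df - ref.mean()) / ref.std()

toy_minmax = min_max_scale(toy, toy)   # every column now spans 0..1
toy_std = standardize(toy, toy)        # every column now has mean 0 and std 1
```

Min-max scaling keeps the shape of the distribution but is sensitive to extreme values. Standardization is usually the safer default when outliers are present.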

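2. Check for missing values and anomalies

Combining a look at the data with the field descriptions provided by the competition, we can split the features into three groups: date features, categorical features, and numeric features. Then, for each feature, look at the missing rate, the number of distinct values, and any anomalous values (left untreated, outliers can distort the fit and may lead to overfitting).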

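# Splitting by dtype only works when the data has not been anonymized. Here it has: name, for example,
# is really a categorical variable stored as an integer. So the features have to be split by hand.
#numeric_features = data_train.select_dtypes(include=[np.number])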
date_cols = ['regDate', 'creatDate']
cate_cols = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode', 'seller', 'offerType']
num_cols = ['power', 'kilometer'] + ['v_{}'.format(i) for i in range(15)]
data = pd.concat([data_train, data_testA])
cols = date_cols + cate_cols + num_cols

tmp = pd.DataFrame()
tmp['count'] = data[cols].count().values
tmp['missing_rate'] = (data.shape[0] - tmp['count']) / data.shape[0]
tmp['nunique'] = data[cols].nunique().values
tmp.index = cols
tmp
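
As a cross-check, the same summary can be built more compactly, because the mean of a boolean isnull() mask is exactly the missing rate. The sketch below runs on a tiny made-up frame rather than the competition data, and the names toy and summarize are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks a missing entry
toy = pd.DataFrame({
    "bodyType": [0.0, 1.0, np.nan, 2.0],
    "gearbox": [0.0, np.nan, np.nan, 1.0],
    "brand": [4, 4, 10, 1],
})

def summarize(df):
    """Return non-null count, missing rate and number of distinct values per column."""
    return pd.DataFrame({
        "count": df.count(),
        "missing_rate": df.isnull().mean(),
        "nunique": df.nunique(),   # NaN is not counted as a distinct value
    })

summary = summarize(toy)
```

Because the three Series share the column names as their index, no manual index assignment is needed.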

# Use a violin plot to look for outliers, taking price as an example (on a log scale)
sns.violinplot(x=np.log(data_train['price']))
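
A violin plot shows outliers only visually. One common numeric complement is the 1.5 x IQR rule, sketched here on made-up prices. The rule, the factor 1.5, and the helper name iqr_outlier_mask are conventional choices assumed for illustration, not something the competition prescribes.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(s, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Made-up prices with one extreme value
prices = pd.Series([1200, 1500, 1800, 2100, 2500, 3000, 99999])

# Apply the rule on the log scale, matching the log-transformed violin plot
mask = iqr_outlier_mask(np.log(prices))
outliers = prices[mask]
```

Whether flagged rows should be dropped, clipped, or kept depends on the model and on whether the values are plausible for the domain.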

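3. Look at the distribution of the target

The target variable should be as close to Gaussian as possible. Many methods rely on normality assumptions. For example, the classical inference for ordinary least squares in linear regression assumes normally distributed errors, and a heavily skewed target tends to violate that. The two plots below show the target before and after a log transform.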

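# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(data_train['price'], kde=True)

sns.histplot(np.log(data_train['price']), kde=True)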
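
To make the transform concrete, here is a small sketch on synthetic right-skewed data, where a log-normal sample stands in for price. The seed and distribution parameters are arbitrary assumptions. The sketch uses np.log1p and np.expm1, a common variant that also tolerates zeros, whereas the plots in this section use plain np.log. A model trained on the transformed target predicts in log space, so its predictions have to be mapped back.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for the price column
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000))

log_price = np.log1p(price)      # transformed target, used for training
restored = np.expm1(log_price)   # inverse transform, applied to predictions

skew_before = price.skew()       # strongly positive for the raw values
skew_after = log_price.skew()    # close to 0 after the transform
```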

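4. Analyze numeric features

Check skewness and kurtosis

If the training and test sets are distributed differently, a distribution transform should be considered. Fortunately the distributions here are very consistent, so no transform is needed.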

tmp = pd.DataFrame(index = num_cols)
for col in num_cols:
    tmp.loc[col, 'train_Skewness'] = data_train[col].skew()
    tmp.loc[col, 'test_Skewness'] = data_testA[col].skew()
    tmp.loc[col, 'train_Kurtosis'] = data_train[col].kurt()
    tmp.loc[col, 'test_Kurtosis'] = data_testA[col].kurt()
tmp
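
To help read this table, the sketch below shows what the two statistics look like on synthetic samples. The distributions, seed, and sample size are assumptions chosen for illustration. pandas .skew() and .kurt() return bias-corrected sample estimates, and .kurt() reports excess kurtosis (Fisher's definition), so a normal sample gives values near 0 for both.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000

samples = pd.DataFrame({
    "normal": rng.normal(size=n),                   # symmetric reference shape
    "right_skewed": rng.lognormal(size=n),          # long right tail
    "heavy_tailed": rng.standard_t(df=5, size=n),   # symmetric, heavier tails than normal
})

# One row per sample, one column per statistic
stats = pd.DataFrame({
    "skewness": samples.skew(),
    "kurtosis": samples.kurt(),
})
```

Positive skewness means a longer right tail, and positive excess kurtosis means heavier tails than a normal distribution. Large gaps between the train and test values of either statistic are a hint that the two sets are distributed differently.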

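5. Analyze categorical features

Check the number of categories and how they are distributed

Features whose categories are extremely imbalanced can be dropped, for example seller and offerType

Features with anomalous values need to be handled, for example notRepairedDamage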

for col in cate_cols:
    col
    data_train[col].value_counts()

'name'
708       282
387       282
55        280
1541      263
203       233
         ... 
5074        1
7123        1
11221       1
13270       1
174485      1
Name: name, Length: 99662, dtype: int64

'model'
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
         ...  
245.0        2
209.0        2
240.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64

'brand'
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64

'bodyType'
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
'fuelType'
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64

'gearbox'
0.0    111623
1.0     32396
Name: gearbox, dtype: int64

'notRepairedDamage'
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

'regionCode'
419     369
764     258
125     137
176     136
462     134
       ... 
6414      1
7063      1
4239      1
5931      1
7267      1
Name: regionCode, Length: 7905, dtype: int64

'seller'
0    149999
1         1
Name: seller, dtype: int64

'offerType'
0    150000
Name: offerType, dtype: int64
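
One possible way to act on these observations is sketched below on a tiny made-up frame. Treating '-' as a missing value and dropping near-constant columns are assumptions about reasonable handling, not steps carried out in this article. The 0.75 threshold only suits the four-row toy data; on the real data a threshold much closer to 1 would be appropriate.

```python
import pandas as pd

# Made-up rows mimicking the columns discussed in this section
toy = pd.DataFrame({
    "notRepairedDamage": ["0.0", "-", "1.0", "0.0"],
    "seller": [0, 0, 0, 1],
    "offerType": [0, 0, 0, 0],
    "brand": [0, 4, 14, 0],
})

# '-' is a placeholder, not a category: coerce it to NaN while converting to float
toy["notRepairedDamage"] = pd.to_numeric(toy["notRepairedDamage"], errors="coerce")

# Share of the most frequent value per column; values near 1 carry almost no signal
top_share = toy.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])
skewed_cols = top_share[top_share >= 0.75].index.tolist()

cleaned = toy.drop(columns=skewed_cols)
```

Note that errors="coerce" turns every unparseable entry into NaN, so it is worth checking value_counts() first to confirm that '-' is the only non-numeric value.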

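Conclusion

At this point we have an overall picture of the data. For example, we learned that v_12, kilometer, and v_3 are among the more important features, and that the anonymous features are fairly close to normally distributed. We also know which columns contain missing values and anomalies, which is a great help for the later work.

Discussion and corrections are welcome~

References:

Edited 2020-03-25 11:38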
