数据规整(数据预处理,数据清洗)
数据规整的一般分类:
- 清理
- 转换
- 合并
- 重塑
Pandas数据规整-清理:
对指定数据(如缺失数据、重复数据)进行处理(检查、替换、删除)
- 缺失值的表示:np.nan
- 检查缺失值:isnull(),notnull(),info()
- 删除缺失值:dropna()
- 填充缺失值:fillna()
- 替换值(填充缺失值是替换值的一种情况):replace()
- 移除重复数据
- 检测和过滤异常值
In [1]:
import pandas as pd
import numpy as np
Pandas缺失数据处理
缺失值的表示
In [2]:
a = np.array([2, 4, 8, 10, 12])
a
Out[2]:
array([ 2, 4, 8, 10, 12])
In [3]:
a + 10
Out[3]:
array([12, 14, 18, 20, 22])
Python原生缺失值表示:None
运算直接报错
In [4]:
b = np.array([2, 4, None, 10, 12])
b
Out[4]:
array([2, 4, None, 10, 12], dtype=object)
In [5]:
# b + 10 # 缺失值导致计算报错
使用Numpy的缺失值数据类型:np.nan
缺失值运算不会报错,和缺失值进行运算,结果还是缺失值
In [6]:
c = np.array([2, 4, np.nan, 10, 12])
c
Out[6]:
array([ 2., 4., nan, 10., 12.])
In [7]:
c + 10
Out[7]:
array([12., 14., nan, 20., 22.])
In [8]:
c.sum() # 任何数组和缺失值计算,结果还是缺失值
np.sum(c)
Out[8]:
nan
nan专有运算方法,会跳过缺失值,直接计算正常值
In [9]:
np.nansum(c)
np.nanmean(c)
Out[9]:
7.0
In [10]:
(2 + 4 + 10 + 12) / 4 # 除以4, 缺失值不参与运算
Out[10]:
7.0
使用Pandas缺失值计算
Pandas中,不论缺失值是 None 还是 np.nan ,都会被转化为 NaN 的形式
NaN:非数字,not a number,Pandas中它表示缺失或NA值,便于被检测出来
本质上就是np.nan
Pandas的对象可以跳过缺失值直接进行运算
In [11]:
b = pd.Series([1,2,np.nan,4,None,6])
b
Out[11]:
0 1.0
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [12]:
b + 10
Out[12]:
0 11.0
1 12.0
2 NaN
3 14.0
4 NaN
5 16.0
dtype: float64
In [13]:
b.sum()
b.mean()
Out[13]:
3.25
In [14]:
# 缺失值赋值
b[0] = np.nan
b
Out[14]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [15]:
c = pd.DataFrame([[1,np.nan,3], [4,5,6], [np.nan,8,9]])
c
Out[15]:
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 3 |
1 | 4.0 | 5.0 | 6 |
2 | NaN | 8.0 | 9 |
In [16]:
c.sum()
Out[16]:
0 5.0
1 13.0
2 18.0
dtype: float64
通过函数检查数据中是否含有缺失值
检查单个空值
单个空值,底层类型为 np.nan,不能直接比较是否相同
In [17]:
3 == 3
2.5 == 2.5
np.nan == np.nan # 缺失值不能自己比较
Out[17]:
False
单个np.nan空值,可以用 np.isnan() 方法判断是否是空值
In [18]:
np.isnan(np.nan)
np.isnan(123)
Out[18]:
False
整体判断,表格中各列缺失值情况
In [19]:
c.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
0 2 non-null float64
1 2 non-null float64
2 3 non-null int64
dtypes: float64(2), int64(1)
memory usage: 152.0 bytes
isnull()和notnull()
- isnull():缺失值返回True,正常值返回False
- notnull():正常值返回True,缺失值返回False
In [20]:
b
Out[20]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [21]:
b.isnull()
Out[21]:
0 True
1 False
2 True
3 False
4 True
5 False
dtype: bool
In [22]:
-(b.isnull()) # 非运算,缺失值返回False
Out[22]:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
In [23]:
b.notnull()
Out[23]:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
返回所有正常值
手动过滤Series的缺失值
In [24]:
b
Out[24]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [25]:
b.notnull()
Out[25]:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
In [26]:
b[b.notnull()]
Out[26]:
1 2.0
3 4.0
5 6.0
dtype: float64
DataFrame不能通过布尔查询方式过滤缺失值,必须使用Pandas的特定方法过滤
查到缺失值后,Series可以直接过滤,DataFrame需要进一步处理
In [27]:
c
Out[27]:
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 3 |
1 | 4.0 | 5.0 | 6 |
2 | NaN | 8.0 | 9 |
In [28]:
c.isnull()
c[c.isnull()]
Out[28]:
0 | 1 | 2 | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
In [29]:
c.notnull()
c[c.notnull()]
Out[29]:
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 3 |
1 | 4.0 | 5.0 | 6 |
2 | NaN | 8.0 | 9 |
去除缺失值,只保留有效值
dropna()函数
In [30]:
b
Out[30]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [31]:
b.dropna()
Out[31]:
1 2.0
3 4.0
5 6.0
dtype: float64
In [32]:
# 等同于
b[b.notnull()]
Out[32]:
1 2.0
3 4.0
5 6.0
dtype: float64
In [33]:
c
Out[33]:
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 3 |
1 | 4.0 | 5.0 | 6 |
2 | NaN | 8.0 | 9 |
In [34]:
c.dropna() # 默认删除缺失值所在行
c.dropna(axis=0)
Out[34]:
0 | 1 | 2 | |
---|---|---|---|
1 | 4.0 | 5.0 | 6 |
In [35]:
c.dropna(axis=1) # 按列删除
Out[35]:
2 | |
---|---|
0 | 3 |
1 | 6 |
2 | 9 |
In [36]:
# 增加一列全部为缺失值的数据
c[3] = np.nan
c
Out[36]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [37]:
# 行或列,有1个缺失值即删除
c.dropna(axis=1)
c.dropna(axis=1, how='any')
Out[37]:
2 | |
---|---|
0 | 3 |
1 | 6 |
2 | 9 |
In [38]:
# 行或列必须全部都是缺失值才删
c.dropna(axis=1, how='all')
Out[38]:
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 3 |
1 | 4.0 | 5.0 | 6 |
2 | NaN | 8.0 | 9 |
根据行或列的非缺失值数量衡量删除与否
In [39]:
c
Out[39]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [40]:
c.dropna()
Out[40]:
0 | 1 | 2 | 3 |
---|
In [41]:
c.dropna(thresh=3) # 行非缺失值数量大于等于3个,保留
Out[41]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
1 | 4.0 | 5.0 | 6 | NaN |
填充缺失值
缺失值问题除了删除所在行列以外,还可以通过填充值解决
fillna()函数参数
In [42]:
b
Out[42]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [43]:
b.fillna(0) # 缺失值填充为0
Out[43]:
0 0.0
1 2.0
2 0.0
3 4.0
4 0.0
5 6.0
dtype: float64
In [44]:
b.fillna(b.mean()) # 缺失值填充为平均数
Out[44]:
0 4.0
1 2.0
2 4.0
3 4.0
4 4.0
5 6.0
dtype: float64
前向填充和后向填充
- method='ffill'
- method='bfill'
In [45]:
b
Out[45]:
0 NaN
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64
In [46]:
b.mean()
Out[46]:
4.0
前向填充:使用缺失值的前一个值填充
In [47]:
b.fillna(method='ffill')
Out[47]:
0 NaN
1 2.0
2 2.0
3 4.0
4 4.0
5 6.0
dtype: float64
后向填充,使用缺失值的后一个值填充
In [48]:
b.fillna(method='bfill')
Out[48]:
0 2.0
1 2.0
2 4.0
3 4.0
4 6.0
5 6.0
dtype: float64
In [49]:
c
Out[49]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [50]:
c.fillna(method='ffill') # 前向填充,按行
Out[50]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | 4.0 | 8.0 | 9 | NaN |
In [51]:
c.fillna(method='bfill') # 后向填充,按行
Out[51]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | 5.0 | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [52]:
c.fillna(method='ffill', axis=1) # 按列,前向填充
Out[52]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | 1.0 | 3.0 | 3.0 |
1 | 4.0 | 5.0 | 6.0 | 6.0 |
2 | NaN | 8.0 | 9.0 | 9.0 |
给各列分别填充不同值
In [53]:
c
Out[53]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [54]:
c.fillna({0: 111, 1: 222, 2: 333, 3: 444})
Out[54]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | 222.0 | 3 | 444.0 |
1 | 4.0 | 5.0 | 6 | 444.0 |
2 | 111.0 | 8.0 | 9 | 444.0 |
上面一切删除、填充操作都没有修改原变量
修改原值参数:inplace=True
In [55]:
c
Out[55]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3 | NaN |
1 | 4.0 | 5.0 | 6 | NaN |
2 | NaN | 8.0 | 9 | NaN |
In [56]:
# c.fillna(100)
c.fillna(100, inplace=True)
In [57]:
c
Out[57]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | 100.0 | 3 | 100.0 |
1 | 4.0 | 5.0 | 6 | 100.0 |
2 | 100.0 | 8.0 | 9 | 100.0 |
连续填充数量
In [58]:
c.loc[3] = np.nan
c.loc[0, 1] = np.nan
c.loc[1:3,0] = np.nan
c[3] = np.nan
c
Out[58]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3.0 | NaN |
1 | NaN | 5.0 | 6.0 | NaN |
2 | NaN | 8.0 | 9.0 | NaN |
3 | NaN | NaN | NaN | NaN |
In [59]:
c.fillna(method='ffill') # 默认全部填充
Out[59]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3.0 | NaN |
1 | 1.0 | 5.0 | 6.0 | NaN |
2 | 1.0 | 8.0 | 9.0 | NaN |
3 | 1.0 | 8.0 | 9.0 | NaN |
In [60]:
c.fillna(method='ffill', limit=2) # 设置填充多少行或列
Out[60]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 3.0 | NaN |
1 | 1.0 | 5.0 | 6.0 | NaN |
2 | 1.0 | 8.0 | 9.0 | NaN |
3 | NaN | 8.0 | 9.0 | NaN |
替换值
利用fillna方法填充缺失数据是值替换的一种特殊情况, replace方法用作替换值更简单、更灵活
In [61]:
data = pd.Series([1,-999,2,-999,-1000,3])
data
Out[61]:
0 1
1 -999
2 2
3 -999
4 -1000
5 3
dtype: int64
In [62]:
# 替换单值
data.replace(-999, np.nan)
Out[62]:
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
In [63]:
# 替换多值,多个替换为1个
data.replace([-999, -1000], np.nan)
Out[63]:
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
In [64]:
# 多个值替换为不同数值
data.replace([-999, -1000], [0, 1])
data.replace({-999: 0, -1000: 1})
Out[64]:
0 1
1 0
2 2
3 0
4 1
5 3
dtype: int64
映射数据替换
map除了自定义函数运算,还是一种映射转换元素以及其他数据清理工作的便捷方式
In [65]:
a = pd.DataFrame([['鬃刷','皮带','煎蛋','观赏'], [10,20,30,40]]).T
a
Out[65]:
0 | 1 | |
---|---|---|
0 | 鬃刷 | 10 |
1 | 皮带 | 20 |
2 | 煎蛋 | 30 |
3 | 观赏 | 40 |
In [66]:
y = {'鬃刷': '猪', '皮带': '牛', '观赏': '鱼', '衣服': '棉花'}
y
Out[66]:
{'鬃刷': '猪', '皮带': '牛', '观赏': '鱼', '衣服': '棉花'}
In [67]:
a[0].map(y)
Out[67]:
0 猪
1 牛
2 NaN
3 鱼
Name: 0, dtype: object
移除重复数据
移除DataFrame的重复行
In [68]:
data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,'k2':[1,1,2,3,3,4,4]})
data
Out[68]:
k1 | k2 | |
---|---|---|
0 | one | 1 |
1 | one | 1 |
2 | one | 2 |
3 | two | 3 |
4 | two | 3 |
5 | two | 4 |
6 | two | 4 |
In [69]:
# 布尔型Series,各列重复值交集
data.duplicated()
Out[69]:
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
In [70]:
# 移除重复行
data.drop_duplicates()
data[-(data.duplicated())]
Out[70]:
k1 | k2 | |
---|---|---|
0 | one | 1 |
2 | one | 2 |
3 | two | 3 |
5 | two | 4 |
In [71]:
# 移除自定义列重复行
data.drop_duplicates('k1')
Out[71]:
k1 | k2 | |
---|---|---|
0 | one | 1 |
3 | two | 3 |
In [72]:
#keep : {‘first’, ‘last’, False},
# first默认留下第一次出现的值
data.drop_duplicates(['k1', 'k2'], keep='first')
Out[72]:
k1 | k2 | |
---|---|---|
0 | one | 1 |
2 | one | 2 |
3 | two | 3 |
5 | two | 4 |
In [73]:
# last,留下最后一次出现的值
data.drop_duplicates(['k1', 'k2'], keep='last')
Out[73]:
k1 | k2 | |
---|---|---|
1 | one | 1 |
2 | one | 2 |
4 | two | 3 |
6 | two | 4 |
In [74]:
# False,删掉所有重复值
data.drop_duplicates(['k1', 'k2'], keep=False)
Out[74]:
k1 | k2 | |
---|---|---|
2 | one | 2 |
移除重复索引值
In [75]:
obj = pd.Series(range(5), index = ['a','a','b','b','c'])
obj
Out[75]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
In [76]:
obj['a']
Out[76]:
a 0
a 1
dtype: int64
In [77]:
obj.index.duplicated()
Out[77]:
array([False, True, False, True, False])
In [78]:
obj.loc[~(obj.index.duplicated())]
Out[78]:
a 0
b 2
c 4
dtype: int64
检测和过滤异常值
过滤或变换异常值(outlier)在很大程度上就是运用数组运算
例子:一个含有正态分布数据的DataFrame
目标:将每列数据的范围控制在 -3 到 3 之间
In [79]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.loc[:10]
Out[79]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.056350 | -0.815325 | -1.739909 | -0.044069 |
1 | -0.342930 | 2.049838 | 0.571770 | 1.431593 |
2 | 0.155038 | -0.019762 | 0.365701 | -0.405128 |
3 | 0.896189 | -0.372501 | -0.289871 | 0.579180 |
4 | 0.619782 | -2.591634 | -1.116019 | 1.642877 |
5 | -1.558965 | 1.861874 | -0.620186 | 1.222442 |
6 | -1.404131 | -0.230156 | -1.057523 | 0.781039 |
7 | 0.329895 | -1.247459 | 1.360716 | 0.352324 |
8 | -2.796119 | -0.783126 | 0.071594 | 0.071109 |
9 | -0.638204 | 0.348094 | -0.784782 | 0.304675 |
10 | -0.125829 | -1.645295 | 0.080631 | 0.105201 |
In [80]:
# 查看数据整体指标
data.describe()
Out[80]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | -0.039946 | -0.020816 | 0.061418 | 0.001750 |
std | 1.025141 | 0.989867 | 0.999929 | 0.995840 |
min | -3.372125 | -3.244717 | -3.219369 | -2.866783 |
25% | -0.757997 | -0.730113 | -0.626054 | -0.698372 |
50% | -0.048779 | -0.029766 | 0.075186 | 0.004677 |
75% | 0.644870 | 0.623295 | 0.744526 | 0.663374 |
max | 3.512715 | 2.932729 | 2.903664 | 3.288854 |
In [81]:
# 找出某列中绝对值大于3的值
data[2][(data[2] > 3) | (data[2] < -3)]
data[2][np.abs(data[2]) > 3]
Out[81]:
87 -3.141451
631 -3.219369
Name: 2, dtype: float64
In [82]:
# 找出全部绝对值大于3的值所在的行
data[np.abs(data) > 3]
data[np.abs(data) > 3].any(axis=1)
data[(np.abs(data) > 3).any(1)]
Out[82]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
87 | -0.159197 | 0.751127 | -3.141451 | -0.092924 |
88 | 3.032059 | 1.106646 | -0.762529 | 0.109455 |
228 | 0.313696 | 1.587848 | 1.495230 | 3.288854 |
246 | 3.061156 | -1.015008 | -0.965786 | -0.535780 |
310 | 0.900701 | -0.860197 | 2.287573 | 3.003777 |
567 | 3.150045 | -1.160207 | 2.412660 | -0.850230 |
631 | -0.658641 | -0.355802 | -3.219369 | -0.667448 |
703 | -0.813356 | -3.244717 | 0.996636 | 1.243210 |
726 | -3.372125 | 0.621298 | -0.264892 | 0.716488 |
806 | 3.512715 | 1.699141 | -0.632934 | -0.975362 |
In [83]:
# 将数据范围限制在3到-3之间(大于3的改为3,小于-3的改为-3)
np.sign(data) # 判断数据正负,正数1,负数-1,生成一个对应数据的1,-1数据
data[np.abs(data) > 3] = np.sign(data) * 3
In [84]:
data.describe()
Out[84]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | -0.040330 | -0.020571 | 0.061779 | 0.001457 |
std | 1.021605 | 0.989099 | 0.998788 | 0.994916 |
min | -3.000000 | -3.000000 | -3.000000 | -2.866783 |
25% | -0.757997 | -0.730113 | -0.626054 | -0.698372 |
50% | -0.048779 | -0.029766 | 0.075186 | 0.004677 |
75% | 0.644870 | 0.623295 | 0.744526 | 0.663374 |
max | 3.000000 | 2.932729 | 2.903664 | 3.000000 |