pandas
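pandas has two main data structures: Series and DataFrame.
- Series: a one-dimensional, array-like object made up of a set of values (of any NumPy data type) and an associated set of labels, called the index. A simple Series can also be created from the values alone, in which case a default integer index is used. Note: index labels in a Series may be repeated.
- DataFrame: a tabular data structure holding an ordered collection of columns, where each column can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a dictionary of Series that share the same row index.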
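The two points above (repeated index labels, and a DataFrame behaving like a dictionary of Series) can be sketched as follows. The variable names and values are illustrative assumptions, not part of the original notes.

```python
import pandas as pd

# A Series may carry repeated index labels; a label lookup then returns every match.
s = pd.Series([1, 2, 3], index=["a", "b", "a"])
dup = s["a"]  # a Series holding both rows labelled "a"
print(dup)

# A DataFrame behaves like a dict of Series that share one row index.
df = pd.DataFrame({"name": ["Jim", "LiLei"], "score": [78.0, 65.0]})
col = df["score"]  # each column comes back as a Series
print(type(col))
```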
Series
Creating a Series from a one-dimensional array
code:
from pandas import Series,DataFrame
import pandas as pd
import numpy as np
a1 = np.array(["Python","C++","Java","PHP"])
ser1 = Series(a1)
print(ser1) # the output includes the default integer index
print(ser1.dtype)
print(ser1.index)
print(ser1.values)
out:
0 Python
1 C++
2 Java
3 PHP
dtype: object
object
RangeIndex(start=0, stop=4, step=1)
['Python' 'C++' 'Java' 'PHP']
code:
ser1.index = ["one","two","three","four"]
print(ser1)
out:
one Python
two C++
three Java
four PHP
dtype: object
code:
ser2 = Series(data = [78,90,65,92],dtype = np.float64,index = ["Jim","HanMei","LiLei","Havorld"])
print(ser2)
out:
Jim 78.0
HanMei 90.0
LiLei 65.0
Havorld 92.0
dtype: float64
Creating a Series from a dictionary
code:
dict1= {"Jim":84,"HanMei":68,"Havorld":96}
ser2 = Series(dict1)
print(ser2) # the dictionary keys become the index and the dictionary values become the values
out:
Jim 84
HanMei 68
Havorld 96
dtype: int64
Accessing Series values
ser3 = Series(data = [78,90,65,92],dtype = np.float64,index = ["Jim","HanMei","LiLei","Havorld"])
print(ser3)
out:
Jim 78.0
HanMei 90.0
LiLei 65.0
Havorld 92.0
dtype: float64
print(ser3.iloc[1]) # select by integer position
print(ser3["Havorld"]) # select by label
print(ser3.iloc[-2]) # negative positions count from the end
out:
90.0
92.0
65.0
print(ser3[1:]) # positional slice
out:
HanMei 90.0
LiLei 65.0
Havorld 92.0
dtype: float64
print(ser3["HanMei":"LiLei"]) # label slice; both endpoints are included
out:
HanMei 90.0
LiLei 65.0
dtype: float64
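Series operations
- NumPy array operations are also available on a Series, and applying them does not change the mapping between index labels and values.
- In practice a Series can be treated much like a NumPy ndarray: the vast majority of ndarray operations also work on a Series.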
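A minimal sketch of the points above, assuming a small Series of scores similar to the ones used elsewhere in these notes: filtering, arithmetic, and a NumPy ufunc all return a Series whose labels still point at the right values.

```python
import numpy as np
import pandas as pd

scores = pd.Series([78.0, 90.0, 65.0, 92.0],
                   index=["Jim", "HanMei", "LiLei", "Havorld"])

passed = scores[scores > 70]  # boolean filtering keeps the matching labels
scaled = scores * 1.1         # element-wise arithmetic
roots = np.sqrt(scores)       # NumPy ufuncs return a Series with the same index

print(passed)
print(scaled)
print(roots)
```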
Detecting missing values in a Series
ser4 = Series({"Jim":84,"HanMei":68,"Havorld":96})
print(ser4)
out:
Jim 84
HanMei 68
Havorld 96
dtype: int64
new_index=["Jim","Lucy","HanMei","Havorld"] # a list, because a set is unordered and cannot serve as an index
ser4 = Series(ser4,index=new_index)
print(ser4)
out:
Jim 84.0
Lucy NaN
HanMei 68.0
Havorld 96.0
dtype: float64
ser5 = pd.isnull(ser4) # test whether each value is missing
print(ser5)
out:
Jim False
Lucy True
HanMei False
Havorld False
dtype: bool
ser6 = pd.notnull(ser4) # test whether each value is present
print(ser6)
out:
Jim True
Lucy False
HanMei True
Havorld True
dtype: bool
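Operations between Series
When several Series objects are combined in an arithmetic operation, values are matched by index label: elements with the same label are combined, and any label present in only one of the Series gets NaN in the result.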
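The alignment rule above can be sketched with two small Series. The names `chinese` and `maths` and their values are assumptions made for illustration; `Series.add` with `fill_value` is shown as one way to avoid the NaN results.

```python
import pandas as pd

chinese = pd.Series({"Jim": 84, "HanMei": 68, "Havorld": 96})
maths = pd.Series({"Jim": 90, "HanMei": 75, "Lucy": 88})

# Labels missing on either side produce NaN in the result.
total = chinese + maths
print(total)

# Treat a missing side as 0 instead of propagating NaN.
total_filled = chinese.add(maths, fill_value=0)
print(total_filled)
```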
The name attribute of a Series and of its index
ser7 = Series({"Jim":84,"HanMei":68,"Havorld":96})
ser7.index.name = "成绩单" # "transcript"
ser7.name = "语文成绩" # "Chinese score"
print(ser7)
out:
成绩单
Jim 84
HanMei 68
Havorld 96
Name: 语文成绩, dtype: int64
DataFrame
Creating a DataFrame from a two-dimensional array
arr = np.array([
["China","USA","English"],
[16,12,100]
])
df1 = DataFrame(arr)
print(df1)
out:
0 1 2 (column index: columns)
0 China USA English (data: values)
1 16 12 100 (data: values)
(the leading 0 and 1 on each data row are the row index: index)
Creating a DataFrame with explicit column and row labels
df2 = DataFrame(arr,columns = ["one","two","three"],index = ["一","二"])
print(df2)
out:
one two three
一 China USA English
二 16 12 100
print(df2.columns)
print(df2.index)
print(df2.values)
out:
Index(['one', 'two', 'three'], dtype='object')
Index(['一', '二'], dtype='object')
[['China' 'USA' 'English']
['16' '12' '100']]
Creating a DataFrame from a dictionary
dict2= {"day":[1,24,12,25],"month":[5,7,3,12],"year":[1990,2001,1997,2018]}
df3 = DataFrame(dict2)
print(df3)
out:
day month year
0 1 5 1990
1 24 7 2001
2 12 3 1997
3 25 12 2018
# replace the default index
df3.index = ["one","two","three","four"]
print(df3)
out:
day month year
one 1 5 1990
two 24 7 2001
three 12 3 1997
four 25 12 2018
Accessing DataFrame data
dict2= {"day":[1,24,12,25],"month":[5,7,3,12],"year":[1990,2001,1997,2018]}
df3 = DataFrame(dict2)
df3.index = ["one","two","three","four"]
print(df3)
out:
day month year
one 1 5 1990
two 24 7 2001
three 12 3 1997
four 25 12 2018
print(df3["year"]) # select a column by its label
print(df3.loc["two"]) # select a row by its index label
out:
one 1990
two 2001
three 1997
four 2018
Name: year, dtype: int64
day 24
month 7
year 2001
Name: two, dtype: int64
df3["century"] = 21 # add a column
df3.loc["five"] = np.nan # add a row
print(df3)
out:
day month year century
one 1.0 5.0 1990.0 21.0
two 24.0 7.0 2001.0 21.0
three 12.0 3.0 1997.0 21.0
four 25.0 12.0 2018.0 21.0
five NaN NaN NaN NaN
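Basic pandas functionality
- Reading data files and text data
- Indexing, selection, and filtering
- Arithmetic and data alignment
- Function application and mapping
- Reindexing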
Reading local data with pandas
read1 = pd.read_csv("E:/Users/Havorld/Desktop/data.csv")
print(read1)
out:
name age source
0 gerry 18 98.5
1 tom 21 78.2
2 lili 24 98.5
3 john 20 89.2
# read text data, using ";" as the field separator and without treating the first line as a header
read2 = pd.read_csv("data.txt",sep=";",header = None)
print(read2)
out:
0 1 2 3 4
0 gerry 18 98.5 89.5 88.5
1 tom 21 98.5 85.5 80.0
2 lili 20 85.6 86.2 NaN
3 john 18 70.0 85.0 60.0
4 joe 19 80.0 85.0 82.0
Commonly used read_csv parameters:
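The original list of parameters is not available here, so the following is only a sketch of a few `read_csv` parameters that are frequently useful: `sep`, `header`, `names`, `usecols`, and `nrows`. The inline text is a hypothetical stand-in for a file such as data.txt.

```python
import io
import pandas as pd

# Inline text stands in for a file on disk (hypothetical contents).
raw = "gerry;18;98.5\ntom;21;78.2\nlili;20;\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                         # field separator
    header=None,                     # the data has no header row
    names=["name", "age", "score"],  # column names to assign
    usecols=["name", "score"],       # keep only these columns
    nrows=2,                         # read at most two data rows
)
print(df)
```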
Filtering and selecting data with pandas
read2.columns = ["name","age","语文","数学","英语"] # assign column names (语文 = Chinese, 数学 = maths, 英语 = English); a list keeps them in order
print(read2)
out:
name age 语文 数学 英语
0 gerry 18 98.5 89.5 88.5
1 tom 21 98.5 85.5 80.0
2 lili 20 85.6 86.2 NaN
3 john 18 70.0 85.0 60.0
4 joe 19 80.0 85.0 82.0
read3 = read2[read2.columns[2:]] # keep only the score columns
print(read3)
out:
语文 数学 英语
0 98.5 89.5 88.5
1 98.5 85.5 80.0
2 85.6 86.2 NaN
3 70.0 85.0 60.0
4 80.0 85.0 82.0
read4 = read3.dropna() # drop the rows that contain NaN
print(read4)
out:
语文 数学 英语
0 98.5 89.5 88.5
1 98.5 85.5 80.0
3 70.0 85.0 60.0
4 80.0 85.0 82.0
Selecting data with loc, iloc, and ix
import numpy as np
import pandas as pd
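# generate sample data
df = pd.DataFrame(np.arange(0,60,2).reshape(10,3),columns=list('abc'))
print(df)
# loc selects data by row label (row index) and column name
# value at row 0, column b
print(df.loc[0, 'b'])
# columns a and b for rows 0 through 3 (label slices include both endpoints)
print(df.loc[0:3, ['a', 'b']])
# columns b and c for rows 1 and 5
print(df.loc[[1, 5], ['b', 'c']])
# iloc selects data by integer row position and integer column position
print(df.iloc[0,1])
print(df.iloc[0:4, [0,1]])
print(df.iloc[[1, 5], 1:3])
# ix accepted both labels and integer positions; it was deprecated in pandas 0.20 and removed in pandas 1.0.
# The same selections with loc, turning column positions into labels through df.columns:
print(df.loc[0,"b"]) # was df.ix[0,"b"]
print(df.loc[0,df.columns[1]]) # was df.ix[0,1]
print(df.loc[0:3,["a","b"]]) # was df.ix[0:3,["a","b"]]
print(df.loc[0:3,df.columns[[0,1]]]) # was df.ix[0:3,[0,1]]
print(df.loc[[1,5],["b","c"]]) # was df.ix[[1,5],["b","c"]]
print(df.loc[[1,5],df.columns[[1,2]]]) # was df.ix[[1,5],[1,2]]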
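Handling missing values (NaN) in pandas
- dropna: drops rows or columns depending on whether they contain missing data; the thresh parameter sets the minimum number of non-missing values a row or column needs in order to be kept
- fillna: fills missing data with a specified value or by propagating neighboring values, for example with ffill or bfill
- isnull: returns an object of boolean values indicating which values are missing (NA)
- notnull: the negation of isnull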
df5=DataFrame([
['Tom',np.nan,456.67,'M'],['Merry',34,456.67,np.nan],
['Gerry',np.nan,np.nan,np.nan],['John',23,np.nan,'M'],
['Joe',18,2300,'F']],columns=['name','age','salary','Gender']
)
print(df5)
name age salary Gender
0 Tom NaN 456.67 M
1 Merry 34.0 456.67 NaN
2 Gerry NaN NaN NaN
3 John 23.0 NaN M
4 Joe 18.0 2300.00 F
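df5.dropna() # drop every row that contains at least one NaN
df5.dropna(axis=1) # drop every column that contains at least one NaN (axis=0, the default, drops rows)
df5.dropna(how='all') # drop only the rows in which every value is NaN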
df6 = DataFrame(np.random.randn(7,3))
print(df6)
0 1 2
0 0.280872 -1.890914 -0.237311
1 0.721152 -0.300591 0.285356
2 -1.748477 0.991288 -0.349774
3 -1.678800 -0.608380 -0.002143
4 -1.273338 0.946480 -1.179870
5 -0.533472 0.669000 0.667644
6 1.339726 0.119211 -1.016756
df6.loc[:4,2] = np.nan # set rows 0 through 4 of column 2 to NaN
print(df6)
0 1 2
0 0.280872 -1.890914 NaN
1 0.721152 -0.300591 NaN
2 -1.748477 0.991288 NaN
3 -1.678800 -0.608380 NaN
4 -1.273338 0.946480 NaN
5 -0.533472 0.669000 0.667644
6 1.339726 0.119211 -1.016756
df7 = df6.fillna(0)
print(df7)
0 1 2
0 0.280872 -1.890914 0.000000
1 0.721152 -0.300591 0.000000
2 -1.748477 0.991288 0.000000
3 -1.678800 -0.608380 0.000000
4 -1.273338 0.946480 0.000000
5 -0.533472 0.669000 0.667644
6 1.339726 0.119211 -1.016756
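The thresh option of dropna and the ffill/bfill style of filling mentioned in the list above are not exercised by the examples so far. A minimal sketch on made-up data follows; it uses the `ffill()` and `bfill()` methods rather than `fillna(method=...)`, since the `method` argument is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": [9.0, np.nan, 11.0, 12.0],
})

at_least_two = df.dropna(thresh=2)  # keep rows with at least 2 non-NaN values
forward = df.ffill()                # propagate the last valid value downward
backward = df.bfill()               # pull the next valid value upward
# A dict gives a different fill value per column; column "c" is left untouched.
by_column = df.fillna({"a": 0.0, "b": df["b"].mean()})

print(at_least_two)
print(forward)
print(backward)
print(by_column)
```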
Common descriptive statistics in pandas
df8 = read3
df8 = df8.dropna()
print(df8)
out:
语文 数学 英语
0 98.5 89.5 88.5
1 98.5 85.5 80.0
3 70.0 85.0 60.0
4 80.0 85.0 82.0
# compute summary statistics for a Series or for each DataFrame column
print(df8.describe())
out:
语文 数学 英语
count 4.000000 4.000000 4.000000
mean 86.750000 86.250000 77.625000
std 14.168627 2.179449 12.297527
min 70.000000 85.000000 60.000000
25% 77.500000 85.000000 75.000000
50% 89.250000 85.250000 81.000000
75% 98.500000 86.500000 83.625000
max 98.500000 89.500000 88.500000
print(df8.count())
print(df8.count(axis = 1))
out:
语文 4
数学 4
英语 4
dtype: int64
0 3
1 3
3 3
4 3
dtype: int64
Correlation coefficients and covariance
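The notes give no example for this heading, so here is a minimal sketch using `corr` and `cov`. The column names are English stand-ins and the values are the score data from the statistics example above.

```python
import pandas as pd

scores = pd.DataFrame({
    "chinese": [98.5, 98.5, 70.0, 80.0],
    "maths": [89.5, 85.5, 85.0, 85.0],
    "english": [88.5, 80.0, 60.0, 82.0],
})

corr_matrix = scores.corr()  # pairwise Pearson correlation between columns
cov_matrix = scores.cov()    # pairwise sample covariance between columns

# The same quantities for one pair of columns, computed on the Series.
pair_corr = scores["chinese"].corr(scores["english"])
pair_cov = scores["chinese"].cov(scores["english"])

print(corr_matrix)
print(cov_matrix)
print(pair_corr, pair_cov)
```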
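Unique values, value counts, and membership
- unique: returns the distinct values of an array
- value_counts: counts how often each value occurs in a Series
- isin: returns a boolean mask showing whether each element of a Series or DataFrame is contained in the given collection of values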
s = Series(["a","b","b","d","c"])
print(s.value_counts())
print(s.isin(["a","b"]))
print(s.unique())
out:
b 2
d 1
c 1
a 1
dtype: int64
0 True
1 True
2 True
3 False
4 False
dtype: bool
['a' 'b' 'd' 'c']
Hierarchical indexing
data = Series([768,325,914,666],index=[
["2015","2015","2015","2016"],
["apple","banana","orange","apple"]
])
print(data)
2015 apple 768
banana 325
orange 914
2016 apple 666
dtype: int64
code:
df9 = DataFrame({
"year":[2001,2001,2002,2002,2003],
"fruit":["apple","banana","apple","banana","apple"],
"production":[121,122,123,124,125],
"profits":[22.1,22.2,22.3,22.4,22.5]
})
print(df9)
year fruit production profits
0 2001 apple 121 22.1
1 2001 banana 122 22.2
2 2002 apple 123 22.3
3 2002 banana 124 22.4
4 2003 apple 125 22.5
df9 = df9.set_index(["year","fruit"]) # combine year and fruit into a two-level index (handy for looking up one fruit in one year)
print(df9)
production profits
year fruit
2001 apple 121 22.1
banana 122 22.2
2002 apple 123 22.3
banana 124 22.4
2003 apple 125 22.5
print(df9.loc[(2002,"apple")]) # the 2002 figures for apples
print(df9.loc[2002]) # the 2002 figures for all fruit
production 123.0
profits 22.3
Name: (2002, apple), dtype: float64
production profits
fruit
apple 123 22.3
banana 124 22.4
df9 = df9.groupby(level="year").sum() # sum production and profits per year
print(df9)
production profits
year
2001 243 44.3
2002 247 44.7
2003 125 22.5
The join function
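No example is given for this heading. As a hedged sketch: `DataFrame.join` combines two frames on their index labels and performs a left join by default. The frames below are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"C": ["C0", "C1"]}, index=["K0", "K3"])

joined_left = left.join(right)                # default how="left": keep every left row
joined_inner = left.join(right, how="inner")  # keep only labels present in both frames

print(joined_left)
print(joined_inner)
```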
The merge function
Use help(pd.merge) to view the function's documentation.
def merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False,
suffixes=('_x', '_y'), copy=True, indicator=False)
The how parameter: {'left', 'right', 'outer', 'inner'}, default 'inner'
left: merge using the keys of the left DataFrame; right: merge using the keys of the right DataFrame
import pandas as pd
from pandas import DataFrame
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'C': ['C0', 'C1', "C2"],
'D': ['D0', 'D1', "D2"],
'K': ['K0', 'K1', "K0"]},
index=['zero', 'one', "two"])
print(left)
print(right)
result = pd.merge(left, right, how='left', left_on='key', right_on="K",
sort=False)
print(result)
out:
A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K0
3 A3 B3 K1
C D K
zero C0 D0 K0
one C1 D1 K1
two C2 D2 K0
Result with how='left':
A B key C D K
0 A0 B0 K0 C0 D0 K0
1 A0 B0 K0 C2 D2 K0
2 A1 B1 K1 C1 D1 K1
3 A2 B2 K0 C0 D0 K0
4 A2 B2 K0 C2 D2 K0
5 A3 B3 K1 C1 D1 K1
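Using the left keys:
1. the first K0 in left matches the 2 K0 rows in right
2. the first K1 in left matches the 1 K1 row in right
3. the second K0 in left matches the 2 K0 rows in right
4. the second K1 in left matches the 1 K1 row in right
Using the right keys works the same way.
Result with how='right':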
A B key C D K
0 A0 B0 K0 C0 D0 K0
1 A2 B2 K0 C0 D0 K0
2 A0 B0 K0 C2 D2 K0
3 A2 B2 K0 C2 D2 K0
4 A1 B1 K1 C1 D1 K1
5 A3 B3 K1 C1 D1 K1
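The left_on and right_on parameters
left_on: the column(s) of the left DataFrame to use as the merge key
right_on: the column(s) of the right DataFrame to use as the merge key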
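To compare the four how options side by side, a small sketch with keys that only partly overlap may help. The frames are made up, and `indicator=True` (a parameter that appears in the signature above) adds a `_merge` column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"K": ["K0", "K1", "K3"], "C": ["C0", "C1", "C3"]})

rows = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, how=how,
                      left_on="key", right_on="K", indicator=True)
    rows[how] = len(merged)  # number of rows each join type produces
    print(how)
    print(merged)
```

Here inner keeps only the keys present on both sides, left and right keep every row of their own side, and outer keeps the union of both.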
The agg function
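No example is given for this heading. A minimal sketch of `DataFrame.agg`, assuming a small frame with the production and profits columns used earlier: it accepts either a list of functions applied to every column or a dict mapping each column to its own function.

```python
import pandas as pd

df = pd.DataFrame({"production": [121, 122, 123],
                   "profits": [22.1, 22.2, 22.3]})

# The same functions for every column: the result has one row per function.
summary = df.agg(["min", "max", "mean"])

# One function per column: the result is a Series indexed by column name.
per_column = df.agg({"production": "sum", "profits": "mean"})

print(summary)
print(per_column)
```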
The apply function
mDataFram["score"]= mSeries.apply(####)
mDataFram["score"] = mDataFram.apply(####)
The function passed to apply operates on the Series, and its result is returned.
It can also return multiple values:
mDataFram[["score","count"]] = mDataFram.apply(####)
mDataFram["score"], mDataFram["count"] = zip(*mDataFram.apply(####))
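The patterns above leave the applied function elided. The sketch below fills it in with hypothetical functions and column names (`grade`, `total`, `best_and_gap` are assumptions, not from the original notes). For the two-column assignment it passes `result_type="expand"` so that the returned tuple is spread over two columns.

```python
import pandas as pd

df = pd.DataFrame({"name": ["gerry", "tom"],
                   "chinese": [98.5, 78.0],
                   "maths": [89.5, 85.5]})

# Series.apply: the function receives one value at a time.
df["grade"] = df["chinese"].apply(lambda x: "A" if x >= 90 else "B")

# DataFrame.apply with axis=1: the function receives one row (a Series) at a time.
df["total"] = df.apply(lambda row: row["chinese"] + row["maths"], axis=1)

def best_and_gap(row):
    """Hypothetical helper returning two values per row."""
    hi = max(row["chinese"], row["maths"])
    lo = min(row["chinese"], row["maths"])
    return hi, hi - lo

# result_type="expand" spreads the returned tuple over two columns.
df[["best", "gap"]] = df.apply(best_and_gap, axis=1, result_type="expand")

# The zip form: unpack the Series of tuples into two sequences.
df["best2"], df["gap2"] = zip(*df.apply(best_and_gap, axis=1))

print(df)
```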
The groupby function
import numpy as np
from pandas import DataFrame
df = DataFrame(
{'key1': ['a', 'a', 'b', 'b', 'a'],
'key2': ['one', 'two', 'one', 'two', 'one'],
'data1': np.random.randn(5),
'data2': np.random.randn(5)})
print(df)
print("--------")
grouped1 = df['data1'].groupby(df['key1'])
print(grouped1.mean())
print("--------")
grouped2 = df['data1'].groupby(df['key2'])
print(grouped2.mean())
out:
key1 key2 data1 data2
0 a one -2.589168 -0.733088
1 a two 0.807556 -0.396627
2 b one -0.425544 -0.007338
3 b two -1.867421 -1.037650
4 a one 0.851296 0.548271
--------
key1
a -0.310106
b -1.146482
Name: data1, dtype: float64
--------
key2
one -0.721139
two -0.529933
Name: data1, dtype: float64