1. Import the module
>>> import pandas as pd
2. Fix truncated row and column display in a DataFrame
>>> pd.set_option('display.max_rows', 100,'display.max_columns', 1000,"display.max_colwidth",1000,'display.width',1000)
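set_option changes the display settings for the rest of the session. If the wider display is only needed for one printout, a scoped alternative is pd.option_context. The sketch below is illustrative only; the option values are simply the ones used above.

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
default_rows = pd.get_option("display.max_rows")

# Options set here apply only inside the with-block.
with pd.option_context("display.max_rows", 100, "display.max_columns", 1000):
    inside_rows = pd.get_option("display.max_rows")

# Outside the block the previous value is back in effect.
after_rows = pd.get_option("display.max_rows")
print(default_rows, inside_rows, after_rows)
```

pd.reset_option("display.max_rows") restores a single option to its default if set_option was used instead.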
3. Load the data table
>>> titanic = pd.read_csv(r"C:\Users\Administrator\Desktop\titanic.csv")
4. Compute the mean age
>>> titanic["Age"].mean()
29.69911764705882
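By default, missing values (NaN) are skipped, and the statistic is computed down the column (over all rows, axis=0) rather than across the columns of each row.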
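The default behaviour can be checked on a tiny example. The DataFrame below is hypothetical toy data, not the Titanic file; it only borrows the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age.
df = pd.DataFrame({"Age": [22.0, np.nan, 38.0],
                   "Fare": [7.25, 8.05, 71.28]})

mean_skip = df["Age"].mean()                 # NaN ignored: (22 + 38) / 2
mean_strict = df["Age"].mean(skipna=False)   # NaN propagates to the result
per_column = df.mean()                       # default axis=0: one value per column
per_row = df.mean(axis=1)                    # axis=1: one value per row

print(mean_skip, mean_strict)
print(per_column)
print(per_row)
```

Passing skipna=False is a way to make missing data visible instead of silently dropping it from the statistic.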
5. Compute the median age and fare
>>> titanic[["Age", "Fare"]].median()
Age     28.0000
Fare    14.4542
dtype: float64
6. Summary statistics for multiple columns with describe()
>>> titanic[["Age", "Fare"]].describe()
              Age        Fare
count  714.000000  891.000000
mean    29.699118   32.204208
std     14.526497   49.693429
min      0.420000    0.000000
25%     20.125000    7.910400
50%     28.000000   14.454200
75%     38.000000   31.000000
max     80.000000  512.329200
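Note that count differs between the two columns (714 vs 891) because describe() counts only non-missing values. The sketch below reproduces that effect on hypothetical toy data and also shows the percentiles parameter, which selects which percentiles are reported.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, no missing fares.
df = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0],
                   "Fare": [7.25, 8.05, 71.28, 7.92]})

stats = df.describe()                          # default summary
custom = df.describe(percentiles=[0.1, 0.9])   # request the 10th and 90th percentiles

print(stats.loc["count"])   # Age counts 3 values, Fare counts 4
print(custom)
```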
7. Statistics for multiple columns with custom aggregations via agg()
>>> titanic.agg({'Age': ['min', 'max', 'median', 'skew'],
...              'Fare': ['min', 'max', 'median', 'mean']})
              Age        Fare
max     80.000000  512.329200
mean          NaN   32.204208
median  28.000000   14.454200
min      0.420000    0.000000
skew     0.389108         NaN
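The NaN entries in the result above are not missing data: they mark statistics that were not requested for that column (mean for Age, skew for Fare). A minimal sketch on hypothetical toy data shows the same pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0],
                   "Fare": [7.25, 8.05, 71.28, 7.92]})

# Different statistics per column; the result has one row per distinct statistic.
summary = df.agg({"Age": ["min", "max"],
                  "Fare": ["max", "mean"]})

print(summary)
# "mean" was not requested for Age and "min" was not requested for Fare,
# so those two cells are NaN.
```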
8. Group by category and compute statistics
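>>> titanic.groupby("Sex").mean(numeric_only=True)  # mean of every numeric column by sex; numeric_only=True is needed in pandas 2.0+, where text columns otherwise raise an error
        PassengerId  Survived    Pclass        Age     SibSp     Parch       Fare
Sex
female   431.028662  0.742038  2.159236  27.915709  0.694268  0.649682  44.479818
male     454.147314  0.188908  2.389948  30.726645  0.429809  0.235702  25.523893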
>>> titanic.groupby("Sex")["Age"].mean()  # mean age by sex
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
>>> titanic.groupby(["Sex", "Pclass"])["Fare"].mean()  # mean fare for each combination of sex and passenger class
Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64
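The groupby calls above follow what the pandas documentation calls split-apply-combine: split the rows by key, apply a statistic to each piece, and combine the results. The sketch below does the split and apply steps by hand on hypothetical toy data and compares them with groupby; the manual loop is for illustration only.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Pclass": [1, 3, 3, 3],
    "Fare": [80.0, 8.0, 16.0, 12.0],
})

# Manual version: split by Sex, apply mean to Fare, combine into a dict.
manual = {}
for sex in df["Sex"].unique():
    manual[sex] = df.loc[df["Sex"] == sex, "Fare"].mean()

# groupby performs the same three steps in one call.
grouped = df.groupby("Sex")["Fare"].mean()

# Grouping by two keys yields a result indexed by (Sex, Pclass) pairs.
by_two = df.groupby(["Sex", "Pclass"])["Fare"].mean()

print(manual)
print(grouped)
print(by_two)
```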
9. Count the records in each category
>>> titanic.groupby("Pclass")["Pclass"].count()
Pclass
1 216
2 184
3 491
Name: Pclass, dtype: int64
>>>
>>> titanic["Pclass"].value_counts()
3 491
1 216
2 184
Name: Pclass, dtype: int64
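The value_counts() method counts the number of records in each category of a column. It is a shortcut: under the hood it is a groupby operation combined with a count of the records in each group. The two outputs above contain the same counts but in a different order, because groupby sorts by category while value_counts() sorts by count in descending order.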
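The equivalence can be verified on a small Series. The values below are hypothetical toy data; only the column name Pclass is taken from the dataset.

```python
import pandas as pd

# Hypothetical passenger classes.
pclass = pd.Series([3, 1, 3, 2, 3, 1], name="Pclass")

by_groupby = pclass.groupby(pclass).count()   # index sorted by category: 1, 2, 3
by_value_counts = pclass.value_counts()       # sorted by count, largest first

print(by_groupby)
print(by_value_counts)

# Same counts once the ordering difference is ignored.
same_counts = by_groupby.to_dict() == by_value_counts.to_dict()
print(same_counts)
```

Calling sort_index() on the value_counts() result gives the same ordering as the groupby version.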