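The main purpose of this post is to organize my own thinking. Most Titanic survival analyses online go fairly deep. After reading some of them and practicing quite a bit, I still felt I needed to write things up myself to understand the relevant functions and models better.
Below is my summary. The analysis focuses on data cleaning, visualization, and prediction with models.
Platform: Jupyter Notebook
Initial data exploration
Set the plot style, Chinese font support for figure titles, and global parameters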
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
sns.set_style('whitegrid',{'font.sans-serif':['simhei']})
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Load the data
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
train = train_data.copy()
test = test_data.copy()
Inspect the data
train_data.head()
train_data.info()
test_data.info()
train_data.describe(include=['object'])
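Columns that may be related to survival:
1. Pclass: cabin class. First class has more passengers of high social standing.
2. Sex: gender. Women first.
3. Age: age. The elderly and children are given priority.
4. SibSp: siblings and spouses aboard. Passengers with more relatives may have had a better chance of being rescued.
5. Parch: parents and children aboard. Parents and children may have been rescued first.
6. Fare: ticket price. It should be correlated with cabin class.
7. Embarked: port of embarkation. I suspect that different ports may reflect differences in social standing and the like.
8. Name: names usually carry a marker of identity or status.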
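One quick way to check hypotheses like these is to compare the mean of Survived across the values of a column. The sketch below uses a tiny hand-made table (invented values, not the real Titanic records), and the helper name `survival_rate_by` is hypothetical.

```python
import pandas as pd

# Toy data invented for illustration only; not taken from train.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 0, 1],
})

def survival_rate_by(df, column):
    """Return the mean of Survived for each distinct value of `column`."""
    return df.groupby(column)["Survived"].mean()

rate_by_sex = survival_rate_by(toy, "Sex")
rate_by_class = survival_rate_by(toy, "Pclass")
print(rate_by_sex)
print(rate_by_class)
```

On the real data the same call would be `survival_rate_by(train, "Sex")`, assuming `train` has been loaded as above.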
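Data cleaning
Fill in missing values:
The train_data.info() output shows that Embarked has the fewest missing values, so fill that column first.
train[train.Embarked.isnull()]
Both passengers with a missing Embarked value happen to be female, so draw box plots of Fare against Embarked and Pclass for the female passengers: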
fig, ax = plt.subplots(figsize=(16,12),ncols=2)
ax1 = sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=train[train.Sex == 'female'], ax = ax[0]);
ax2 = sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=test_data[test_data.Sex == 'female'], ax = ax[1]);
ax1.set_title("Training Set", fontsize = 18)
ax2.set_title('Test Set', fontsize = 18)
fig.show()
Port C looks like the best match.
train['Embarked'] = train['Embarked'].fillna('C')
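The visual comparison from the box plots can also be done numerically: take the median Fare per port and pick the port whose median is closest to the fare paid by the passengers with a missing value. This is only a sketch on invented numbers; `target_fare` is a hypothetical placeholder for the fare you read off the rows with missing Embarked.

```python
import pandas as pd

# Toy first-class fares per port, invented for illustration only.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "S", "S", "Q", "Q"],
    "Pclass":   [1, 1, 1, 1, 1, 1],
    "Fare":     [78.0, 82.0, 50.0, 54.0, 90.0, 90.0],
})

target_fare = 80.0  # hypothetical fare of the passengers with missing Embarked

# Median fare per port, then the port whose median is nearest the target.
medians = toy.groupby("Embarked")["Fare"].median()
closest_port = (medians - target_fare).abs().idxmin()
print(medians)
print(closest_port)
```

On the real data you would probably filter to the same Pclass and Sex as the affected passengers before grouping, mirroring the box plots.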
Fill in the missing Fare value in the test set:
farevalue = test[(test.Pclass == 3) & (test.Embarked == "S") & (test.Sex == "male")].Fare.mean()
test['Fare'] = test['Fare'].fillna(farevalue)
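The remaining columns with missing values are Age and Cabin. Cabin is less important and is mostly missing (about 78%), so replace its missing values with 'U'.
train['Cabin'] = train['Cabin'].fillna('U')
test['Cabin'] = test['Cabin'].fillna('U')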
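The share of missing values per column can be read directly from `isnull().mean()`, which is a handy check before deciding to fill a column with a placeholder. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration; the real columns come from train.csv.
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 30.0, 41.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Fraction of missing values in each column (True counts as 1 in the mean).
missing_ratio = toy.isnull().mean()
print(missing_ratio)

# Replace missing cabins with the placeholder 'U' (unknown).
toy["Cabin"] = toy["Cabin"].fillna("U")
print(toy["Cabin"].tolist())
```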
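Before handling Age, extract the information contained in the names. Passenger names (Name) have one striking feature: every name includes a specific form of address or title. Once extracted, this becomes a very useful new variable that can help with prediction.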
all_data = pd.concat([train, test], ignore_index = True)
all_data['Title'] = all_data['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
Title_Dict = {}
Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
Title_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))
all_data['Title'] = all_data['Title'].map(Title_Dict)
sns.barplot(x="Title", y="Survived", data=all_data, palette='Set3')
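As an alternative to chaining `split` calls, the title can be pulled out with a regular expression via `Series.str.extract`. The sketch below also shows why it is worth checking the result of `map`: any title missing from the dictionary becomes NaN. The sample names follow the dataset's "Surname, Title. Given names" format, and `title_map` is a deliberately incomplete, hypothetical mapping.

```python
import pandas as pd

# Sample names in the dataset's "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())

# A deliberately incomplete mapping, to show how unmapped titles surface.
title_map = {"Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss"}
mapped = titles.map(title_map)
unmapped = titles[mapped.isnull()].tolist()
print(unmapped)
```

Running the same `unmapped` check against the full `Title_Dict` is a cheap way to confirm that no title in the combined data was left out.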
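Fill in Age. The usual approach is to substitute the median or the mean. That keeps the dataset complete, but it tends to lose the differences and relationships within the data. Here I try to fill in the values with two-fold cross-validation (Cross-Validation) instead: split the training set into two halves, fit a model on one half to predict the missing ages, swap the halves, and average the two results.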
from sklearn.model_selection import train_test_split
train = all_data[all_data['Survived'].notnull()].copy()
test = all_data[all_data['Survived'].isnull()].copy()
# Split the training set into two equal halves
train_split_1, train_split_2 = train_test_split(train, test_size=0.5, random_state=0)
from sklearn.ensemble import RandomForestRegressor

def predict_age_use_cross_validationg(df1, df2, dfTest):
    # Note: get_dummies is applied to each frame separately, so this assumes
    # every frame contains all Sex/Title categories; otherwise the feature
    # columns of the frames will not line up.
    age_df1 = df1[['Age', 'Pclass', 'Sex', 'Title']]
    age_df1 = pd.get_dummies(age_df1)
    age_df2 = df2[['Age', 'Pclass', 'Sex', 'Title']]
    age_df2 = pd.get_dummies(age_df2)
    known_age = age_df1[age_df1.Age.notnull()].to_numpy(dtype=float)
    unknow_age_df1 = age_df1[age_df1.Age.isnull()].to_numpy(dtype=float)
    unknown_age = age_df2[age_df2.Age.isnull()].to_numpy(dtype=float)
    print(unknown_age.shape)
    y = known_age[:, 0]   # first column is Age (the target)
    X = known_age[:, 1:]  # remaining columns are the features
    # Fit on the rows of df1 with a known Age
    rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
    rfr.fit(X, y)
    # Fill the missing ages in df2, then in df1 (both are modified in place)
    predictedAges = rfr.predict(unknown_age[:, 1:])
    df2.loc[df2.Age.isnull(), 'Age'] = predictedAges
    predictedAges = rfr.predict(unknow_age_df1[:, 1:])
    df1.loc[df1.Age.isnull(), 'Age'] = predictedAges
    # Refit on df2 (now complete) plus the test rows with a known Age,
    # then fill the missing ages in the test set
    age_Test = dfTest[['Age', 'Pclass', 'Sex', 'Title']]
    age_Test = pd.get_dummies(age_Test)
    age_Tmp = df2[['Age', 'Pclass', 'Sex', 'Title']]
    age_Tmp = pd.get_dummies(age_Tmp)
    age_Tmp = pd.concat([age_Test[age_Test.Age.notnull()], age_Tmp])
    known_age1 = age_Tmp.to_numpy(dtype=float)
    unknown_age1 = age_Test[age_Test.Age.isnull()].to_numpy(dtype=float)
    y = known_age1[:, 0]
    x = known_age1[:, 1:]
    rfr.fit(x, y)
    predictedAges = rfr.predict(unknown_age1[:, 1:])
    dfTest.loc[dfTest.Age.isnull(), 'Age'] = predictedAges
    return dfTest
t1 = train_split_1.copy()
t2 = train_split_2.copy()
tmp1 = test.copy()
t5 = predict_age_use_cross_validationg(t1,t2,tmp1)
t1 = pd.concat([t1,t2])
t3 = train_split_1.copy()
t4 = train_split_2.copy()
tmp2 = test.copy()
t6 = predict_age_use_cross_validationg(t4,t3,tmp2)
t3 = pd.concat([t3,t4])
train['Age'] = (t1['Age'] + t3['Age'])/2
test['Age'] = (t5['Age'] + t6['Age']) / 2
print (train.describe())
print (test.describe())
all_data = pd.concat([train,test])
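The averaging step above works even though the two concatenated frames list the rows in different orders, because pandas aligns Series on their index labels before adding them. A minimal sketch with invented numbers:

```python
import pandas as pd

# Two sets of age estimates for the same passengers (labels 0, 1, 2),
# stored in different row orders. Values are invented for illustration.
first_pass = pd.Series([20.0, 30.0, 40.0], index=[0, 1, 2])
second_pass = pd.Series([44.0, 24.0, 34.0], index=[2, 0, 1])

# Addition matches rows by index label, not by position.
averaged = (first_pass + second_pass) / 2
print(averaged)
```

This relies on the index labels being unique, which holds here because `all_data` was built with `ignore_index=True`. A passenger whose Age was known from the start has the same value in both passes, so the average leaves it unchanged.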