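The First Machine Learning Model
Reading data with Pandas
The first step in machine learning is handling the data. For tabular data (such as .csv files), we use the pandas library.
Load the data and look at its summary statistics: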
import pandas as pd
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)
home_data.describe()
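Selecting the data we need
Use home_data.columns to list the names of all columns in the table.
One of the columns, SalePrice, is the house price we want to predict. We assign it to y:
y = home_data.SalePrice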
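As a minimal sketch of the column listing and target selection described above, the following uses a tiny made-up table instead of the real train.csv. The name toy_data and all of its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the housing table; the values are made up.
toy_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "SalePrice": [208500, 181500, 223500],
})

# .columns holds the column labels of the DataFrame.
column_names = list(toy_data.columns)

# Attribute access returns the column as a Series,
# equivalent to toy_data["SalePrice"].
toy_y = toy_data.SalePrice

print(column_names)
print(toy_y.tolist())
```

Attribute access only works when the column name is a valid Python identifier; a name such as 1stFlrSF has to be selected with bracket syntax.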
Next, we pick some features related to the house price and assign them to X.
feature_names = ['LotArea','YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr',
'TotRmsAbvGrd']
# Select data corresponding to features in feature_names
X = home_data[feature_names]
Check the selected data for anything unusual:
print(X.describe())
# print the top few lines
print(X.head())
Building (and training) the model
This part is simple: we use a decision tree model from scikit-learn directly:
from sklearn.tree import DecisionTreeRegressor
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit the model
iowa_model.fit(X, y)
Predicting with the model
predictions = iowa_model.predict(X)
print(predictions)
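Note that the predictions above are made on the same rows the model was fitted on. The sketch below, with made-up numbers (toy_X, toy_y, and toy_model are hypothetical names, not course data), suggests why such in-sample predictions can look better than they deserve: an unrestricted decision tree can reproduce its training targets exactly when every feature row is distinct.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: one feature with distinct values, arbitrary targets.
toy_X = np.array([[1], [2], [3], [4], [5], [6]])
toy_y = np.array([10.0, 12.0, 9.0, 20.0, 18.0, 25.0])

# Default settings place no limit on depth or leaf count.
toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(toy_X, toy_y)

# Predict on the very rows used for fitting.
in_sample = toy_model.predict(toy_X)

# Each training row ends up alone in its own leaf, so the
# predictions equal the training targets.
matches_training_targets = bool(np.array_equal(in_sample, toy_y))
print(matches_training_targets)
```

A perfect match here says nothing about how the model behaves on houses it has not seen, which is the motivation for holding out data.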
Model Validation
Splitting the data into training and validation sets
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
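To see what this call does, here is a small sketch on made-up data (toy_X, toy_y, and the tr_/va_ names are hypothetical). When neither test_size nor train_size is passed, scikit-learn holds out 25% of the rows, and a fixed random_state makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows; each target equals its row index.
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.arange(100)

# No test_size given, so the default 75/25 split applies.
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=1)
print(len(tr_X), len(va_X))

# Every row lands in exactly one of the two sets.
no_overlap = set(tr_y.tolist()).isdisjoint(va_y.tolist())

# The same random_state reproduces the same split.
_, va_X_again, _, _ = train_test_split(toy_X, toy_y, random_state=1)
same_split = bool(np.array_equal(va_X, va_X_again))
print(no_overlap, same_split)
```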
This time, fit only on the training set:
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit iowa_model with the training data.
iowa_model.fit(train_X,train_y)
Then predict on the validation set:
val_predictions = iowa_model.predict(val_X)
Compute the error between the actual and predicted values:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)
# Print the validation MAE
print(val_mae)
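For reference, mean absolute error is the average of |actual - predicted| over all rows. The sketch below checks that definition against mean_absolute_error on three made-up values (actual, predicted, and both mae_ names are hypothetical).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted prices.
actual = np.array([3.0, 5.0, 10.0])
predicted = np.array([2.0, 7.0, 10.0])

# Library computation.
mae_sklearn = mean_absolute_error(actual, predicted)

# Manual computation: absolute errors are 1, 2 and 0, so the mean is 1.
mae_manual = np.abs(actual - predicted).mean()

print(mae_sklearn, mae_manual)
```

Because the error is expressed in the same units as the target, a validation MAE on the housing data reads directly as an average miss in sale price.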