This exercise tests your ability to read a data file and interpret summary statistics about the data.
In later exercises, you will apply techniques to filter the data, build a machine learning model, and iteratively improve that model.
The course examples use data from Melbourne. To make sure you can apply these techniques on your own, you will apply them to a new dataset of house prices from Iowa.
Exercises
Run the following cell to set up code checking, which will verify your work as you go.
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex2 import *
print("Setup Complete")
Step 1: Loading Data
Read the Iowa data file into a pandas DataFrame called home_data.
import pandas as pd
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
# Fill in the line below to read the file into a variable home_data
home_data = pd.read_csv(iowa_file_path)
# Call line below with no argument to check that you've loaded the data correctly
step_1.check()
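As a rough illustration of what the loading step does, the sketch below reads a small in-memory CSV instead of the real file and runs a few sanity checks. The column names (Id, LotArea, YearBuilt, SalePrice), the values, and the variable names sample_csv and sample_data are all made up for this example; they are assumptions, not a description of the actual Iowa file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (made-up columns and values).
# In the exercise itself, the path in iowa_file_path is passed to read_csv instead.
sample_csv = io.StringIO(
    "Id,LotArea,YearBuilt,SalePrice\n"
    "1,8450,2003,208500\n"
    "2,9600,1976,181500\n"
    "3,11250,2001,223500\n"
)

# read_csv accepts a file path or any file-like object
sample_data = pd.read_csv(sample_csv)

# Quick sanity checks: dimensions, column names, and the first rows
print(sample_data.shape)
print(list(sample_data.columns))
print(sample_data.head())
```

Checking the shape and column names right after loading is a cheap way to catch a wrong path or an unexpected delimiter before moving on.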
Step 2: Review The Data
Use the command you learned to view summary statistics of the data. Then fill in the variables to answer the following questions.
# Print summary statistics in the next line
print(home_data.describe())
# What is the average lot size (rounded to nearest integer)?
avg_lot_size = ____
# As of today, how old is the newest home (current year minus the year in which it was built)?
newest_home_age = ____
# Checks your answers
step_2.check()
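The two questions above can be answered by reading values off the summary table. The sketch below shows one possible way to do that on a tiny made-up DataFrame; the column names LotArea and YearBuilt and all values are assumptions chosen for illustration, so the numbers it produces are not the answers to the exercise.

```python
import datetime

import pandas as pd

# Made-up data; column names are assumed for illustration only
toy = pd.DataFrame(
    {
        "LotArea": [8000, 9500, 12000],
        "YearBuilt": [1976, 2001, 2010],
    }
)

# describe() returns count, mean, std, min, quartiles, and max per numeric column
summary = toy.describe()
print(summary)

# Average lot size, rounded to the nearest integer
avg_lot = int(round(float(summary.loc["mean", "LotArea"])))

# Age of the newest home: current year minus the latest build year
newest_age = datetime.date.today().year - int(summary.loc["max", "YearBuilt"])

print(avg_lot, newest_age)
```

Reading the values from the describe() output by row and column label avoids retyping numbers by hand, which is where rounding mistakes tend to creep in.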
Think About Your Data
The newest house in your data isn't that new. A few potential explanations for this:
- They haven't built new houses where this data was collected.
- The data was collected a long time ago. Houses built after the data publication wouldn't show up.
If the reason is explanation #1 above, does that affect your trust in the model you build with this data? What if it is explanation #2?
How could you dig into the data to see which explanation is more plausible?
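One possible way to approach this question is to count homes by build year and compare the newest build year with the latest recorded sale. The sketch below does this on made-up data. It assumes a build-year column named YearBuilt and a sale-year column named YrSold; both names and all values are hypothetical, and the dataset you work with may not contain a sale-year column at all.

```python
import pandas as pd

# Made-up data; both column names are assumptions for illustration
toy = pd.DataFrame(
    {
        "YearBuilt": [1998, 2005, 2006, 2006, 2007, 2007, 2007, 2008],
        "YrSold": [2008, 2008, 2009, 2010, 2009, 2010, 2010, 2010],
    }
)

# Number of homes per build year, in chronological order
homes_per_year = toy["YearBuilt"].value_counts().sort_index()
print(homes_per_year.tail(5))

# Compare the newest build year with the latest recorded sale year
latest_build = toy["YearBuilt"].max()
latest_sale = toy["YrSold"].max()
print(latest_build, latest_sale)
```

If the latest sale year is itself well in the past, that points toward the data having been collected long ago. If sales continue into recent years while build years stop much earlier, a lack of new construction looks more plausible. Neither pattern is conclusive on its own.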
Check out this discussion thread to see what others think or to add your ideas.
Keep Going
You are ready for Your First Machine Learning Model.