Data quality issues include
- missing values
- inconsistent values
- invalid values
- implausible values
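A minimal sketch of what the four issue types look like in practice. The records and field names (`age`, `country`) are invented for illustration, as is the choice of 120 as a plausibility bound.

```python
# Hypothetical records illustrating the four data quality issue types.
records = [
    {"age": 34,   "country": "UK"},             # clean
    {"age": None, "country": "UK"},             # missing: value absent
    {"age": 28,   "country": "United Kingdom"}, # inconsistent: two spellings
    {"age": -5,   "country": "UK"},             # invalid: violates age >= 0
    {"age": 190,  "country": "UK"},             # implausible: legal type, unbelievable value
]

def quality_issues(rec):
    """Return the list of issue labels that apply to one record."""
    issues = []
    if rec["age"] is None:
        issues.append("missing")
    elif rec["age"] < 0:
        issues.append("invalid")
    elif rec["age"] > 120:          # assumed plausibility bound
        issues.append("implausible")
    if rec["country"] not in {"UK"}:  # "UK" assumed to be the canonical spelling
        issues.append("inconsistent")
    return issues

for rec in records:
    print(rec, quality_issues(rec))
```

Note that "invalid" and "implausible" differ only in whether a rule is hard (age cannot be negative) or soft (age 190 is technically possible to store but not believable).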
Data preparation workflow
Data profiling methods are used to:
- Characterise data and provide high-level insights
- Investigate data quality so it may be cleaned

The data preparation workflow includes three steps:
1: Discover
- What data sources and level of detail?
- What spatio-temporal coverage and cost?
2: Wrangle
- Read in data, reformat, transform, link
3: Profile
- Rigorous investigation of data quality
- A subset of data preparation
I: Look at your data
- Number of rows & file size
- Example values
- Data format
- Check the format yourself; don't rely on heuristics
- Don't assume that all your data files use the same format, even if the files come from one source
- Check the data types
- How is it encoded?
- Why you must care about encoding: if you use anything other than the most basic English text, people may not be able to read your data unless you state the character encoding
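The checks above can be sketched with the standard library alone. The inline byte string stands in for a real file (it contains a non-ASCII character, "Björn"); the assumption that the file is UTF-8 is exactly the kind of thing you must state rather than guess.

```python
import csv
import io

# Hypothetical file contents; "Bj\xc3\xb6rn" is "Björn" in UTF-8 bytes.
raw = b"name,score\nAlice,9\nBj\xc3\xb6rn,7\n"

# Encoding check: bytes outside basic ASCII mean the file cannot be read
# reliably unless the character encoding is known and stated.
try:
    text = raw.decode("ascii")
except UnicodeDecodeError:
    text = raw.decode("utf-8")   # assumption: the file is UTF-8

rows = list(csv.DictReader(io.StringIO(text)))
print("file size (bytes):", len(raw))
print("number of rows:", len(rows))
print("example values:", rows[0])
# Check the data types yourself rather than trusting heuristics:
print("score looks numeric:", all(r["score"].isdigit() for r in rows))
```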
II: Read your data correctly
- Watch out for special values
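Special values are often sentinel strings such as "N/A" or "-999" that encode missingness. A minimal sketch of mapping them to `None` on read; the sentinel set below is an assumption, so check what your own data source actually uses.

```python
# Assumed sentinel strings standing for "missing" -- verify per source.
SENTINELS = {"", "NA", "N/A", "-999", "null"}

def parse_cell(cell):
    """Turn a raw string cell into None (missing) or its stripped value."""
    return None if cell.strip() in SENTINELS else cell.strip()

row = ["42", "-999", "N/A", "ok"]
print([parse_cell(c) for c in row])  # → ['42', None, None, 'ok']
```

Reading "-999" as the number minus-nine-ninety-nine instead of as "missing" silently corrupts every downstream statistic, which is why this check belongs at read time.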
III: Is all the data there?
1: Missing values
- Terrible statistical terminology
- Visualisation helps reveal patterns of missingness
1.1: Missing at random (MAR)
- Missingness is related to other variables
- The term is misleading!
1.2: Missing completely at random (MCAR)
- Haphazard
- Unrelated to the values of the variable, or of other variables
1.3: Missing not at random (MNAR)
- Related to the values of the variable itself
2: Coverage (e.g. temporal or geographic)
2.1: Temporal coverage
2.2: Spatial coverage
3: Duplicates
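The three "is all the data there?" checks can be sketched on a toy dataset. The observations (day, station, value) and the expected date range are invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical observations: (day, station, value); None = missing value.
obs = [
    (date(2024, 1, 1), "A", 3.1),
    (date(2024, 1, 1), "A", 3.1),   # exact duplicate row
    (date(2024, 1, 2), "A", None),  # value recorded as missing
    (date(2024, 1, 4), "B", 2.7),   # note: 3 Jan is absent entirely
]

# 1: Missing values
missing = sum(1 for _, _, v in obs if v is None)

# 2.1: Temporal coverage -- which expected days never appear at all?
expected = {date(2024, 1, n) for n in range(1, 5)}  # assumed range
gaps = sorted(expected - {d for d, _, _ in obs})

# 2.2: Spatial coverage -- which stations report at all?
stations = {s for _, s, _ in obs}

# 3: Duplicates -- rows appearing more than once
dupes = [row for row, n in Counter(obs).items() if n > 1]

print(missing, gaps, stations, dupes)
```

The coverage gap (3 Jan) is invisible to a per-row missing-value count, which is why coverage is listed as a separate check from missing values.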
IV: Rigorously check data quality
How to write data validation rules:
1.1: Subject-matter specialists typically use free text to describe valid values and explain how to clean them
1.2: Data scientists may need to write validation & cleaning rules as pseudocode
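A sketch of that handover: a specialist's free-text rule turned into a validation function. The rule text, the field (`age`), and the bounds are invented for illustration.

```python
# Free-text rule (specialist): "Age must be between 0 and 120;
# blank ages should be flagged, not silently dropped."

def validate_age(raw):
    """Return (status, cleaned_value) for one raw age string."""
    if raw is None or raw.strip() == "":
        return ("flag_missing", None)
    try:
        age = int(raw)
    except ValueError:
        return ("invalid_format", None)
    if 0 <= age <= 120:
        return ("ok", age)
    return ("out_of_range", None)

print(validate_age("34"))   # → ('ok', 34)
print(validate_age(""))     # → ('flag_missing', None)
print(validate_age("190"))  # → ('out_of_range', None)
```

Returning a status label rather than a bare pass/fail preserves the specialist's cleaning instructions (flag, don't drop) in the code itself.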