【R for Data Science】(3) Data Transformation

Data rarely arrives in the form needed for visualization, so it has to be transformed first: creating new variables, renaming variables, reordering observations, and so on.

1. Installation

This post uses the nycflights13 and tidyverse packages. Most of the functions come from dplyr, which is loaded as part of tidyverse:

library(nycflights13)
library(tidyverse)

The flights object in nycflights13 contains all 336,776 flights that departed from New York City in 2013:

flights.png

Note the abbreviations for column types:

int stands for integers.

dbl stands for doubles, or real numbers.

chr stands for character vectors, or strings.

dttm stands for date-times (a date + a time).

lgl stands for logical, vectors that contain only TRUE or FALSE.

fct (printed as fctr by older versions of tibble) stands for factors, which R uses to represent categorical variables with fixed possible values.

date stands for dates.

dplyr basics:

filter() : Pick observations by their values.

arrange() : Reorder the rows.

select() : Pick variables by their names.

mutate() : Create new variables with functions of existing variables.

summarise() : Collapse many values down to a single summary.

2. `filter()`: picking observations (rows)

Select rows by specific values:

filter1.png
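As a minimal sketch of the call pattern, the example below filters a tiny hand-made data frame. The name `trips` and its values are invented for illustration and are not part of nycflights13; on the real data the same pattern would be `filter(flights, month == 1, day == 1)`.

```r
library(dplyr)

# Hypothetical miniature stand-in for flights (values invented for illustration)
trips <- data.frame(
  month     = c(1, 1, 2, 12),
  day       = c(1, 2, 1, 25),
  dep_delay = c(2, 15, -3, 40)
)

# Comma-separated conditions are combined with "and"
jan1 <- filter(trips, month == 1, day == 1)
jan1
```

filter() returns a new data frame and leaves its input untouched, so the result has to be assigned to a name if it is needed later.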

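2.1 near() tests whether two floating-point numbers are equal within a small tolerance, which is safer than == for values produced by arithmetic: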

near().png
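A short sketch of why near() matters: computers store floating-point numbers with finite precision, so a result that is mathematically exact can differ from the expected value in the last few bits. The variable names `exact` and `close_enough` are only for illustration.

```r
library(dplyr)

# sqrt(2)^2 is not stored as exactly 2, so == reports a mismatch
exact <- sqrt(2)^2 == 2

# near() compares with a small tolerance instead of bit-for-bit equality
close_enough <- near(sqrt(2)^2, 2)

exact
close_enough
```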

2.2 Logical operators:

& is "and", | is "or", and ! is "not".
x %in% y: This will select every row where x is one of the values in y.
!(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.

logical operators.png

filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)

filter2.png
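The equivalences above can be checked on a small example. This is a sketch on an invented data frame (`delays` and its values are hypothetical), not on the real flights data.

```r
library(dplyr)

# Hypothetical data, invented for illustration
delays <- data.frame(
  month     = c(10, 11, 12, 11),
  arr_delay = c(130, 20, 90, 200),
  dep_delay = c(10, 150, 60, 5)
)

# %in% is shorthand for a chain of == comparisons joined by |
nov_dec_or <- filter(delays, month == 11 | month == 12)
nov_dec_in <- filter(delays, month %in% c(11, 12))
identical(nov_dec_or, nov_dec_in)

# De Morgan: !(x | y) selects the same rows as !x & !y
not_late_a <- filter(delays, !(arr_delay > 120 | dep_delay > 120))
not_late_b <- filter(delays, arr_delay <= 120, dep_delay <= 120)
identical(not_late_a, not_late_b)
```

The %in% form tends to scale better once there are more than two or three values to match.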

2.3 Missing values:

NA represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

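Use is.na() to test whether a value is NA. filter() keeps only the rows where the condition is TRUE, so rows where it evaluates to NA are dropped. To keep them, ask for them explicitly:

df <- tibble(x = c(1, NA, 3))
filter(df, is.na(x) | x > 1)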
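The following sketch shows how NA propagates through comparisons and how that affects filter(). The data frame `obs` is a made-up example.

```r
library(dplyr)

# Any comparison with an unknown value is itself unknown
na_gt   <- NA > 5
na_self <- NA == NA
na_gt
na_self

# Hypothetical data with one missing value
obs <- data.frame(x = c(1, NA, 3))

# The NA row is dropped because its condition evaluates to NA, not TRUE
dropped <- filter(obs, x > 1)

# Adding is.na(x) keeps the missing row as well
kept <- filter(obs, is.na(x) | x > 1)

dropped
kept
```

Note that `NA == NA` is also NA: two unknown values cannot be assumed equal, which is why is.na() is needed rather than `x == NA`.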

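3. `arrange()`: reordering rows

Rows are sorted in ascending order by default. Use desc() to sort in descending order. Missing values are always placed at the end, in either direction: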

desc.png
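A small sketch of both sort directions, including where NA ends up. The data frame `delays` and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical data with one missing delay
delays <- data.frame(
  carrier   = c("AA", "UA", "DL", "B6"),
  dep_delay = c(5, NA, 30, -2),
  stringsAsFactors = FALSE
)

# Ascending order (the default)
ascending <- arrange(delays, dep_delay)

# Descending order; the NA row still sorts last
descending <- arrange(delays, desc(dep_delay))

ascending
descending
```

When several columns are passed to arrange(), each additional column is used to break ties in the preceding ones.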

4. `select()`: picking variables (columns)

flights has 19 variables. select() keeps only the ones needed for later analysis:

select.png
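Three common ways of naming columns in select() can be sketched as follows. The data frame `trips` is a hypothetical stand-in that reuses a few flights column names.

```r
library(dplyr)

# Hypothetical data reusing some column names from flights
trips <- data.frame(
  year      = 2013,
  month     = 1:3,
  day       = c(1, 15, 30),
  dep_delay = c(2, -1, 10),
  arr_delay = c(5, 0, 12)
)

# Select columns by name
by_name <- select(trips, year, month, day)

# Select every column between year and day (inclusive)
by_range <- select(trips, year:day)

# Select every column except those from year to day
all_but <- select(trips, -(year:day))

names(by_name)
names(by_range)
names(all_but)
```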

There are a number of helper functions you can use within select():

starts_with("abc"): matches names that begin with “abc”.

ends_with("xyz"): matches names that end with “xyz”.

contains("ijk"): matches names that contain “ijk”.

matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters.

num_range("x", 1:3): matches x1, x2 and x3.
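Each helper listed above can be tried on a one-row data frame. This is only a sketch: `wide` and its column values are invented so that every helper has something to match.

```r
library(dplyr)

# Hypothetical one-row data frame with names chosen to exercise each helper
wide <- data.frame(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  x1 = 1, x2 = 2, x3 = 3, x4 = 4
)

starts   <- names(select(wide, starts_with("dep")))
ends     <- names(select(wide, ends_with("delay")))
has_time <- names(select(wide, contains("time")))
# "(.)\\1" matches any character followed by itself, here the "rr" in "arr"
repeated <- names(select(wide, matches("(.)\\1")))
numbered <- names(select(wide, num_range("x", 1:3)))

starts
ends
has_time
repeated
numbered
```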

rename() renames variables using the form new_name = old_name and keeps every column that is not mentioned:

rename.png
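A minimal sketch of rename(), using a made-up data frame `trips` whose values are for illustration only:

```r
library(dplyr)

# Hypothetical data, invented for illustration
trips <- data.frame(
  tailnum   = c("N14228", "N24211"),
  dep_delay = c(2, 4),
  stringsAsFactors = FALSE
)

# The new name goes on the left, the existing name on the right
renamed <- rename(trips, tail_num = tailnum)

names(renamed)
```

Unlike select(), which would drop dep_delay if it were not listed, rename() leaves the other columns in place.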

5. `mutate()`: adding new variables

By default, new variables are added at the end of the data frame:

mutate.png
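The sketch below adds two derived columns to a small invented data frame. The names `gain` and `speed` and all values are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical data reusing some column names from flights
trips <- data.frame(
  distance  = c(1400, 1416, 1089),
  air_time  = c(227, 227, 160),
  dep_delay = c(2, 4, 2),
  arr_delay = c(11, 20, 33)
)

with_new <- mutate(trips,
  gain  = dep_delay - arr_delay,    # time made up in the air, in minutes
  speed = distance / air_time * 60  # distance units per hour
)

names(with_new)
with_new$gain
```

Columns created earlier in the same mutate() call can be used by later expressions in that call.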

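6. `summarise()`: grouped summaries

On its own, summarise() collapses a whole data frame into a single row. It is usually paired with group_by(), which changes the unit of analysis from the whole dataset to individual groups, giving one row per group: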

image.png
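The difference between an ungrouped and a grouped summary can be sketched as follows. The data frame `delays` is invented for illustration, and `na.rm = TRUE` is used so the missing value does not turn the mean into NA.

```r
library(dplyr)

# Hypothetical data with one missing delay
delays <- data.frame(
  month     = c(1, 1, 2, 2, 2),
  dep_delay = c(10, NA, 0, 20, 40)
)

# Without grouping: one row for the whole data frame
overall <- summarise(delays, delay = mean(dep_delay, na.rm = TRUE))

# With grouping: one row per month, plus the number of rows in each group
by_month <- delays %>%
  group_by(month) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE), n = n())

overall
by_month
```

Including a count such as n() alongside each summary helps guard against drawing conclusions from groups with very few observations.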
