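Raw data usually cannot be visualized directly, so we first need to transform it: for example, by creating new variables, renaming variables, or reordering observations.
1. Setup
This section uses two packages, nycflights13 and tidyverse, and relies mainly on functions from the dplyr package: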
library(nycflights13)
library(tidyverse)
The flights data object in nycflights13 contains 336,776 records of flights that departed New York City in 2013:
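A minimal way to inspect the data, assuming both packages are installed. Printing a tibble shows the first rows together with a type abbreviation under each column name:

```r
library(nycflights13)
library(dplyr)

# Printing a tibble shows the first rows and only the columns that fit on screen
flights

# Number of rows (flights) and columns (variables)
dim(flights)
```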
Note:
- int stands for integers.
- dbl stands for doubles, or real numbers.
- chr stands for character vectors, or strings.
- dttm stands for date-times (a date + a time).
- lgl stands for logical, vectors that contain only TRUE or FALSE.
- fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
- date stands for dates.
dplyr basics
- filter(): Pick observations by their values.
- arrange(): Reorder the rows.
- select(): Pick variables by their names.
- mutate(): Create new variables with functions of existing variables.
- summarise(): Collapse many values down to a single summary.
2. filter()
Filter observations (rows)
Select rows with specific values:
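As a short sketch of selecting by value, the following keeps only the flights from January 1st. The name jan1 is chosen here purely for illustration:

```r
library(nycflights13)
library(dplyr)

# Conditions separated by commas are combined with "and":
# keep rows where month is 1 AND day is 1
jan1 <- filter(flights, month == 1, day == 1)
jan1
```

filter() returns a new data frame and leaves flights unchanged, so the result has to be assigned if it is needed later.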
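2.1 near()
near() tests whether two numbers are equal within a small tolerance, which is safer than == when comparing floating-point values.
2.2 Logical operators: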
- & is "and", | is "or", and ! is "not".
- x %in% y: This will select every row where x is one of the values in y.
- !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.
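# Flights that departed in November or December
filter(flights, month == 11 | month == 12)
# Equivalent shorthand using %in%
filter(flights, month %in% c(11, 12))
# Flights not delayed by more than two hours on arrival or departure
filter(flights, !(arr_delay > 120 | dep_delay > 120))
# The same condition rewritten with De Morgan's law
filter(flights, arr_delay <= 120, dep_delay <= 120)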
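The points in 2.1 and 2.2 can be checked on a small scale. This is only a sketch: toy is a hypothetical four-row tibble made up for illustration, and the variable names are not from the original notes.

```r
library(dplyr)

# Floating-point arithmetic is approximate, so == can fail unexpectedly
exact  <- sqrt(2)^2 == 2       # exact comparison
approx <- near(sqrt(2)^2, 2)   # comparison within a small tolerance

# Hypothetical tibble covering every combination of two logical values
toy <- tibble(
  x = c(TRUE, TRUE, FALSE, FALSE),
  y = c(TRUE, FALSE, TRUE, FALSE)
)

# De Morgan's law: !(x | y) selects the same rows as !x & !y
a <- filter(toy, !(x | y))
b <- filter(toy, !x, !y)
all.equal(a, b)
```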
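2.3 Missing values:
NA represents an unknown value, so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
To check whether a value is NA, use is.na().
filter() keeps only the rows where the condition is TRUE, so rows where it evaluates to NA are dropped unless you ask for them explicitly. Try it on a data frame df with a column x: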
filter(df, is.na(x) | x > 1)
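To make the call above runnable, df has to exist first. A minimal sketch, assuming a hypothetical three-row tibble with one missing value:

```r
library(dplyr)

# Hypothetical data: one of the three values is missing
df <- tibble(x = c(1, NA, 3))

# NA > 1 evaluates to NA, so the row with the missing value is dropped
gt1 <- filter(df, x > 1)

# Asking for missing values explicitly keeps that row as well
gt1_or_na <- filter(df, is.na(x) | x > 1)
```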
3. arrange()
Reorder rows
By default, rows are sorted in ascending order. Use desc() to sort in descending order. NA values are always sorted to the end:
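The sorting behavior can be sketched as follows. The small tibble df is hypothetical and only serves to show where NA ends up:

```r
library(nycflights13)
library(dplyr)

# Sort by year, then month, then day (ascending by default);
# later columns break ties in earlier ones
by_date <- arrange(flights, year, month, day)

# Sort by departure delay, largest first
most_delayed <- arrange(flights, desc(dep_delay))

# Missing values are placed last in both directions
df <- tibble(x = c(5, 2, NA))
asc  <- arrange(df, x)
dsc  <- arrange(df, desc(x))
```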
4. select()
Select variables (columns)
The flights object has 19 variables. You can select just the ones needed for later analysis:
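A few common ways to pick columns, shown as a sketch with illustrative result names:

```r
library(nycflights13)
library(dplyr)

# Select columns by name
ymd <- select(flights, year, month, day)

# Select all columns from year through day (inclusive)
ymd_range <- select(flights, year:day)

# Select every column except those from year through day
rest <- select(flights, -(year:day))
```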
There are a number of helper functions you can use within select():
- starts_with("abc"): matches names that begin with "abc".
- ends_with("xyz"): matches names that end with "xyz".
- contains("ijk"): matches names that contain "ijk".
- matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters.
- num_range("x", 1:3): matches x1, x2 and x3.
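The helpers can be tried on flights directly. Since flights has no numbered columns, num_range() is shown on a hypothetical tibble named toy:

```r
library(nycflights13)
library(dplyr)

# Columns whose names start with "dep"
dep_cols <- select(flights, starts_with("dep"))

# Columns whose names end with "delay"
delay_cols <- select(flights, ends_with("delay"))

# Columns whose names contain "time"
time_cols <- select(flights, contains("time"))

# Hypothetical tibble with numbered columns for num_range()
toy <- tibble(x1 = 1, x2 = 2, x3 = 3, y = 4)
x_cols <- select(toy, num_range("x", 1:3))
```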
rename() can be used to rename variables:
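A short sketch of the syntax, which is new_name = old_name. The new name tail_num is just an example:

```r
library(nycflights13)
library(dplyr)

# Rename tailnum to tail_num; all other columns are kept unchanged
renamed <- rename(flights, tail_num = tailnum)
names(renamed)
```

Unlike select(), which drops every column not mentioned, rename() keeps the full set of columns.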
5. mutate()
Add new variables
By default, new variables are added at the end of the data set:
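Because new columns go at the end, it helps to work on a narrower data set so they stay visible when printed. In this sketch, flights_sml, gain, and speed are illustrative names, and speed assumes distance is in miles and air_time in minutes:

```r
library(nycflights13)
library(dplyr)

# Keep a handful of columns so the new ones are easy to see
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)

# Add two derived columns; they are appended after the existing ones
with_new <- mutate(
  flights_sml,
  gain  = dep_delay - arr_delay,     # minutes made up in the air
  speed = distance / air_time * 60   # miles per hour
)
with_new
```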
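6. summarise()
Grouped summaries
summarise() collapses an entire data frame into a single row. It is usually combined with group_by() to compute one summary row per group: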
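The difference between the ungrouped and grouped forms can be sketched like this. The names overall, by_day, and daily are illustrative, and na.rm = TRUE is used so that missing delays do not turn the mean into NA:

```r
library(nycflights13)
library(dplyr)

# Without grouping: one row summarising the whole data frame
overall <- summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

# With grouping: one row per (year, month, day) combination
by_day <- group_by(flights, year, month, day)
daily  <- summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
daily
```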