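These notes cover chapters 9, 10, and 11 of the English edition.
Chapter 9 is the Introduction, so there is not much to say about it.
Chapter 11 covers import, which I'm not familiar with at all, so I don't have much to say about it either.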
10.2 Creating tibbles
- Convert a data frame with as_tibble
The more I use tibbles, the more I like them.
as_tibble(iris)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> # … with 144 more rows
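One small catch: as_tibble drops the row names during conversion.
This seems to be a consistent tidyverse design choice: no row names. The same goes for reading files with readr, merging data sets with left_join, and so on.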
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> mtcars %>%
+ as_tibble()
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
In that case you can turn the row names into a separate column
> mtcars %>%
+ as_tibble(rownames = "myrowname")
# A tibble: 32 x 12
myrowname mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Sportab… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
Alternatively:
> mtcars %>%
+ as_tibble(rownames = NA) %>%
+ rownames_to_column(var = "myrowname")
# A tibble: 32 x 12
myrowname mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Sportab… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
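A small sketch of checking for row names and moving them back and forth, using has_rownames() and column_to_rownames() from the tibble package. The column name "model" is just a name picked for this illustration.

```r
library(tibble)

# mtcars ships with base R and stores the car models as row names.
has_rownames(mtcars)                      # the base data frame has row names

# Keep the row names by storing them in a regular column.
cars_tbl <- as_tibble(mtcars, rownames = "model")
has_rownames(cars_tbl)                    # the tibble itself has none

# Going back: convert to a data frame and restore the row names.
cars_df <- column_to_rownames(as.data.frame(cars_tbl), var = "model")
identical(rownames(cars_df), rownames(mtcars))
```

This round trip is handy when an older function still expects row names.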
- Build a tibble from scratch with the tibble function
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
#> # A tibble: 5 x 3
#> x y z
#> <int> <dbl> <dbl>
#> 1 1 1 2
#> 2 2 1 5
#> 3 3 1 10
#> 4 4 1 17
#> 5 5 1 26
If you’re already familiar with data.frame(), note that tibble() does much less:
- it never changes the type of the inputs (e.g. it never converts strings to factors!), (as of R 4.0, data.frame() finally stopped converting strings to factors by default)
- it never changes the names of variables, (this presumably refers to cases like a column named 1, which data.frame() turns into X1)
> data.frame(`1` = 1:5, `2` = 1:5)
X1 X2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
> tibble(`1` = 1:5, `2` = 1:5)
# A tibble: 5 x 2
`1` `2`
<int> <int>
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
- and it never creates row names.
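For comparison, here is a sketch showing that base data.frame() can be told to leave names alone through its check.names argument, so the renaming is a default rather than something unavoidable. The object names are only for illustration.

```r
library(tibble)

df_default <- data.frame(`1` = 1:5, `2` = 1:5)                       # names get repaired
df_kept    <- data.frame(`1` = 1:5, `2` = 1:5, check.names = FALSE)  # names kept as given
tb         <- tibble(`1` = 1:5, `2` = 1:5)                           # tibble never renames

names(df_default)  # "X1" "X2"
names(df_kept)     # "1" "2"
names(tb)          # "1" "2"
```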
It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names. For example, they might not start with a letter, or they might contain unusual characters like a space. To refer to these variables, you need to surround them with backticks, `:
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
#> # A tibble: 1 x 3
#> `:)` ` ` `2000`
#> <chr> <chr> <chr>
#> 1 smile space number
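A short sketch of pulling values out of such columns: with $ the name needs backticks, while [[ takes the name as an ordinary string.

```r
library(tibble)

tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)

tb$`:)`        # backticks are required after $
tb[["2000"]]   # with [[ the name is a plain string, no backticks needed
tb[[" "]]      # even a single space works as a name
```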
- Create a tibble with tribble()
Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form.
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
#> # A tibble: 2 x 3
#> x y z
#> <chr> <dbl> <dbl>
#> 1 a 2 3.6
#> 2 b 1 8.5
I often add a comment (the line starting with #), to make it really clear where the header is.
tribble can also pull off the following neat trick
# tribble will create a list column if the value in any cell is
# not a scalar
tribble(
  ~x, ~y,
  "a", 1:3,
  "b", 4:6
)
#> # A tibble: 2 x 2
#> x y
#> <chr> <list>
#> 1 a <int [3]>
#> 2 b <int [3]>
Doing the same trick with tibble instead of tribble
> data.frame(x = c("a","b"),
+ y = I(list(1:3,4:6))) %>%
+ as_tibble()
# A tibble: 2 x 2
x y
<fct> <I<list>>
1 a <int [3]>
2 b <int [3]>
> tibble(x = c("a","b"),
+ y = list(1:3,4:6))
# A tibble: 2 x 2
x y
<chr> <list>
1 a <int [3]>
2 b <int [3]>
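Once you have a list column, a sketch like the following shows how to get at its contents with base R only; each cell is an ordinary vector stored inside a list.

```r
library(tibble)

tb <- tribble(
  ~x,  ~y,
  "a", 1:3,
  "b", 4:6
)

lengths(tb$y)                  # number of elements in each cell
vapply(tb$y, sum, integer(1))  # one summary value per row
tb$y[[2]]                      # the full vector stored in the second row
```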
This reminds me of a fun package, usethis
library(usethis)
# Want to print friendly output to a user in a package (or to yourself in your own code?)
# The usethis ui_*() functions are perfect!
# Use ui_done() when something is done, like a file saved
ui_done("File saved at...")
## ✓ File saved at...
# ui_todo() is useful when you need your user to pay attention and do something!
ui_todo("Changes have been made, please review them!")
## ● Changes have been made, please review them!
# ui_oops() when something went wrong
ui_oops("That should not have happened")
## x That should not have happened
10.3 Tibbles vs. data.frame
There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting.
- Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str():
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
#> # A tibble: 1,000 x 5
#> a b c d e
#> <dttm> <date> <int> <dbl> <chr>
#> 1 2020-01-15 20:43:23 2020-01-22 1 0.368 n
#> 2 2020-01-16 14:48:32 2020-01-27 2 0.612 l
#> 3 2020-01-16 09:12:12 2020-02-06 3 0.415 p
#> 4 2020-01-15 22:33:29 2020-02-05 4 0.212 m
#> 5 2020-01-15 18:57:45 2020-02-02 5 0.733 i
#> 6 2020-01-16 05:58:42 2020-01-29 6 0.460 n
#> # … with 994 more rows
First, you can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf will display all columns:
nycflights13::flights %>%
print(n = 10, width = Inf)
You can also control the default print behaviour by setting options:
- options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.
- Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.
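A sketch of changing these options temporarily and restoring them afterwards. It assumes your installed tibble version still honours the tibble.* option names quoted above; the tibble `big` is made up for the illustration.

```r
library(tibble)

big <- tibble(id = 1:100, value = id^2)

# options() returns the previous values, so they can be restored later.
old <- options(tibble.print_max = 20, tibble.print_min = 5)

print(big)          # more rows than print_max, so only print_min rows are shown
print(big, n = 3)   # an explicit n overrides the options for this one call

options(old)        # put the previous settings back
```

Saving the return value of options() avoids leaving global settings changed for the rest of the session.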
- Subsetting
So far all the tools you’ve learned have worked with complete data frames. If you want to pull out a single variable, you need some new tools, $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.
df <- tibble(
x = runif(5),
y = rnorm(5)
)
# Extract by name
df$x
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
df[["x"]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
# Extract by position
df[[1]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
To use these in a pipe, you’ll need to use the special placeholder .:
df %>% .$x
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
df %>% .[["x"]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
The same works with a data.frame
df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)
> df %>%
+ .$x
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
> df %>%
+ .[["x"]]
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
> df %>%
+ .[[1]]
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
# 2. In fact, you can also write it like this
df %>% "[["("x")
Compared to a data.frame, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
An example of partial matching
df1 <- data.frame(xyz = "a")
df2 <- tibble(xyz = "a")
str(df1$x)
#> Factor w/ 1 level "a": 1
str(df2$x)
#> Warning: Unknown or uninitialised column: 'x'.
#> NULL
Another example of partial matching
$ is a shorthand operator: x$y is roughly equivalent to x[["y"]]. It’s often used to access variables in a data frame, as in mtcars$cyl or diamonds$carat. One common mistake with $ is to use it when you have the name of a column stored in a variable:
var <- "cyl"
# Doesn't work - mtcars$var translated to mtcars[["var"]]
mtcars$var
#> NULL
# Instead use [[
mtcars[[var]]
#> [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
The one important difference between $ and [[ is that $ does (left-to-right) partial matching:
x <- list(abc = 1)
x$a
#> [1] 1
x[["a"]]
#> NULL
To help avoid this behaviour I highly recommend setting the global option warnPartialMatchDollar to TRUE:
options(warnPartialMatchDollar = TRUE)
x$a
#> Warning in x$a: partial match of 'a' to 'abc'
#> [1] 1
(For data frames, you can also avoid this problem by using tibbles, which never do partial matching.)
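To round this off, a sketch of the three behaviours side by side. Base R's [[ takes an `exact` argument, so partial matching on a list is something you can opt into explicitly rather than get by accident.

```r
x <- list(abc = 1)

x$a                       # $ matches partially: finds abc
x[["a"]]                  # [[ is exact by default: NULL
x[["a", exact = FALSE]]   # explicit opt-in to partial matching

# A tibble never matches partially; it warns and returns NULL instead.
tb <- tibble::tibble(abc = 1)
res <- suppressWarnings(tb$a)
is.null(res)
```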
10.4 Interacting with older code
Some older functions don’t work with tibbles. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame:
class(as.data.frame(tb))
#> [1] "data.frame"
The main reason that some older functions don’t work with tibble is the [ function. We don’t use [ much in this book because dplyr::filter() and dplyr::select() allow you to solve the same problems with clearer code (but you will learn a little about it in vector subsetting). With base R data frames, [ sometimes returns a data frame, and sometimes returns a vector. With tibbles, [ always returns another tibble.
With a data.frame, selecting a single column with [ automatically converts the result to a vector
df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)
> df[, "x"]
[1] 0.02206585 0.98926964 0.95333742 0.79946273 0.19327569
> df[, c("x","y")]
x y
1 0.02206585 -1.32245311
2 0.98926964 0.59576966
3 0.95333742 0.03922984
4 0.79946273 1.09332833
5 0.19327569 0.88358188
# If you want to prevent this behaviour,
# add drop = FALSE
> df[, "x", drop = FALSE]
x
1 0.55692342
2 0.06739173
3 0.08648150
4 0.84341912
5 0.93941534
With a tibble, however
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
> df[, "x"]
# A tibble: 5 x 1
x
<dbl>
1 0.422
2 0.519
3 0.881
4 0.114
5 0.956
> df[, c("x","y")]
# A tibble: 5 x 2
x y
<dbl> <dbl>
1 0.422 -1.03
2 0.519 0.605
3 0.881 0.414
4 0.114 0.820
5 0.956 -0.391
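If you do want a plain vector out of a tibble, here is a sketch of the options I am aware of: [[ works the same on both classes, and tibble's [ accepts drop = TRUE when you ask for it explicitly. Fixed values are used instead of random ones so the results are reproducible.

```r
library(tibble)

tb <- tibble(x = c(0.1, 0.2, 0.3), y = c("a", "b", "c"))

class(tb[, "x"])          # still a tibble
tb[["x"]]                 # plain vector
tb[, "x", drop = TRUE]    # plain vector, because drop was requested

# Converting back for older code restores the base behaviour.
df <- as.data.frame(tb)
class(df[, "x"])          # "numeric": a single column is dropped to a vector
```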
- Exercise 10.1
How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).
You can use the function is_tibble() to check whether a data frame is a tibble or not. The mtcars data frame is not a tibble.
is_tibble(mtcars)
#> [1] FALSE
But the diamonds and flights data are tibbles.
is_tibble(ggplot2::diamonds)
#> [1] TRUE
is_tibble(nycflights13::flights)
#> [1] TRUE
is_tibble(as_tibble(mtcars))
#> [1] TRUE
More generally, you can use the class() function to find out the class of an object. Tibbles have the classes c("tbl_df", "tbl", "data.frame"), while old data frames will only have the class "data.frame".
class(mtcars)
#> [1] "data.frame"
class(ggplot2::diamonds)
#> [1] "tbl_df" "tbl" "data.frame"
class(nycflights13::flights)
#> [1] "tbl_df" "tbl" "data.frame"
If you are interested in reading more on R’s classes, read the chapters on object oriented programming in Advanced R.
I gave up on Advanced R when I reached the object oriented programming chapters, but it really is extremely well written.
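A minimal sketch of doing the class check in code with base R's inherits(). The helper describe_table() is a hypothetical name made up for this note, not a tibble function.

```r
library(tibble)

# Hypothetical helper: report which kind of table-like object x is.
describe_table <- function(x) {
  if (inherits(x, "tbl_df")) {
    "tibble"
  } else if (is.data.frame(x)) {
    "data.frame"
  } else {
    "other"
  }
}

describe_table(mtcars)             # "data.frame"
describe_table(as_tibble(mtcars))  # "tibble"
describe_table(1:3)                # "other"
```

The tibble check has to come first, because a tibble also inherits from "data.frame".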
- Exercise 10.2
Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviors cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df$x
#> [1] a
#> Levels: a
df[, "xyz"]
#> [1] a
#> Levels: a
df[, c("abc", "xyz")]
#> abc xyz
#> 1 1 a
tbl <- as_tibble(df)
tbl$x
#> Warning: Unknown or uninitialised column: 'x'.
#> NULL
tbl[, "xyz"]
#> # A tibble: 1 x 1
#> xyz
#> <fct>
#> 1 a
tbl[, c("abc", "xyz")]
#> # A tibble: 1 x 2
#> abc xyz
#> <dbl> <fct>
#> 1 1 a
The $ operator will match any column name that starts with the name following it. Since there is a column named xyz, the expression df$x will be expanded to df$xyz. This behavior of the $ operator saves a few keystrokes, but it can result in accidentally using a different column than you thought you were using.
With data.frames, the type of object that [ returns depends on the number of columns selected. If it is one column, it won’t return a data.frame, but instead will return a vector. With more than one column, it will return a data.frame. This is fine if you know what you are passing in, but suppose you did df[ , vars] where vars was a variable. Then what that code does depends on length(vars) and you’d have to write code to account for those situations or risk bugs.
The above is the answer from the solutions.
To sum up, the differences between the data.frame and tibble results above come down to two data.frame behaviours: the $ operator does partial matching, and the [ operator automatically drops the result to a one-dimensional vector when a single column is selected.
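A sketch of guarding against the length(vars) problem on a plain data.frame. select_cols() is a hypothetical helper written for this note; the only real mechanism used is the drop = FALSE argument of [.

```r
# Hypothetical helper: always return a data frame, however many columns are asked for.
select_cols <- function(d, vars) {
  d[, vars, drop = FALSE]
}

df <- data.frame(abc = 1, xyz = "a")

is.data.frame(df[, "xyz"])                      # FALSE: silently became a vector
is.data.frame(select_cols(df, "xyz"))           # TRUE
is.data.frame(select_cols(df, c("abc", "xyz"))) # TRUE
```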
- Exercise 10.3
If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?
You can use the double bracket, like df[[var]]. You cannot use the dollar sign, because df$var would look for a column named var.
This feature may not look very useful in isolation, but it should come in very handy when writing loops
Let's try it out
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
a <- "x"
> df[a]
# A tibble: 5 x 1
x
<dbl>
1 0.617
2 0.971
3 0.866
4 0.105
5 0.0429
> df[, a]
# A tibble: 5 x 1
x
<dbl>
1 0.617
2 0.971
3 0.866
4 0.105
5 0.0429
> df[[a]]
[1] 0.61743318 0.97105570 0.86600921 0.10470175 0.04291076
It seems to work with a data.frame as well
df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)
a <- "x"
> df[a]
x
1 0.5506573
2 0.2944493
3 0.7896432
4 0.6288798
5 0.6678818
> df[, a]
[1] 0.5506573 0.2944493 0.7896432 0.6288798 0.6678818
> df[[a]]
[1] 0.5506573 0.2944493 0.7896432 0.6288798 0.6678818
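Here is a sketch of the loop use case mentioned above: iterating over column names held in a variable and extracting each column with [[. The data are fixed numbers chosen so the result is easy to check by hand.

```r
library(tibble)

df <- tibble(x = c(1, 2, 3), y = c(10, 20, 30))

# Collect one mean per column, keyed by the column name.
col_means <- numeric(0)
for (var in names(df)) {
  col_means[[var]] <- mean(df[[var]])   # df$var would look for a column called "var"
}

col_means   # named vector: x = 2, y = 20
```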
- Exercise 10.4
Practice referring to non-syntactic names in the following data frame by:
- Extracting the variable called 1.
- Plotting a scatterplot of 1 vs 2.
- Creating a new column called 3 which is 2 divided by 1.
- Renaming the columns to one, two and three.
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)
To extract the variable named 1:
annoying[["1"]]
#> [1] 1 2 3 4 5 6 7 8 9 10
or
annoying$`1`
#> [1] 1 2 3 4 5 6 7 8 9 10
Plotting a scatterplot of 1 vs 2.
ggplot(annoying, aes(x = `1`, y = `2`)) +
geom_point()
To add a new column 3 which is 2 divided by 1:
mutate(annoying, `3` = `2` / `1`)
#> # A tibble: 10 x 3
#> `1` `2` `3`
#> <int> <dbl> <dbl>
#> 1 1 0.600 0.600
#> 2 2 4.26 2.13
#> 3 3 3.56 1.19
#> 4 4 7.99 2.00
#> 5 5 10.6 2.12
#> 6 6 13.1 2.19
#> # … with 4 more rows
or
annoying[["3"]] <- annoying$`2` / annoying$`1`
or
annoying[["3"]] <- annoying[["2"]] / annoying[["1"]]
To rename the columns to one, two, and three, run:
annoying <- rename(annoying, one = `1`, two = `2`, three = `3`)
glimpse(annoying)
#> Observations: 10
#> Variables: 3
#> $ one <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> $ two <dbl> 0.60, 4.26, 3.56, 7.99, 10.62, 13.15, 12.18, 15.75, 17.76,…
#> $ three <dbl> 0.60, 2.13, 1.19, 2.00, 2.12, 2.19, 1.74, 1.97, 1.97, 1.97
Picked up a new function, glimpse (from the tibble package)
This is like a transposed version of print(): columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It's a little like str() applied to a data frame but it tries to show you as much data as possible. (And it always shows the underlying data, even when applied to a remote data source.)
> glimpse(mtcars)
Observations: 32
Variables: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, …
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, …
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 1…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.9…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, …
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, …
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, …
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, …
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, …
- Exercise 10.5
What does tibble::enframe() do? When might you use it?
The function tibble::enframe() converts named vectors to a data frame with names and values.
enframe(c(a = 1, b = 2, c = 3))
#> # A tibble: 3 x 2
#> name value
#> <chr> <dbl>
#> 1 a 1
#> 2 b 2
#> 3 c 3
enframe also has a counterpart, deframe
From Converting vectors to data frames, and vice versa
enframe(1:3)
#> # A tibble: 3 x 2
#> name value
#> <int> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 3
enframe(c(a = 5, b = 7))
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 a 5
#> 2 b 7
# This result should look much like the tribble example above
enframe(list(one = 1, two = 2:3, three = 4:6))
#> # A tibble: 3 x 2
#> name value
#> <chr> <list>
#> 1 one <dbl [1]>
#> 2 two <int [2]>
#> 3 three <int [3]>
tribble(
  ~name, ~value,
  "one", 1,
  "two", 2:3,
  "three", 4:6
)
# A tibble: 3 x 2
name value
<chr> <list>
1 one <dbl [1]>
2 two <int [2]>
3 three <int [3]>
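Since deframe is mentioned but not shown, here is a sketch of the round trip between a named vector and a two-column tibble. The custom column names "key" and "val" are arbitrary choices for the illustration.

```r
library(tibble)

v <- c(a = 1, b = 2, c = 3)

tb <- enframe(v)              # named vector -> tibble with name/value columns
back <- deframe(tb)           # tibble -> named vector again
identical(back, v)

# The column names can be changed through the name and value arguments.
custom <- enframe(v, name = "key", value = "val")
names(custom)
```

This is useful when something like table() or sapply() hands you a named vector and you want to continue with data frame tools.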