R_for_Data_Science_Tibble&Import

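These notes cover chapters 9, 10, and 11 of the English edition.

Chapter 9 is the Introduction, so there is not much to say about it.

Chapter 11 covers data import, which I am not familiar with at all, so I have nothing to add there either.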

10.2 Creating tibbles

Convert a data frame to a tibble with as_tibble().

The more I use tibbles, the more I like them.

as_tibble(iris)
#> # A tibble: 150 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> 5          5           3.6          1.4         0.2 setosa 
#> 6          5.4         3.9          1.7         0.4 setosa 
#> # … with 144 more rows

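There is one small catch: as_tibble() drops the row names during conversion.

This seems to be a consistent tidyverse design choice: no row names. The same applies when reading files with readr, merging data sets with left_join(), and so on.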

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1


> mtcars %>% 
+   as_tibble()
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows

In that case you can turn the row names into a separate column:

> mtcars %>% 
+   as_tibble(rownames = "myrowname") 
# A tibble: 32 x 12
   myrowname         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>           <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4        21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 Wag    21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710       22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 Drive   21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Sportab…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant          18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360       14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D        24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230         22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280         19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows

Alternatively, keep the row names with rownames = NA and convert them afterwards:

> mtcars %>% 
+   as_tibble(rownames = NA) %>% 
+   rownames_to_column(var = "myrowname")
# A tibble: 32 x 12
   myrowname         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>           <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4        21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 Wag    21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710       22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 Drive   21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Sportab…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant          18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360       14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D        24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230         22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280         19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows
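The conversion also works in the other direction. Below is a minimal sketch of a round trip, assuming only the tibble package is loaded; the column name "model" is just an illustrative choice, and column_to_rownames() hands back a plain data.frame because tibbles do not carry row names.

```r
library(tibble)

# Keep the row names of mtcars as an ordinary column called "model".
cars_tbl <- as_tibble(mtcars, rownames = "model")

# Go back: move the "model" column into the row names of a data.frame.
cars_df <- column_to_rownames(as.data.frame(cars_tbl), var = "model")

# The row names survive the round trip.
identical(rownames(cars_df), rownames(mtcars))
```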

Build a tibble from scratch with tibble():

tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)
#> # A tibble: 5 x 3
#>       x     y     z
#>   <int> <dbl> <dbl>
#> 1     1     1     2
#> 2     2     1     5
#> 3     3     1    10
#> 4     4     1    17
#> 5     5     1    26

If you’re already familiar with data.frame(), note that tibble() does much less:

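it never changes the type of the inputs (e.g. it never converts strings to factors!), (since R 4.0, data.frame() finally no longer converts strings to factors by default either)

it never changes the names of variables, (this presumably means that with data.frame() a column named 1 is renamed to X1)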
> data.frame(`1` = 1:5, `2` = 1:5)
  X1 X2
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5

> tibble(`1` = 1:5, `2` = 1:5)
# A tibble: 5 x 2
    `1`   `2`
  <int> <int>
1     1     1
2     2     2
3     3     3
4     4     4
5     5     5
and it never creates row names.
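As a side note to the renaming point, base R can be told not to touch the names. This is a small sketch, assuming the tibble package is loaded, that uses the check.names argument of data.frame() and compares the result with tibble().

```r
library(tibble)

# check.names = FALSE stops data.frame() from rewriting `1` to X1.
df <- data.frame(`1` = 1:5, `2` = 1:5, check.names = FALSE)
names(df)

# tibble() keeps the names as given without any extra argument.
tb <- tibble(`1` = 1:5, `2` = 1:5)
identical(names(df), names(tb))
```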

It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names. For example, they might not start with a letter, or they might contain unusual characters like a space. To refer to these variables, you need to surround them with backticks, `:

tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb
#> # A tibble: 1 x 3
#>   `:)`  ` `   `2000`
#>   <chr> <chr> <chr> 
#> 1 smile space number
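To make the backtick rule concrete, here is a short sketch that pulls values back out of the same tibble. It assumes the tibble package is loaded and rebuilds tb so the snippet runs on its own.

```r
library(tibble)

tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)

# With $ the non-syntactic name must be wrapped in backticks.
tb$`:)`

# With [[ the name is an ordinary string, so no backticks are needed.
tb[["2000"]]
tb[[" "]]
```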

Another way to create a tibble is with tribble()

Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form.

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
#> # A tibble: 2 x 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 a         2   3.6
#> 2 b         1   8.5

I often add a comment (the line starting with #), to make it really clear where the header is.

tribble() can also pull off the following neat trick:

# tribble will create a list column if the value in any cell is
# not a scalar
tribble(
~x,  ~y,
"a", 1:3,
"b", 4:6
)
#> # A tibble: 2 x 2
#>   x     y        
#>   <chr> <list>   
#> 1 a     <int [3]>
#> 2 b     <int [3]>

Reference:

Row-wise tibble creation

The same trick as the tribble() example above, done with data.frame() and tibble():

> data.frame(x = c("a","b"),
+            y = I(list(1:3,4:6))) %>% 
+   as_tibble()
# A tibble: 2 x 2
  x     y        
  <fct> <I<list>>
1 a     <int [3]>
2 b     <int [3]>

> tibble(x = c("a","b"),
+        y = list(1:3,4:6))
# A tibble: 2 x 2
  x     y        
  <chr> <list>   
1 a     <int [3]>
2 b     <int [3]>

Reference:

Create a data.frame where a column is a list

This reminds me of an interesting package, usethis:

library(usethis)

# Want to print friendly output to a user in a package (or to yourself in your own code?)
# The usethis ui_*() functions are perfect!

# Use ui_done() when something is done, like a file saved
ui_done("File saved at...")

## ✓ File saved at...

# ui_todo() is useful when you need your user to pay attention and do something!
ui_todo("Changes have been made, please review them!")

## ● Changes have been made, please review them!

# ui_oops() when something went wrong
ui_oops("That should not have happened")

## x That should not have happened

Reference:

usethis::ui_done() - i know this one!

10.3 Tibbles vs. data.frame

There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting.

Printing

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str():

tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
#> # A tibble: 1,000 x 5
#>   a                   b              c     d e    
#>   <dttm>              <date>     <int> <dbl> <chr>
#> 1 2020-01-15 20:43:23 2020-01-22     1 0.368 n    
#> 2 2020-01-16 14:48:32 2020-01-27     2 0.612 l    
#> 3 2020-01-16 09:12:12 2020-02-06     3 0.415 p    
#> 4 2020-01-15 22:33:29 2020-02-05     4 0.212 m    
#> 5 2020-01-15 18:57:45 2020-02-02     5 0.733 i    
#> 6 2020-01-16 05:58:42 2020-01-29     6 0.460 n    
#> # … with 994 more rows

First, you can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf will display all columns:

nycflights13::flights %>% 
  print(n = 10, width = Inf)

You can also control the default print behaviour by setting options:

options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.
Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.
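Here is a rough sketch of setting these options for a session and then putting the old values back. It assumes the tibble package is loaded; the values 20 and 5 are arbitrary. Newer tibble releases document the same settings under pillar.* names, and I am assuming the tibble.* names shown in the book are still honoured.

```r
library(tibble)

# options() returns the previous values, so they can be restored later.
old <- options(
  tibble.print_max = 20,  # tables with more rows than this are truncated
  tibble.print_min = 5,   # number of rows shown when a table is truncated
  tibble.width = Inf      # always print every column
)

print(as_tibble(mtcars))        # 32 rows, more than print_max
print(as_tibble(iris[1:15, ]))  # 15 rows, within print_max

# Restore whatever was set before.
options(old)
```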

Subsetting

So far all the tools you’ve learned have worked with complete data frames. If you want to pull out a single variable, you need some new tools, $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

# Extract by name
df$x
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
df[["x"]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605

# Extract by position
df[[1]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605

To use these in a pipe, you’ll need to use the special placeholder .:

df %>% .$x
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
df %>% .[["x"]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605

This works with a data.frame too:

df <- data.frame(
x = runif(5),
y = rnorm(5)
)

> df %>% 
+   .$x
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
> df %>% 
+   .[["x"]]
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
> df %>% 
+   .[[1]]
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451

# In fact, you can also call [[ as a function in the pipe
df %>% 
  "[["("x")

Compared to a data.frame, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.

An example of partial matching:

df1 <- data.frame(xyz = "a")
df2 <- tibble(xyz = "a")

str(df1$x)
#>  Factor w/ 1 level "a": 1
str(df2$x)
#> Warning: Unknown or uninitialised column: 'x'.
#>  NULL

Reference:

Advanced_R-3.6.4 Subsetting

Another example of partial matching:

$ is a shorthand operator: x$y is roughly equivalent to x[["y"]]. It’s often used to access variables in a data frame, as in mtcars$cyl or diamonds$carat. One common mistake with $ is to use it when you have the name of a column stored in a variable:
var <- "cyl"
# Doesn't work - mtcars$var translated to mtcars[["var"]]
mtcars$var
#> NULL

# Instead use [[
mtcars[[var]]
#>  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
The one important difference between $ and [[ is that $ does (left-to-right) partial matching:
x <- list(abc = 1)
x$a
#> [1] 1
x[["a"]]
#> NULL
To help avoid this behaviour I highly recommend setting the global option warnPartialMatchDollar to TRUE:
options(warnPartialMatchDollar = TRUE)
x$a
#> Warning in x$a: partial match of 'a' to 'abc'
#> [1] 1
(For data frames, you can also avoid this problem by using tibbles, which never do partial matching.)

Reference:

Advanced_R-4.3.2 $

10.4 Interacting with older code

Some older functions don’t work with tibbles. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame:

class(as.data.frame(tb))
#> [1] "data.frame"

The main reason that some older functions don’t work with tibble is the [ function. We don’t use [ much in this book because dplyr::filter() and dplyr::select() allow you to solve the same problems with clearer code (but you will learn a little about it in vector subsetting). With base R data frames, [ sometimes returns a data frame, and sometimes returns a vector. With tibbles, [ always returns another tibble.

With a data.frame, selecting a single column with [ automatically simplifies the result to a vector:

df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)

> df[, "x"]
[1] 0.02206585 0.98926964 0.95333742 0.79946273 0.19327569
> df[, c("x","y")]
           x           y
1 0.02206585 -1.32245311
2 0.98926964  0.59576966
3 0.95333742  0.03922984
4 0.79946273  1.09332833
5 0.19327569  0.88358188

# To prevent this behaviour,
# add drop = FALSE
> df[, "x", drop = FALSE]
           x
1 0.55692342
2 0.06739173
3 0.08648150
4 0.84341912
5 0.93941534

With a tibble, however, [ always returns a tibble:

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

> df[, "x"]
# A tibble: 5 x 1
      x
  <dbl>
1 0.422
2 0.519
3 0.881
4 0.114
5 0.956
> df[, c("x","y")]
# A tibble: 5 x 2
      x      y
  <dbl>  <dbl>
1 0.422 -1.03 
2 0.519  0.605
3 0.881  0.414
4 0.114  0.820
5 0.956 -0.391

Exercise 10.1

How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).

You can use the function is_tibble() to check whether a data frame is a tibble or not. The mtcars data frame is not a tibble.

is_tibble(mtcars)
#> [1] FALSE

But the diamonds and flights data are tibbles.

is_tibble(ggplot2::diamonds)
#> [1] TRUE
is_tibble(nycflights13::flights)
#> [1] TRUE
is_tibble(as_tibble(mtcars))
#> [1] TRUE

More generally, you can use the class() function to find out the class of an object. Tibbles have the classes c("tbl_df", "tbl", "data.frame"), while old data frames will only have the class "data.frame".

class(mtcars)
#> [1] "data.frame"
class(ggplot2::diamonds)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(nycflights13::flights)
#> [1] "tbl_df"     "tbl"        "data.frame"

If you are interested in reading more on R’s classes, read the chapters on object oriented programming in Advanced R.

Although I gave up on Advanced R when I reached the object oriented programming chapters, it really is extremely well written.
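Coming back to the class vector: since a tibble carries both "tbl_df" and "data.frame", base R's inherits() can test for either. A small sketch, assuming the tibble package is loaded:

```r
library(tibble)

cars_tbl <- as_tibble(mtcars)

# A plain data frame does not have the tibble class.
inherits(mtcars, "tbl_df")

# A tibble has the tibble class and is still a data.frame.
inherits(cars_tbl, "tbl_df")
inherits(cars_tbl, "data.frame")
```

This may explain why most code written for data frames still accepts tibbles: dispatch falls through to the data.frame methods whenever no tibble-specific method exists.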

Exercise 10.2

Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviors cause you frustration?

df <- data.frame(abc = 1, xyz = "a")
df$x
#> [1] a
#> Levels: a
df[, "xyz"]
#> [1] a
#> Levels: a
df[, c("abc", "xyz")]
#>   abc xyz
#> 1   1   a

tbl <- as_tibble(df)
tbl$x
#> Warning: Unknown or uninitialised column: 'x'.
#> NULL
tbl[, "xyz"]
#> # A tibble: 1 x 1
#>   xyz  
#>   <fct>
#> 1 a
tbl[, c("abc", "xyz")]
#> # A tibble: 1 x 2
#>     abc xyz  
#>   <dbl> <fct>
#> 1     1 a

The $ operator will match any column name that starts with the name following it. Since there is a column named xyz, the expression df$x will be expanded to df$xyz. This behavior of the $ operator saves a few keystrokes, but it can result in accidentally using a different column than you thought you were using.

With data.frames, with [ the type of object that is returned differs on the number of columns. If it is one column, it won’t return a data.frame, but instead will return a vector. With more than one column, then it will return a data.frame. This is fine if you know what you are passing in, but suppose you did df[ , vars] where vars was a variable. Then what that code does depends on length(vars) and you’d have to write code to account for those situations or risk bugs.
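One way to sidestep the length(vars) problem with base data frames is to always pass drop = FALSE. The helper below is hypothetical (select_cols() is my own name, not something from the book or a package) and only sketches the idea.

```r
# select_cols() is a hypothetical helper: it always returns a data.frame,
# no matter how many column names are passed in.
select_cols <- function(df, vars) {
  df[, vars, drop = FALSE]
}

df <- data.frame(abc = 1, xyz = "a")

class(select_cols(df, "xyz"))            # one column
class(select_cols(df, c("abc", "xyz")))  # two columns
```

Both calls return a data.frame, so downstream code does not need a special case for a single column.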

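The above is the answer from the solutions.

To sum up, the differences seen above between a data.frame and a tibble come down to two data.frame behaviours: the $ operator does partial matching, and [ automatically drops the result to a one-dimensional vector when a single column is selected.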

Exercise 10.3

If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?

You can use the double bracket, like df[[var]]. You cannot use the dollar sign, because df$var would look for a column named var.

This feature may not look very useful on its own, but it should come in handy when writing loops.

Let's try it out:

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)


a <- "x"

df[a]
# A tibble: 5 x 1
       x
   <dbl>
1 0.617 
2 0.971 
3 0.866 
4 0.105 
5 0.0429
> df[, a]
# A tibble: 5 x 1
       x
   <dbl>
1 0.617 
2 0.971 
3 0.866 
4 0.105 
5 0.0429
> df[[a]]
[1] 0.61743318 0.97105570 0.86600921 0.10470175 0.04291076

This seems to work with a data.frame as well:

df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)


a <- "x"
> df[a]
          x
1 0.5506573
2 0.2944493
3 0.7896432
4 0.6288798
5 0.6678818
> df[, a]
[1] 0.5506573 0.2944493 0.7896432 0.6288798 0.6678818
> df[[a]]
[1] 0.5506573 0.2944493 0.7896432 0.6288798 0.6678818
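To illustrate the loop use case mentioned above, here is a minimal sketch that walks over the column names and extracts each column with [[. It assumes the tibble package is loaded; the fixed values replace the random ones so the result is predictable.

```r
library(tibble)

df <- tibble(
  x = c(1, 2, 3),
  y = c(10, 20, 30)
)

# Collect one mean per column; the column name is stored in `var`.
col_means <- numeric(0)
for (var in names(df)) {
  col_means[var] <- mean(df[[var]])
}

col_means
```

Using df$var inside the loop would look for a column literally named var, which is why [[ is the right tool here.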

Exercise 10.4

Practice referring to non-syntactic names in the following data frame by:

Extracting the variable called 1.

Plotting a scatterplot of 1 vs 2.

Creating a new column called 3 which is 2 divided by 1.

Renaming the columns to one, two and three.
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

To extract the variable named 1:

annoying[["1"]]
#>  [1]  1  2  3  4  5  6  7  8  9 10

annoying$`1`
#>  [1]  1  2  3  4  5  6  7  8  9 10

Plotting a scatterplot of 1 vs 2.

ggplot(annoying, aes(x = `1`, y = `2`)) +
  geom_point()

To add a new column 3 which is 2 divided by 1:

mutate(annoying, `3` = `2` / `1`)
#> # A tibble: 10 x 3
#>     `1`    `2`   `3`
#>   <int>  <dbl> <dbl>
#> 1     1  0.600 0.600
#> 2     2  4.26  2.13 
#> 3     3  3.56  1.19 
#> 4     4  7.99  2.00 
#> 5     5 10.6   2.12 
#> 6     6 13.1   2.19 
#> # … with 4 more rows

annoying[["3"]] <- annoying$`2` / annoying$`1`

annoying[["3"]] <- annoying[["2"]] / annoying[["1"]]

To rename the columns to one, two, and three, run:

annoying <- rename(annoying, one = `1`, two = `2`, three = `3`)
glimpse(annoying)
#> Observations: 10
#> Variables: 3
#> $ one   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> $ two   <dbl> 0.60, 4.26, 3.56, 7.99, 10.62, 13.15, 12.18, 15.75, 17.76,…
#> $ three <dbl> 0.60, 2.13, 1.19, 2.00, 2.12, 2.19, 1.74, 1.97, 1.97, 1.97

Picked up a new function here: glimpse() (from the tibble package).

This is like a transposed version of print(): columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It's a little like str() applied to a data frame but it tries to show you as much data as possible. (And it always shows the underlying data, even when applied to a remote data source.)

> glimpse(mtcars)
Observations: 32
Variables: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, …
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, …
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 1…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.9…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, …
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, …
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, …
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, …
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, …

Exercise 10.5

What does tibble::enframe() do? When might you use it?

The function tibble::enframe() converts named vectors to a data frame with names and values

enframe(c(a = 1, b = 2, c = 3))
#> # A tibble: 3 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 a         1
#> 2 b         2
#> 3 c         3

enframe() also has a counterpart, deframe().

From "Converting vectors to data frames, and vice versa":

enframe(1:3)
#> # A tibble: 3 x 2
#>    name value
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     3     3

enframe(c(a = 5, b = 7))
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 a         5
#> 2 b         7

# This should give much the same result as the tribble() example above
enframe(list(one = 1, two = 2:3, three = 4:6))
#> # A tibble: 3 x 2
#>   name  value    
#>   <chr> <list>   
#> 1 one   <dbl [1]>
#> 2 two   <int [2]>
#> 3 three <int [3]>

tribble(
  ~name, ~value,
  "one", 1,
  "two", 2:3,
  "three", 4:6
)
# A tibble: 3 x 2
  name  value    
  <chr> <list>   
1 one   <dbl [1]>
2 two   <int [2]>
3 three <int [3]>
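deframe(), mentioned above as the counterpart of enframe(), is not shown in these examples, so here is a small round-trip sketch. It assumes the tibble package is loaded and relies on deframe() using the first column as names and the second as values.

```r
library(tibble)

v <- c(a = 5, b = 7)

# Named vector -> two-column tibble (name, value).
tbl <- enframe(v)

# Two-column tibble -> named vector again.
back <- deframe(tbl)

all.equal(back, v)
```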