R语言中的attach, transform, mutate 和 within的比较

Comparing Transformation Styles: attach, transform, mutate and within

Posted on January 22, 2013 by Bob Muenchen

There are several ways to perform data transformations in R. Each has its own set of advantages and disadvantages. Let’s take one variable, square it and add 100. How many ways might an R beginner screw up such a simple computation? Quite a few!

Here’s a data frame with one variable:

> mydata <- data.frame(x = 1:5)
> mydata

  x
1 1
2 2
3 3
4 4
5 5

Since the variable x exists only in mydata, to transform x, I must somehow tell R it is stored in mydata. Thesimplest way to do that is using dollar format: mydata$x. I’ll make a copy of the data first so we can do the transformation several ways:

> mydata.new <- mydata

> mydata.new$x2 <- mydata.new$x  ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100

> mydata.new

  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

That works, but I had to type more characters for the “mydata.new" part than I did for the transformation itself. So let’s look at approaches that save us that trouble. One widely used approach is to use theattach function. This function makes a copy of a data frame’s variables in a temporary area that is attached to your search path as separate variables or vectors. That’s nice because you can refer to them simply by their names like “x" instead of “mydata$x". However, the attach function is tricky to use. Here’s the most common mistake made by beginners:

> mydata.new <- mydata
> attach(mydata.new)
> x2 <- x  ^ 2 
> x3 <- x2 + 100

> mydata.new
  x
1 1
2 2
3 3
4 4
5 5

There are no error messages, but the variables are not in the data frame! The attach function allows you to use short names to refer to variables in a data frame, but it does not change where new variables are written. So x2 and x3 are simply in my workspace:

> ls()
[1] "mydata" "mydata.new" "x2" "x3"

> x2; x3

[1]  1  4  9 16 25
[1] 101 104 109 116 125

I’ll fix that, but first I’ll remove x2 and x3 from the workspace and detachmydata.new so we can start fresh.

> rm(x2, x3)
> detach(mydata.new)

We can fix this problem by directing new variables into the data frame using dollar format. So here’s the next thing a beginner is likely to try:

> mydata.new <- mydata
> attach(mydata.new)

> mydata.new$x2 <- x  ^ 2
> mydata.new$x3 <- x2 + 100
  Error: object 'x2' not found 
> detach(mydata.new)</pre>

The variable x2 got created and put into mydata.new. However, when the attempt to create x3 was run, variable x2 could not be found. This is due to the fact that the attached version of the data is a copy that was done in the past, it is not a live connection. Therefore, to refer to simply “x2" you would have to attach mydata.new again. You could also get around this problem by using dollar format in the second equation:

> attach(mydata.new)
> mydata.new$x2 <- x  ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100

> mydata.new
  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

> detach(mydata.new)

That worked, but having to keep track of when you do and don’t need dollar format seems more trouble than it’s worth. In addition, the fact that attach actually makes a copy of the data means that it wastes both time and memory.

The transform function lets you use short variable names on both sides of the equation, and it does not need to make a copy of the data set. Let’s just square x to see how it works.

> mydata.new <- transform(mydata, x2 = x ^ 2)

> mydata.new
  x x2
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25

Notice that when calling the transform function, new variable names like x2 are actually the names of arguments, and the formulas are the values of those arguments. As a result, the equals sign is used instead of the assignment operator “<-".

Eliminating the tedious repetition of“mydata$…"makes the formulas easier to enter, read and debug. However, thetransformfunction has a problem: it is unable to use a variable that it just created. For example:

> mydata.new <- transform(mydata,
+                 x2 = x  ^ 2,
+                 x3 = x2 + 100 )

Error in eval(expr, envir, enclos) : object 'x2' not found

We see that when attempting to create x3 from x2, the variable x2 is not found. It will not exist until the call to transform is complete. In our simple example, x2 may be merely an intermediate step, and we could avoid this problem by calculating x3 directly with one formula: x3 = (x ^ 2) + 100. However, if we really need x2 to exist later as a variable, we would have to run transform twice, once to create x2 and again to create x3 from it.

In the above code, note the comma between the two equations. Since transform uses equations as the values of tranform’s arguments, all equations must be followed by commas, except for the last one, which is followed by the final close parenthesis.

Hadley Wickham’sdplyr package has a very useful function, mutate. It’s very similar to the base transform function but it can use variables that it just created:

> library("dplyr")
> mydata.new <- mutate(mydata,
+                 x2 = x ^ 2,
+                 x3 = x2 + 100
+                 )

> mydata.new 
x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

However, mutate does have a limitation: it cannot re-create a variable that it just created. So you can use its new variables only on the right-hand side of your equations. In this next example, rather than create x3, I’ll continue to use the name x2:

> mydata.new <- mutate(mydata,
+                 x2 = x  ^ 2, 
+                 x2 = x2 + 100)

> mydata.new
  x x2
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25

As you can see, mutate kept only the first transformation to x2, ignoring the addition of 100. You might think that reusing the same variable name would be a rare occurrence, but if you are recoding a variable using theifelse function (albeit inefficiently) this situation can arise often. (Avoid that by nesting multiple calls to ifelse, which is also more efficient.)

Finally, we come to thewithinfunction. It uses variables by their short names, saves new variables inside the data frame using short names, and it allows you to use new variables anywhere in calculations. It is built into base R, and it works like this:

> mydata.new <- within(mydata, {
+              x2 <- x  ^ 2
+              x3 <- x2 + 100
+              } )

> mydata.new
  x  x3 x2
1 1 101  1
2 2 104  4
3 3 109  9
4 4 116 16
5 5 125 25

Notice that we’re back to using the assignment operator“<-"and commas are not used between formulas. Multiple formulas must be enclosed in {braces}. Also note that the variables appear in the data frame in reverse order. Variable x3 appears before x2, even though the formula for x2 appeared first.

When I reuse the variable name x2 rather than create a new variable, x3, I still get the right answer:

> mydata.new <- within(mydata, {
+               x2 <- x  ^ 2
+               x2 <- x2 + 100
+               } )

> mydata.new
  x  x2
1 1 101
2 2 104
3 3 109
4 4 116
5 5 125

Since the within function does this example so well, why use anything else? The mutate function shares syntax with dplyr’s summarise function and their combination provides great flexibility when doing transformations or getting summary statistics by groups. Because of this, I use mutate to do this type of task and remember to not transform a variable that I just created!

That covers the main ways to transform variables in R. I hope that by understanding the limitations of each, you’ll avoid common pitfalls and be a more productive R user.

转载文献:http://r4stats.com/2013/01/22/comparing-tranformation-styles/

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 211,265评论 6 490
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,078评论 2 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 156,852评论 0 347
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,408评论 1 283
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,445评论 5 384
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,772评论 1 290
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,921评论 3 406
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,688评论 0 266
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,130评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,467评论 2 325
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,617评论 1 340
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,276评论 4 329
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,882评论 3 312
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,740评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,967评论 1 265
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,315评论 2 360
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,486评论 2 348

推荐阅读更多精彩内容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi阅读 7,308评论 0 10
  • **2014真题Directions:Read the following text. Choose the be...
    又是夜半惊坐起阅读 9,435评论 0 23
  • 早起。将卧室、客厅收拾停当后,我便开始去洗漱间打理自己。口腔有点发炎,昨儿一整天里都没有走进厨房。你瞧,这地面上,...
    鱼然然妈妈阅读 216评论 0 1
  • 翦梦阅读 418评论 5 15
  • 今日天寒雨暴, 精英湿身多少。 点赞其精神, 壮志雄心不老。 奔跑,奔跑, 运动健身真好。
    翔遠阅读 276评论 0 1