Data visualization with ggplot2

2020-04-25

1.1. First step: a ggplot is built from layers, and each command adds one layer

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph.
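The layering idea can be checked directly. This is a minimal sketch, assuming ggplot2 is installed; the names `base` and `with_points` are illustrative, and reading the `layers` field relies on how ggplot objects are currently structured.

```r
library(ggplot2)

# ggplot() alone sets up the data and coordinate system but draws nothing
base <- ggplot(data = mpg)
n_before <- length(base$layers)

# each `+ geom_*()` call appends one layer
with_points <- base + geom_point(mapping = aes(x = displ, y = hwy))
n_after <- length(with_points$layers)

c(before = n_before, after = n_after)
```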

1.2. The function `geom_point()` adds a layer of points to your plot, which creates a scatterplot.

The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.
ggplot() and geom_point() are functions; mapping is an argument.
Mapping a third variable to an aesthetic:
ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,color=Species))

ggplot(data=iris)+geom_point(mapping = aes(x=Species,y=Sepal.Length,color=Sepal.Width))

Several aesthetics can be combined in one call:

ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,size=Species,color=Species))
Warning message:
Using size for a discrete variable is not advised.

1.3. Aesthetic properties can also be set manually

ggplot(data=iris) + geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,color="grey"))

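Here color is set inside aes(), so the string "grey" is treated as a data value mapped to color, not as the color itself. The points are drawn in the default palette color and a legend entry labeled "grey" appears.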

ggplot(data=iris) + geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length),color="grey")

Here color is set outside aes(). It does not map any variable; it only changes the appearance of the points drawn by geom_point().
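The difference between the two calls can be inspected without looking at the picture. This is a sketch, assuming ggplot2 is installed; `layer_data()` returns the data a layer actually draws, and the names `p_in` and `p_out` are illustrative.

```r
library(ggplot2)

# color inside aes(): "grey" is mapped as a one-level categorical value
p_in <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length, color = "grey"))

# color outside aes(): "grey" is used literally as the point color
p_out <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length), color = "grey")

colour_in <- unique(layer_data(p_in)$colour)    # a default palette color
colour_out <- unique(layer_data(p_out)$colour)  # the literal "grey"
```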

One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven’t accidentally written code like this:

ggplot(data = mpg) 
+ geom_point(mapping = aes(x = displ, y = hwy))

1.4. Faceting

Note: facet_wrap() is added to the plot with +, at the same level as the geom function; it does not go inside aes().

     ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))+facet_wrap(~Species,nrow=2)

Note: Species is a discrete variable. Faceting on the continuous variable Sepal.Width instead gives one panel per distinct value:

>     ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))+facet_wrap(~Sepal.Width,nrow=4)
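One way to keep the number of panels manageable is to bin the continuous variable before faceting. This is a sketch, assuming ggplot2 is installed; the choice of 3 bins is arbitrary, and `cut_number()` is the ggplot2 helper that makes bins with roughly equal numbers of observations.

```r
library(ggplot2)

# faceting directly on Sepal.Width gives one panel per distinct value
n_panels <- length(unique(iris$Sepal.Width))

# binning first keeps the panel count small
width_bin <- cut_number(iris$Sepal.Width, n = 3)
n_bins <- nlevels(width_bin)

p <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length)) +
  facet_wrap(vars(cut_number(Sepal.Width, n = 3)), nrow = 1)
```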

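Summarizing the iris data (distinct() and count() come from dplyr):

> library(dplyr)
> p<-iris
> distinct(p)     # distinct rows of the whole data frame

> distinct(p,Sepal.Length)     # show the distinct values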
   Sepal.Length
1           5.1
2           4.9
3           4.7
4           4.6
5           5.0
6           5.4
7           4.4
8           4.8
9           4.3
10          5.8
11          5.7
12          5.2
13          5.5
14          4.5
15          5.3
16          7.0
17          6.4
18          6.9
19          6.5
20          6.3
21          6.6
22          5.9
23          6.0
24          6.1
25          5.6
26          6.7
27          6.2
28          6.8
29          7.1
30          7.6
31          7.3
32          7.2
33          7.7
34          7.4
35          7.9
> count(p,Sepal.Length)    # count the occurrences of each distinct value
# A tibble: 35 x 2
   Sepal.Length     n
          <dbl> <int>
 1          4.3     1
 2          4.4     3
 3          4.5     1
 4          4.6     4
 5          4.7     2
 6          4.8     5
 7          4.9     6
 8          5      10
 9          5.1     9
10          5.2     4
# … with 25 more rows
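The two summaries are consistent with each other: count() returns one row per distinct value, and its n column adds up to the number of rows. This is a small check, assuming dplyr is installed; the variable names are illustrative.

```r
library(dplyr)

distinct_lengths <- distinct(iris, Sepal.Length)
length_counts <- count(iris, Sepal.Length)

# one row per distinct value in both results
same_rows <- nrow(distinct_lengths) == nrow(length_counts)

# the counts cover every observation exactly once
covers_all <- sum(length_counts$n) == nrow(iris)

c(same_rows = same_rows, covers_all = covers_all)
```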

1.5. Comparing facet_grid(): the variable with more unique values should generally go in the columns

ggplot(data=mpg)+
    geom_point(mapping = aes(drv,y=cyl))

> ggplot(data=mpg)+
+     geom_point(mapping = aes(drv,y=cyl))+
+     facet_grid(drv~cyl)

> ggplot(data=mpg)+
+     geom_point(mapping = aes(drv,y=cyl))+
+     facet_grid(cyl~drv)

> ggplot(data=mpg)+
+     geom_point(mapping = aes(drv,y=cyl))+
+     facet_grid(drv~.)
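To decide which variable goes in the columns, count the unique values first. This is a sketch, assuming ggplot2 is installed; plotting displ against hwy inside the panels is my choice for readability, not part of the notes above.

```r
library(ggplot2)

# number of unique values per candidate faceting variable
n_unique <- sapply(mpg[c("drv", "cyl")], function(x) length(unique(x)))
n_unique

# cyl has more unique values, so it goes on the column side of the formula
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
```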

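About stroke

> ggplot(data=iris)+
+     geom_point(mapping = aes(x=Sepal.Length,y=Sepal.Width,color=Species),shape=21,fill="lightpink",stroke=1)

Zoomed in, each point's outline is colored by Species and its interior is filled with lightpink. fill and stroke are set outside aes() because they are fixed values, not mappings.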

1.6. Geometric objects

> ggplot(data=iris)+
+     geom_smooth(mapping = aes(x=Sepal.Length,y=Sepal.Width,color=Species))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

When several layers share the same mappings, they can be given once in ggplot():

> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_point()+
+     geom_smooth()

(The fully spelled-out form is, of course:)

> ggplot(data = iris)+
+     geom_point(mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_smooth(mapping = aes(x=Sepal.Length,y=Sepal.Width))

Both produce the same plot, as expected.

A mapping can also be added to one layer only:

> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_point(mapping = aes(color=Species))+
+     geom_smooth()

Likewise, a layer can be given its own data: local settings override global ones.

> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_point(mapping = aes(color=Species),show.legend = F)+
+     geom_smooth(data=filter(iris,Species=="setosa"))
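The override can be confirmed by looking at what each layer draws. This is a sketch, assuming ggplot2 is installed; it uses base `subset()` in place of dplyr's filter() so that it needs only one package, and it names the loess method and formula explicitly to avoid the informational message.

```r
library(ggplot2)

setosa <- subset(iris, Species == "setosa")

p <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(mapping = aes(color = Species), show.legend = FALSE) +
  geom_smooth(data = setosa, method = "loess", formula = y ~ x)

# layer 1 uses the global data: all rows of iris
n_points <- nrow(layer_data(p, 1))

# layer 2 uses its own data: the curve stays within the setosa range
smooth_range <- range(layer_data(p, 2)$x)
setosa_range <- range(setosa$Sepal.Length)
```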

Exercise

p1 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5) +
      geom_smooth(se = F, size = 1.5)

p2 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5) +
      geom_smooth(se = F, size = 1.5, mapping = aes(group = drv))

p3 <- ggplot(data = mpg, mapping = aes(displ, hwy, color = drv)) +
      geom_point(size = 2.5) +
      geom_smooth(se = F, size = 1.5, mapping = aes(group = drv, color = drv))

p4 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5, mapping = aes(color = drv)) +
      geom_smooth(se = F, size = 1.5)

p5 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5, mapping = aes(color = drv)) +
      geom_smooth(se = F, size = 1.5, mapping = aes(group = drv, linetype = drv))

p6 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5, mapping = aes(color = drv))

library(gridExtra)     # arrange several plots on one page
grid.arrange(p1, p2, p3, p4, p5, p6, ncol= 2, nrow = 3)

1.7. Statistical transformations

geom_bar
view(diamonds)
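The default stat for geom_bar is stat_count, which computes two new variables: count and prop (proportion).

By default, the y axis of a bar chart is the count for each x value. In this example the x axis holds the five levels of cut (cut quality), and the chart counts the diamonds in each level automatically. To show each level's share of the total instead of the count, use prop.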

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

group = 1 treats all diamonds as a single group, so each bar shows that cut's share of the whole. Without it, each cut level forms its own group and every proportion is 100%.
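The proportions that stat_count computes with group = 1 can be compared with a hand calculation. This is a sketch, assuming ggplot2 3.3.0 or later, where `after_stat(prop)` is the newer spelling of `..prop..`; the variable names are illustrative.

```r
library(ggplot2)

# proportions computed by hand from the raw data
manual <- as.numeric(prop.table(table(diamonds$cut)))

# the same proportions as computed by stat_count with group = 1
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
computed <- layer_data(p)$prop

all.equal(manual, computed)
```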

> ggplot(data = diamonds) + 
+     stat_summary(
+         mapping = aes(x = cut, y = depth),
+         fun.min = min,
+         fun.max = max,
+         fun = median
+     )

stat_summary(
  mapping = NULL,
  data = NULL,
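  geom = "pointrange",    # default geom of `stat_summary`
  position = "identity",    # position adjustment; separately, the default stat of `geom_pointrange` is "identity", so the pairing does not work in reverse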

So, to draw the stat_summary plot with the geom function instead of the stat function, the stat has to be named explicitly:

ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary"
  )

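(This plot shows no visible error bars: with no summary functions given, stat_summary falls back to mean_se(), and the standard errors here are very small.)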
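To get the min/median/max ranges back while still calling the geom function, the summary functions can be passed through to the stat. This is a sketch, assuming ggplot2 3.3.0 or later (where the arguments are named `fun`, `fun.min` and `fun.max`); the check against `tapply()` is only there to confirm what is drawn.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )

# compare the drawn lower ends with a hand-computed minimum per cut
built <- layer_data(p)
built <- built[order(built$x), ]
by_hand <- as.numeric(tapply(diamonds$depth, diamonds$cut, min))

all.equal(as.numeric(built$ymin), by_hand)
```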

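geom_col is for the most common kind of bar chart, where both x and y are mapped (x is usually a factor, which is why the result is bars rather than a curve).
For example: ggplot(data, aes(x = x, y = y)) +
geom_col()
geom_bar is for bar charts of counts: only x is mapped (again usually a factor). It counts the observations at each x value and puts that count on the y axis.
The difference is whether y is mapped.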
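The relationship can be shown by counting first and then using geom_col, which should give the same bar heights as geom_bar on the raw data. This is a sketch, assuming ggplot2 is installed; `cut_counts` and the column name `n` are illustrative.

```r
library(ggplot2)

# pre-computed counts: one row per cut level
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")

# geom_bar counts for us; geom_col plots the counts we supply
p_bar <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
p_col <- ggplot(data = cut_counts) + geom_col(mapping = aes(x = cut, y = n))

heights_bar <- as.numeric(layer_data(p_bar)$y)
heights_col <- as.numeric(layer_data(p_col)$y)

all.equal(heights_bar, heights_col)
```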

Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

Complementary geoms and stats

geom	stat
geom_bar()	stat_count()
geom_bin2d()	stat_bin_2d()
geom_boxplot()	stat_boxplot()
geom_contour()	stat_contour()
geom_count()	stat_sum()
geom_density()	stat_density()
geom_density_2d()	stat_density_2d()
geom_hex()	stat_bin_hex()
geom_freqpoly()	stat_bin()
geom_histogram()	stat_bin()
geom_qq_line()	stat_qq_line()
geom_qq()	stat_qq()
geom_quantile()	stat_quantile()
geom_smooth()	stat_smooth()
geom_violin()	stat_ydensity()
geom_sf()	stat_sf()
geom_pointrange()	stat_identity()

They tend to have their names in common, such as stat_smooth() and geom_smooth(). However, this is not always the case: geom_bar() with stat_count() and geom_histogram() with stat_bin() are notable counter-examples.
If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.

ggplot2 geom layers and their default stats

geom	default stat
geom_abline()	-
geom_hline()	-
geom_vline()	-
geom_bar()	stat_count()
geom_col()	-
geom_bin2d()	stat_bin_2d()
geom_blank()	-
geom_boxplot()	stat_boxplot()
geom_contour()	stat_contour()
geom_count()	stat_sum()
geom_density()	stat_density()
geom_density_2d()	stat_density_2d()
geom_dotplot()	-
geom_errorbarh()	-
geom_hex()	stat_bin_hex()
geom_freqpoly()	stat_bin()
geom_histogram()	stat_bin()
geom_crossbar()	-
geom_errorbar()	-
geom_linerange()	-
geom_pointrange()	-
geom_map()	-
geom_point()	-
geom_path()	-
geom_line()	-
geom_step()	-
geom_polygon()	-
geom_qq_line()	stat_qq_line()
geom_qq()	stat_qq()
geom_quantile()	stat_quantile()
geom_ribbon()	-
geom_area()	-
geom_rug()	-
geom_smooth()	stat_smooth()
geom_spoke()	-
geom_label()	-
geom_text()	-
geom_raster()	-
geom_rect()	-
geom_tile()	-
geom_violin()	stat_ydensity()
geom_sf()	stat_sf()

ggplot2 stat layers and their default geoms

stat	default geom
stat_ecdf()	geom_step()
stat_ellipse()	geom_path()
stat_function()	geom_path()
stat_identity()	geom_point()
stat_summary_2d()	geom_tile()
stat_summary_hex()	geom_hex()
stat_summary_bin()	geom_pointrange()
stat_summary()	geom_pointrange()
stat_unique()	geom_point()
stat_count()	geom_bar()
stat_bin_2d()	geom_tile()
stat_boxplot()	geom_boxplot()
stat_contour()	geom_contour()
stat_sum()	geom_point()
stat_density()	geom_area()
stat_density_2d()	geom_density_2d()
stat_bin_hex()	geom_hex()
stat_bin()	geom_bar()
stat_qq_line()	geom_path()
stat_qq()	geom_point()
stat_quantile()	geom_quantile()
stat_smooth()	geom_smooth()
stat_ydensity()	geom_violin()
stat_sf()	geom_rect()
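Entries in these tables can be spot-checked from R rather than from the documentation. This is a sketch, assuming ggplot2 is installed; it relies on the fact that a layer object currently exposes its `stat` and `geom` as fields, and the helpers `default_stat()` and `default_geom()` are hypothetical names of my own.

```r
library(ggplot2)

# class name of the stat (or geom) attached to a freshly created layer
default_stat <- function(layer) class(layer$stat)[1]
default_geom <- function(layer) class(layer$geom)[1]

geom_defaults <- c(
  geom_bar = default_stat(geom_bar()),
  geom_col = default_stat(geom_col()),
  geom_histogram = default_stat(geom_histogram()),
  geom_violin = default_stat(geom_violin())
)

stat_defaults <- c(
  stat_count = default_geom(stat_count()),
  stat_summary = default_geom(stat_summary()),
  stat_density = default_geom(stat_density()),
  stat_ecdf = default_geom(stat_ecdf())
)

geom_defaults
stat_defaults
```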

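About geom_smooth: three fitting methods are compared here
glm fits generalized linear models; it can also be used for ordinary linear regression
lm fits linear models and cannot fit generalized linear models
loess fits a local polynomial regression, which gives a smooth curve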

> p1<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+     geom_point() +
+     geom_smooth( method = glm,se=FALSE)
> p2<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+     geom_point() +
+     geom_smooth( method = lm,se=FALSE)
> p3<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+     geom_point() +
+     geom_smooth( method = loess,se=FALSE)
> library(gridExtra)
> grid.arrange(p1,p2,p3,ncol=2,nrow=2)
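The glm and lm panels look alike because glm with its default gaussian family is an ordinary linear regression. This is a small check outside ggplot2, assuming only that the mpg data from ggplot2 is available; it fits one overall line rather than one line per class.

```r
library(ggplot2)  # only for the mpg data set

fit_lm <- lm(hwy ~ displ, data = mpg)
fit_glm <- glm(hwy ~ displ, data = mpg)  # family defaults to gaussian

# the two fits have the same coefficients
all.equal(coef(fit_lm), coef(fit_glm))
```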

About group = 1

> p1=ggplot(data = diamonds) +
+     geom_bar(mapping = aes(x = cut, y = ..prop..))
> p2=ggplot(data = diamonds) +
+     geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
> p3=ggplot(data = diamonds) +
+     geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..,group=1))
> grid.arrange(p1,p2,p3,ncol=2,nrow=2)

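Because the y axis is ..prop.., that is, each category's share of the total, group = 1 is needed so that all categories are treated as one group and each category's share is computed within that whole.
Otherwise each category is its own group by default. For counts this gives an ordinary bar chart, but for proportions every category is 100% of its own group, so all bars reach the same height of 1. This is the plot from the first command.
With a fill mapping as well, such as fill = color, each color level inside each bar has height 1, and the seven stacked colors reach 7 on the y axis. This is the plot from the second command.
Author: 咕噜咕噜转的ATP合酶
Link: //www.greatytc.com/p/f36c3f8cfb24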
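The heights described above can be read off the built plots. This is a sketch, assuming ggplot2 3.3.0 or later (`after_stat(prop)` stands in for `..prop..`); the object names `p_plain` and `p_fill` are mine.

```r
library(ggplot2)

# no group = 1: every cut level is its own group
p_plain <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# fill = color as well: every cut-by-color cell is its own group
p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))

props_plain <- unique(layer_data(p_plain)$prop)  # each bar is 100% of its group
top_fill <- max(layer_data(p_fill)$ymax)         # height of the tallest stack
```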

1.8 Position adjustments

> ggplot(data=iris)+
+     geom_bar(mapping = aes(x=Sepal.Width,y=Sepal.Length,fill=Species),stat="identity")

> p1=ggplot(data=iris)+
+     geom_bar(mapping = aes(x=Sepal.Width,fill=Species),alpha=3/5,position = "identity")
> p2=ggplot(data=iris)+
+     geom_bar(mapping = aes(x=Sepal.Width,fill=Species),alpha=3/5)
> p3=ggplot(data=iris)+
+     geom_bar(mapping = aes(x=Sepal.Width,color=Species),fill=NA,position = "identity")
> p4=ggplot(data=iris)+
+     geom_bar(mapping = aes(x=Sepal.Width,fill=Species),position = "fill")
> grid.arrange(p1,p2,p3,p4,ncol=2,nrow=2)
> p5=ggplot(data = iris)+
+     geom_bar(mapping=aes(x=Sepal.Width,fill=Species),position="dodge")
> grid.arrange(p1,p2,p3,p4,p5,ncol=2,nrow=3)

Compare the plots with and without position = "identity": with it, the bars overlap one another instead of being stacked.
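The overlap versus stacking difference shows up in where each bar starts. This is a sketch, assuming ggplot2 is installed; `p_identity` and `p_stack` are illustrative names.

```r
library(ggplot2)

p_identity <- ggplot(data = iris) +
  geom_bar(mapping = aes(x = Sepal.Width, fill = Species), position = "identity")

p_stack <- ggplot(data = iris) +
  geom_bar(mapping = aes(x = Sepal.Width, fill = Species))  # default: stack

# with "identity" every bar starts at 0, so bars at the same x overlap
all_from_zero <- all(layer_data(p_identity)$ymin == 0)

# with the default stacking some bars start on top of another bar
some_stacked <- any(layer_data(p_stack)$ymin > 0)

c(all_from_zero = all_from_zero, some_stacked = some_stacked)
```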

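About overplotting:
The values are rounded, so many points share exactly the same position and the overlapping points cannot be told apart.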

> p6=ggplot(data=iris)+
+     geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length),position="jitter")
> p7=ggplot(data=iris)+
+     geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))
> grid.arrange(p6,p7,ncol=2,nrow=2)

jitter adds a small amount of random noise to each data point.

> ggplot(data=iris,mapping = aes(x=Sepal.Width,y=Sepal.Length))+
+     geom_jitter()

This gives an equivalent result: geom_jitter() is shorthand for geom_point(position = "jitter").

Fine-tuning jitter

> p8=ggplot(data=mpg,mapping=aes(x=cty,y=hwy))+
+     geom_jitter(aes(color=class))
> p <- ggplot(mpg, aes(cyl, hwy))
> p9 <- p+geom_jitter(aes(color=class))
> grid.arrange(p8,p9,ncol=2,nrow=2)
> p10=ggplot(data=mpg,mapping=aes(x=cyl,y=hwy))+
+     geom_jitter(aes(color=class))
> grid.arrange(p8,p9,p10,ncol=2,nrow=2)
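The amount of jitter can be controlled with the width and height arguments of position_jitter(). This is a sketch, assuming ggplot2 is installed; the values 0.2 and 0 and the seed 42 are arbitrary choices of mine. It jitters only horizontally, so hwy is left untouched.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point(
    mapping = aes(color = class),
    # up to 0.2 left or right, no vertical noise, reproducible via seed
    position = position_jitter(width = 0.2, height = 0, seed = 42)
  )

drawn <- layer_data(p)

# cyl is an integer, so the horizontal offset is the distance to the nearest integer
max_offset <- max(abs(drawn$x - round(drawn$x)))

# the y values are exactly the original hwy values
y_unchanged <- identical(sort(as.numeric(drawn$y)), sort(as.numeric(mpg$hwy)))
```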

Compare and contrast geom_jitter() with geom_count().

The geom geom_jitter() adds random variation to the locations of the points on the graph. In other words, it “jitters” the locations of points slightly. This method reduces overplotting since two points with the same location are unlikely to have the same random variation.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()

However, the reduction in overlapping comes at the cost of slightly changing the x and y values of the points.

The geom geom_count() sizes the points relative to the number of observations. Combinations of (x, y) values with more observations will be larger than those with fewer observations.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
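What geom_count() draws can be inspected the same way: one point per distinct (x, y) pair, with the number of observations stored in the computed variable n. This is a sketch, assuming ggplot2 is installed; the variable names are illustrative.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

drawn <- layer_data(p)

# one drawn point per distinct (cty, hwy) combination
n_points <- nrow(drawn)
n_pairs <- nrow(unique(mpg[c("cty", "hwy")]))

# the per-point counts add up to the number of observations
total <- sum(drawn$n)
```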

The geom_count() geom does not change x and y coordinates of the points. However, if the points are close together and counts are large, the size of some points can itself create overplotting. For example, in the following example, a third variable mapped to color is added to the plot. In this case, geom_count() is less readable than geom_jitter() when adding a third variable as a color aesthetic.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
  geom_jitter()


ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
  geom_count()


As that example shows, there is unfortunately no universal solution to overplotting. The costs and benefits of different approaches will depend on the structure of the data and the goal of the data scientist.

1.9 Coordinate systems

coord_flip: swaps the x and y axes
coord_quickmap: sets an appropriate aspect ratio for maps
coord_polar: uses polar coordinates

usa<-map_data("usa")
nz<-map_data("nz")
ggplot(usa, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()

ggplot(iris, aes(x = factor(1), fill = Species)) +
  geom_bar()
ggplot(iris, aes(x = factor(1), fill = Species)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

The argument theta = "y" maps y to the angle of each section. If coord_polar() is specified without theta = "y", then the resulting plot is called a bulls-eye chart.
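The two variants differ only in which variable is mapped to the angle. This is a sketch, assuming ggplot2 is installed; reading `theta` from the plot's coordinate object relies on how coord_polar() currently stores its settings, and the names `pie` and `bullseye` are illustrative.

```r
library(ggplot2)

bar <- ggplot(iris, aes(x = factor(1), fill = Species)) +
  geom_bar(width = 1)

pie <- bar + coord_polar(theta = "y")  # angle follows the stacked counts
bullseye <- bar + coord_polar()        # default: angle follows x

c(pie = pie$coordinates$theta, bullseye = bullseye$coordinates$theta)
```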

Last edited: 2020.05.18 21:18:41
