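How to visualize confidence intervals, standard errors, means, and medians flexibly and quickly
This section walks through the stat_summary function in detail. If you find it useful, you can follow my WeChat public account R语言数据分析指南, where I keep sharing more resources.
Original article: https://mp.weixin.qq.com/s/v8Vdo8BtKoQdiGrR-GOFYA
Load the required R packages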
# gapminder and Hmisc are both on CRAN, so install.packages() is enough
install.packages(c("gapminder", "Hmisc"))
library(tidyverse)
library(gapminder)
library(Hmisc)
Starting from the diamonds dataset, we create a bar chart of a summary statistic (the mean price per cut):
diamonds %>%
group_by(cut) %>%
summarise(mean = mean(price)) %>%
ggplot(aes(x = cut, y = mean)) +
geom_col()
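This approach works, but it is not the most efficient. First, if ggplot2 can do the calculation itself, there is no need to summarize the data beforehand. Second, the calculation can get fairly involved, especially when I want to visualize confidence intervals.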
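To make the second point concrete, here is a rough sketch of what the manual route looks like once a confidence interval is added. The column names (mean_price, se, lower, upper) and the choice of a t-based 95% interval are my own assumptions for illustration, not something the original post prescribes.

```r
library(dplyr)
library(ggplot2)

# Summarize by hand: mean, standard error, and a t-based 95% interval
price_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    se = sd(price) / sqrt(n()),
    lower = mean_price - qt(0.975, df = n() - 1) * se,
    upper = mean_price + qt(0.975, df = n() - 1) * se
  )

# Plot the pre-computed summary: bars for the mean, error bars for the interval
p_manual <- ggplot(price_summary, aes(x = cut, y = mean_price)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)
p_manual
```

Every extra statistic means another column in summarise() and another aesthetic mapping, which is the bookkeeping stat_summary is meant to take over.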
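What stat_summary() does
Fortunately, the ggplot2 developers have already thought about how to visualize summary statistics directly. The solution is the stat_summary function. We will use the gapminder dataset, which contains life expectancy data for different countries.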
library(tidyverse)
library(gapminder)
gapminder
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ... with 1,694 more rows
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_col()
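As the plot shows, life expectancy has risen over recent decades. However, the bar chart does not show the mean or median life expectancy across countries. Because geom_col stacks every row, each bar is the sum of the life expectancies of all countries in that year.
We can, however, have geom_bar compute the mean life expectancy across countries. All we need to do is name the function to apply to the y variable with fun, and additionally set stat = "summary".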
https://stackoverflow.com/questions/30183199/ggplot2-plot-mean-with-geom-bar
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_bar(fun = "mean", stat = "summary")
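But geom_bar can only draw bars, so we cannot show the same statistic as points or lines. This is where stat_summary shines: it lets us display any summary statistic through different geoms, whether points, lines, or areas.
To draw the same bars with stat_summary, we pass it two arguments. With fun = "mean" (called fun.y before ggplot2 3.3.0) we ask for the mean of lifeExp, and with geom = "bar" we ask for that mean to be drawn as bars.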
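The call described above does not appear as code in the text, so here is a minimal sketch of it. It assumes ggplot2 3.3.0 or later, where the argument is named fun; the object name p_bar is only for illustration.

```r
library(ggplot2)
library(gapminder)

# Mean life expectancy per year, computed by stat_summary and drawn as bars
p_bar <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun = "mean", geom = "bar")
p_bar
```

The result should match the geom_bar(fun = "mean", stat = "summary") plot; the difference is that the geom is now just an argument that can be swapped.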
We can also ask stat_summary for a line chart instead of bars, and add a point for each yearly mean to make the plot easier to read
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
stat_summary(fun = "mean", geom = "point") +
stat_summary(fun = "mean", geom = "line")
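This example shows that several stat_summary layers can be combined. Compared with the previous example, the only change is the geom: we now use points and lines
We can also change which statistic is shown. Life expectancy can differ a lot between countries, so we may prefer the median to the mean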
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
stat_summary(fun = "median", geom = "bar")
We can also draw the mean as a filled area rather than a line
gapminder %>%
mutate(year = as.integer(year)) %>%
ggplot(aes(x = year, y = lifeExp)) +
stat_summary(fun = "mean", geom = "area",
fill = "#EB5286",
alpha = .5) +
stat_summary(fun = "mean", geom = "point",
color = "#6F213F")
In the same way, we can show the highest and lowest life expectancy across countries
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
stat_summary(fun = mean,
geom = "pointrange",
fun.min = min,
fun.max = max)
We can also use classic error bars to show the maximum and minimum
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
stat_summary(geom = "errorbar",
width = 1,
fun.min = min,
fun.max = max)
Showing the standard deviation
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
stat_summary(fun = mean,
geom = "pointrange",
fun.max = function(x) mean(x) + sd(x),
fun.min = function(x) mean(x) - sd(x))
Showing the standard error
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
stat_summary(fun = mean,
geom = "pointrange",
fun.max = function(x) mean(x) + sd(x) / sqrt(length(x)),
fun.min = function(x) mean(x) - sd(x) / sqrt(length(x)))
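The two hand-written ranges above can also be expressed with ready-made helpers. As far as I know, ggplot2 exports mean_se(), which returns the mean plus and minus mult standard errors, and mean_sdl(), a wrapper around Hmisc::smean.sdl whose default is mult = 2 standard deviations. The sketch below assumes Hmisc is installed and sets mult = 1 to match the code above; the object names p_sd and p_se are illustrative.

```r
library(ggplot2)
library(gapminder)

# Mean +/- 1 standard deviation (mean_sdl needs the Hmisc package)
p_sd <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1),
               geom = "pointrange")

# Mean +/- 1 standard error
p_se <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_se", geom = "pointrange")

p_sd
p_se
```

A fun.data function returns y, ymin, and ymax in one go, so it replaces the fun, fun.min, and fun.max trio.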
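The classic choice is a 95% confidence interval. ggplot2 provides two helpers for this, mean_cl_normal and mean_cl_boot, both thin wrappers around functions from the Hmisc package (which is why Hmisc must be installed).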
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
stat_summary(fun.data = "mean_cl_normal")
We can also draw the interval as error bars
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
stat_summary(fun.data = "mean_cl_normal",
geom = "errorbar",
width = .4) +
stat_summary(fun = "mean", geom = "point")
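The examples above only use mean_cl_normal. For completeness, here is a small sketch with mean_cl_boot, which estimates the interval by bootstrap resampling instead of assuming normality. The set.seed() call and the object name p_boot are my additions; because the bootstrap is random, the interval limits will vary slightly from run to run without a seed.

```r
library(ggplot2)
library(gapminder)

# Bootstrap resampling is random, so fix the seed for reproducible limits
set.seed(1)

# Mean with a bootstrapped 95% interval (requires Hmisc);
# stat_summary's default geom is "pointrange"
p_boot <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_cl_boot")
p_boot
```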
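Choosing the confidence level
Fortunately, mean_cl_normal accepts a conf.int argument for changing the confidence level; stat_summary forwards it through fun.args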
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
stat_summary(fun.data = "mean_cl_normal",
fun.args = list(conf.int = .99))
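To see what conf.int actually changes, it can help to call mean_cl_normal directly on a vector. The sketch below uses the 2007 life expectancies as an arbitrary example; the object names life_2007, ci95, and ci99 are illustrative. It assumes Hmisc is installed.

```r
library(ggplot2)
library(gapminder)

life_2007 <- gapminder$lifeExp[gapminder$year == 2007]

# Each call returns a one-row data frame with columns y, ymin, ymax
ci95 <- mean_cl_normal(life_2007)                   # default 95% level
ci99 <- mean_cl_normal(life_2007, conf.int = 0.99)  # wider 99% interval

ci95
ci99
```

The mean (y) is the same in both results; only ymin and ymax move apart as the confidence level rises.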
Combining statistics with multiple geoms
A bar chart with 95% confidence intervals
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = continent, y = lifeExp)) +
stat_summary(fun = "mean", geom = "bar", alpha = .7) +
stat_summary(fun = "mean", geom = "point",
size = 1) +
stat_summary(fun.data = "mean_cl_normal",
geom = "errorbar",
width = .2)
Use position = position_dodge() to show several bars side by side
colors <-c("#E41A1C","#1E90FF","#FF8C00","#4DAF4A","#984EA3",
"#40E0D0","#FFC0CB","#00BFFF","#FFDEAD","#90EE90",
"#EE82EE","#00FFFF","#F0A3FF", "#0075DC",
"#993F00","#4C005C","#2BCE48","#FFCC99",
"#808080","#94FFB5","#8F7C00","#9DCC00",
"#C20088","#003380","#FFA405","#FFA8BB",
"#426600","#FF0010","#5EF1F2","#00998F",
"#740AFF","#990000","#FFFF00")
gapminder %>%
mutate(
year = as.factor(year)
) %>%
ggplot(aes(x = continent, y = lifeExp, fill = year)) +
stat_summary(fun = "mean", geom = "bar",
alpha = .7, position = position_dodge(0.95)) +
stat_summary(fun = "mean", geom = "point",
position = position_dodge(0.95),
size = 1) +
stat_summary(fun.data = "mean_cl_normal",
geom = "errorbar",
position = position_dodge(0.95),
width = .2) +
scale_fill_manual(values = colors)+
theme_minimal()+
scale_y_continuous(expand=c(0,0))
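Although we have discussed the limitations of the geom_*() functions and shown how powerful the stat_*() functions are, both have their place. It is not an either-or question. In fact, they need each other: just as stat_summary() has a geom argument, geom_*() functions have a stat argument. At a higher level, stat_*() and geom_*() are simply convenience wrappers around layer(), the function that builds a ggplot layer
Hadley's own words explain this false dichotomy
Unfortunately, due to an early design mistake I called these either stat_() or geom_(). A better decision would have been to call them layer_() functions: that's a more accurate description because every layer involves a stat and a geom.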
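As a rough illustration of that point, the sketch below builds what should be the same bars three ways: from the stat, from the geom, and by calling layer() directly. The object names are illustrative, and the params list passed to layer() (including na.rm = FALSE) is my assumption about the minimum the wrappers would normally fill in.

```r
library(ggplot2)
library(gapminder)

base <- ggplot(gapminder, aes(x = year, y = lifeExp))

# 1. Start from the stat and choose a geom
p_stat <- base + stat_summary(fun = "mean", geom = "bar")

# 2. Start from the geom and choose a stat
p_geom <- base + geom_bar(stat = "summary", fun = "mean")

# 3. Spell out the layer by hand: geom, stat, and position are all explicit
p_layer <- base + layer(
  geom = "bar", stat = "summary", position = "identity",
  params = list(fun = "mean", na.rm = FALSE)
)

# All three layers compute the same yearly means
layer_data(p_stat)$y
layer_data(p_geom)$y
layer_data(p_layer)$y
```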
Reference: https://cran.r-project.org/web/packages/ggplot2/vignettes/extending-ggplot2.html