R: suitable plot to display data with skewed counts


Question

I have data like this:

Name     Count
Object1  110
Object2  111
Object3  95
Object4  40
...
Object2000 1

So only the first 3 objects have high counts; the remaining objects have counts of 40 or less, and the majority are below 10. I am plotting this data with a ggplot bar chart like this:

ggplot(data=object_count, mapping = aes(x=object, y=count)) +
  geom_bar(stat="identity") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

My plot is below. As you can see, because there are so many objects with low counts, the plot is very wide and each bar is so thin that even the high-count objects are almost invisible. Is there a better way to represent this data? My goal is to show a few top-count objects and to show that there are many low-count ones. Is there a way to group the low-count ones together?

Recommended answer

My guess is that your data looks something like this:

library(tidyverse)  # tibble, dplyr, ggplot2 and the %>% pipe used below

set.seed(1)
object_count <- tibble(
  obj_num = 1:2000,
  object = paste0("Object", obj_num),
  count = ceiling(20 * rpois(2000, 10) / obj_num)
)
head(object_count)
## A tibble: 6 x 3
#   obj_num object  count
#     <int> <chr>   <dbl>
#1        1 Object1   160
#2        2 Object2   100
#3        3 Object3    46
#4        4 Object4    55
#5        5 Object5    56
#6        6 Object6    40

Sure enough, when I plot that with ggplot(object_count, aes(object, count)) + geom_col() + [theme stuff], I get a similar figure.

Here are some strategies "to show a few top-count objects and to show there are many low-count ones."

A vanilla histogram might not be clarifying here, since the important big values appear dramatically less often than the small ones and would not be prominent enough:

ggplot(object_count, aes(count)) +
  geom_histogram() 

But we could change that by transforming the y axis so that bins containing only a few objects stay visible next to the very tall bins of small counts. The pseudo_log transformation is nice for that since it works like a log transform for large values but is roughly linear near zero (about -1 to 1), so bins with zero objects are still well defined. In this view, we can clearly see where the outliers with just one appearance are, but also see that there are many more small values. The binwidth = 1 here could be set to something wider if the specific values of the big counts aren't as important as their general range.

ggplot(object_count, aes(count)) +
  geom_histogram(binwidth = 1) +
  scale_y_continuous(trans = "pseudo_log",
                     breaks = c(0:3, 100, 1000), minor_breaks = NULL)
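
To make the "log-like for large values, linear near zero" behaviour concrete, here is a minimal sketch that evaluates the transformation directly. It assumes the scales package (the one ggplot2 uses for its named transformations) and the default settings of pseudo_log_trans(); the data frame name comparison is only illustrative.

```r
library(scales)

# pseudo_log_trans() defaults to sigma = 1 and the natural-log base
tr <- pseudo_log_trans()

x <- c(0, 0.1, 1, 10, 100, 1000)
comparison <- data.frame(
  x          = x,
  pseudo_log = tr$transform(x),  # asinh(x / 2) with the defaults
  plain_log  = log(x)            # -Inf at x = 0
)
print(comparison)
```

For small x the pseudo_log column grows roughly like x / 2, while for large x it is nearly identical to log(x). That is why empty bins and bins with a single object remain drawable on this axis, whereas a plain log scale cannot show a height of zero.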

Another option could be to split your view into two pieces, one with detail on the big values, the other showing all the small values:

object_count %>%
  mutate(biggies = if_else(count > 20, "Big", "Little")) %>%
  ggplot(aes(obj_num, count)) +
  geom_col() +
  facet_grid(~biggies, scales = "free") 

Another option might be to lump together all the objects with counts under 10. The version below emphasizes the object name and count, and the "Others" category is labeled to show how many objects it includes.

object_count %>%
  mutate(group = if_else(count < 10, "Others", object)) %>%
  group_by(group) %>%
  summarize(avg = mean(count), count = n()) %>%
  ungroup() %>%
  mutate(group = if_else(group == "Others",
                         paste0("Others (n = ", count, ")"),
                         group)) %>%
  mutate(group = forcats::fct_reorder(group, avg)) %>%
  ggplot() + 
  geom_col(aes(group, avg)) +
  geom_text(aes(group, avg, label = round(avg, 0)), hjust = -0.5) +
  coord_flip()
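
As an alternative way to "group the low-count ones together", the lumping can also be done with forcats::fct_lump_n(), which keeps the n levels with the largest weighted counts and pools everything else. This is a swapped-in technique rather than what the answer above does (it uses a fixed threshold of 10). The toy data frame and its values are hypothetical stand-ins for object_count, and keeping the top 3 is an arbitrary choice.

```r
library(dplyr)
library(forcats)

# Hypothetical stand-in for object_count (not the asker's real data)
toy <- tibble(
  object = paste0("Object", 1:8),
  count  = c(110, 111, 95, 40, 3, 2, 1, 1)
)

lumped <- toy %>%
  # keep the 3 objects with the largest counts; pool the rest as "Other"
  mutate(group = fct_lump_n(object, n = 3, w = count)) %>%
  group_by(group) %>%
  summarize(total = sum(count), n_objects = n())

print(lumped)
```

The resulting table has one row per kept object plus an "Other" row whose n_objects column says how many objects were pooled, so it can be passed to geom_col() in the same way as the summary above. Note that this version sums the pooled counts rather than averaging them.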

If you're interested in the share of the total count, you might also look at the cumulative count and see how the big values make up a large share:

object_count %>%
  mutate(cuml = cumsum(count)) %>%
  ggplot(aes(obj_num)) +
  # center each tile halfway between the previous and current running total
  geom_tile(aes(y = cuml - count / 2,
            height = count))
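
The same idea can be checked numerically before plotting. The sketch below sorts objects from largest to smallest count and computes the running share of the total. The toy data frame is a hypothetical stand-in for object_count, and the column name cuml_share is only illustrative.

```r
library(dplyr)

# Hypothetical stand-in for object_count (not the asker's real data)
toy <- tibble(
  object = paste0("Object", 1:8),
  count  = c(110, 111, 95, 40, 3, 2, 1, 1)
)

shares <- toy %>%
  arrange(desc(count)) %>%                          # largest counts first
  mutate(cuml_share = cumsum(count) / sum(count))   # running share of the total

print(shares)
```

Reading down the cuml_share column shows how few objects are needed to reach a given share of the total, which is the message the cumulative plot is meant to convey.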
