Bar plot with log scales

Question

I've run into an interesting problem with scaling using ggplot. I have a dataset that I can graph just fine using the default linear scale, but when I use scale_y_log10() the numbers go way off. Here is some example code and two pictures. Note that the max value on the linear scale is ~700, while the log scaling results in a value of 10^8. I show below that the entire dataset is only ~8000 entries long, so something is not right.

I imagine the problem has something to do with the structure of my dataset and the binning, as I cannot replicate this error on a common dataset like 'diamonds'. However, I am not sure of the best way to troubleshoot.

Thanks, zach cp

bdamarest can reproduce the scale problem on the diamond dataset like this:

example_1 = ggplot(diamonds, aes(x=clarity, fill=cut)) + 
  geom_bar() + scale_y_log10(); print(example_1)


#data.melt is the name of my dataset    
> ggplot(data.melt, aes(name, fill= Library)) + geom_bar()  
> ggplot(data.melt, aes(name, fill= Library)) + geom_bar()  + scale_y_log10()
> length(data.melt$name)
[1] 8003 

Here is some example data... and I think I see the problem. The original melted dataset may have been ~10^8 rows long. Maybe the row numbers are being used for the stats?

> head(data.melt)
       Library         name               group
221938      AB Arthrofactin        glycopeptide
235087      AB   Putisolvin      cyclic peptide
235090      AB   Putisolvin      cyclic peptide
222125      AB Arthrofactin        glycopeptide
311468      AB     Triostin cyclic depsipeptide
92249       AB          CDA         lipopeptide


> dput(head(test2))
structure(list(Library = c("AB", "AB", "AB", "AB", "AB", "AB"
), name = c("Arthrofactin", "Putisolvin", "Putisolvin", "Arthrofactin", 
"Triostin", "CDA"), group = c("glycopeptide", "cyclic peptide", 
"cyclic peptide", "glycopeptide", "cyclic depsipeptide", "lipopeptide"
)), .Names = c("Library", "name", "group"), row.names = c(221938L, 
235087L, 235090L, 222125L, 311468L, 92249L), class = "data.frame")


Update:

Row numbers are not the issue. Here is the same data graphed using the same aes x-axis and fill color and the scaling is entirely correct:

> ggplot(data.melt, aes(name, fill= name)) + geom_bar()
> ggplot(data.melt, aes(name, fill= name)) + geom_bar() + scale_y_log10()
> length(data.melt$name)
[1] 8003

Recommended answer

geom_bar and scale_y_log10 (or any logarithmic scale) do not work well together and do not give expected results.

The first fundamental problem is that bars go to 0, and on a logarithmic scale, 0 is transformed to negative infinity (which is hard to plot). The usual crib around this is to start at 1 rather than 0 (since $\log(1)=0$), to plot nothing if there were 0 counts, and not to worry about the distortion, because if a log scale is needed you probably don't care about being off by 1 (not necessarily true, but...)
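The negative-infinity point can be checked directly in base R. This is only a small illustration; the vector name `counts` and its values are made up for the example and are not from the question's data.

```r
# Base R only: how log10 treats the values a bar has to span.
# `counts` is a made-up illustrative vector.
counts <- c(0, 1, 10, 700)
log_counts <- log10(counts)
print(log_counts)  # the first element is -Inf, the second is 0

# A bar drawn from 0 up to its count would need a lower edge at -Inf,
# which is not a finite position on the axis.
print(is.finite(log10(0)))

# Starting the bar at 1 gives a finite lower edge (log10(1) is 0).
print(log10(1))
```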

I'm using the diamonds example that @dbemarest showed.

The general way to do this is to transform the coordinates, not the scale (more on the difference later).

ggplot(diamonds, aes(x=clarity, fill=cut)) +
  geom_bar() +
  coord_trans(ytrans="log10")

But this gives the error

Error in if (length(from) == 1 || abs(from[1] - from[2]) < 1e-06) return(mean(to)) : 
  missing value where TRUE/FALSE needed

which arises from the negative infinity problem.

When you use a scale transformation, the transformation is applied to the data, then stats and arrangements are made, then the scales are labeled in the inverse transformation (roughly). You can see what is happening by breaking out the calculations yourself.

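library(plyr)  # provides ddply() and .()
# Count rows per clarity/cut combination, then take log10 of each count
DF <- ddply(diamonds, .(clarity, cut), summarise, n=length(clarity))
DF$log10n <- log10(DF$n)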

which gives

> head(DF)
  clarity       cut   n   log10n
1      I1      Fair 210 2.322219
2      I1      Good  96 1.982271
3      I1 Very Good  84 1.924279
4      I1   Premium 205 2.311754
5      I1     Ideal 146 2.164353
6     SI2      Fair 466 2.668386

If we plot this in the normal way, we get the expected bar plot:

ggplot(DF, aes(x=clarity, y=n, fill=cut)) + 
  geom_bar(stat="identity")

and scaling the y axis gives the same problem as using the data that was not pre-summarized.

ggplot(DF, aes(x=clarity, y=n, fill=cut)) +
  geom_bar(stat="identity") +
  scale_y_log10()

We can see how the problem happens by plotting the log10() values of the counts.

ggplot(DF, aes(x=clarity, y=log10n, fill=cut)) +
  geom_bar(stat="identity")

This looks just like the one with the scale_y_log10, but the labels are 0, 5, 10, ... instead of 10^0, 10^5, 10^10, ...

So using scale_y_log10 makes the counts, converts them to logs, stacks those logs, and then displays the scale in the anti-log form. Stacking logs, however, is not a linear transformation, so what you have asked it to do does not make any sense.
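To put numbers on this, here is a small base R sketch using the five I1 counts from the head(DF) output shown earlier. The variable names are only illustrative. Stacking the log10 values and reading the result off an anti-log axis gives the product of the counts rather than their sum, which is why the axis runs so far past the true totals.

```r
# Counts for clarity I1, copied from the head(DF) output shown earlier.
n_I1 <- c(Fair = 210, Good = 96, `Very Good` = 84, Premium = 205, Ideal = 146)

true_total  <- sum(n_I1)         # height a linear stacked bar reaches
stacked_log <- sum(log10(n_I1))  # height reached when the log10 values are stacked
shown_total <- 10^stacked_log    # what the anti-log axis label reads at that height

print(true_total)   # 741
print(shown_total)  # on the order of 10^10

# The sum of the logs is the log of the product, so the top of the
# stacked bar sits at the product of the counts.
print(all.equal(shown_total, prod(n_I1)))
```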

The bottom line is that stacked bar charts on a log scale don't make much sense because they can't start at 0 (where the bottom of a bar should be), and comparing parts of the bar is not reasonable because their size depends on where they are in the stack. Consider instead something like:

ggplot(diamonds, aes(x=clarity, y=..count.., colour=cut)) + 
  geom_point(stat="bin") +
  scale_y_log10()

Or, if you really want the group totals that stacking the bars would usually give you, you can do something like:

ggplot(diamonds, aes(x=clarity, y=..count..)) + 
  geom_point(aes(colour=cut), stat="bin") +
  geom_point(stat="bin", colour="black") +
  scale_y_log10()
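On current ggplot2 releases the two point plots above may need adapting: stat="bin" no longer accepts a discrete x such as clarity, and the ..count.. notation is deprecated. A possible equivalent is sketched below, assuming ggplot2 3.3.0 or later (where after_stat() is available). The object names p_points and p_totals are made up for the example.

```r
library(ggplot2)  # assumes ggplot2 >= 3.3.0 for after_stat()

# Per-cut counts within each clarity, drawn as points on a log10 y axis.
# stat = "count" is the counting stat for a discrete x.
p_points <- ggplot(diamonds, aes(x = clarity, y = after_stat(count), colour = cut)) +
  geom_point(stat = "count") +
  scale_y_log10()

# Same plot, plus black points for the total count in each clarity.
# The second layer has no colour mapping, so it counts rows per clarity only.
p_totals <- ggplot(diamonds, aes(x = clarity, y = after_stat(count))) +
  geom_point(aes(colour = cut), stat = "count") +
  geom_point(stat = "count", colour = "black") +
  scale_y_log10()

print(p_points)
print(p_totals)
```

Because each point is placed at its own count and nothing is stacked, the log scale only changes the spacing of the axis, not the values being shown.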
