ggplot scale_y_log10() issue


Problem description

I've run into an interesting problem with scaling using ggplot. I have a dataset that I can graph just fine using the default linear scale, but when I use scale_y_log10() the numbers go way off. Here is some example code and two pictures. Note that the max value in the linear scale is ~700 while the log scaling results in a value of 10^8. I show below that the entire dataset is only ~8000 entries long, so something is not right.

I imagine the problem has something to do with the structure of my dataset and the binning, as I cannot replicate this error on a common dataset like 'diamonds'. However, I am not sure of the best way to troubleshoot.

thanks, zach cp


Edit: bdamarest can reproduce the scale problem on the diamond dataset like this:

example_1 = ggplot(diamonds, aes(x=clarity, fill=cut)) + 
  geom_bar() + scale_y_log10(); print(example_1)


#data.melt is the name of my dataset    
> ggplot(data.melt, aes(name, fill= Library)) + geom_bar()  
> ggplot(data.melt, aes(name, fill= Library)) + geom_bar()  + scale_y_log10()
> length(data.melt$name)
[1] 8003 

Here is some example data... and I think I see the problem. The original melted dataset may have been ~10^8 rows long. Maybe the row numbers are being used for the stats?

> head(data.melt)
       Library         name               group
221938      AB Arthrofactin        glycopeptide
235087      AB   Putisolvin      cyclic peptide
235090      AB   Putisolvin      cyclic peptide
222125      AB Arthrofactin        glycopeptide
311468      AB     Triostin cyclic depsipeptide
92249       AB          CDA         lipopeptide


> dput(head(test2))
structure(list(Library = c("AB", "AB", "AB", "AB", "AB", "AB"
), name = c("Arthrofactin", "Putisolvin", "Putisolvin", "Arthrofactin", 
"Triostin", "CDA"), group = c("glycopeptide", "cyclic peptide", 
"cyclic peptide", "glycopeptide", "cyclic depsipeptide", "lipopeptide"
)), .Names = c("Library", "name", "group"), row.names = c(221938L, 
235087L, 235090L, 222125L, 311468L, 92249L), class = "data.frame")


UPDATE:

Row numbers are not the issue. Here is the same data graphed with the same variable (name) used for both the aes x-axis and the fill color, and the scaling is entirely correct:

> ggplot(data.melt, aes(name, fill= name)) + geom_bar()
> ggplot(data.melt, aes(name, fill= name)) + geom_bar() + scale_y_log10()
> length(data.melt$name)
[1] 8003

Solution

geom_bar and scale_y_log10 (or any logarithmic scale) do not work well together and do not give expected results.

The first fundamental problem is that bars go to 0, and on a logarithmic scale, 0 is transformed to negative infinity (which is hard to plot). The usual workaround is to start the bars at 1 rather than 0 (since $\log(1)=0$), not plot anything if there were 0 counts, and not worry about the distortion, because if a log scale is needed you probably don't care about being off by 1 (not necessarily true, but...)
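A quick base R check makes this first problem concrete. This is only an illustrative sketch; the vector `counts` is a made-up example and not data from the question.

```r
# Hypothetical bar heights, including an empty category (illustrative only)
counts <- c(0, 1, 10, 100)

# log10(0) is -Inf, so a bar whose base sits at 0 has no finite position
log_counts <- log10(counts)
print(log_counts)

# Only the zero count is non-finite; a baseline of 1 maps to log10(1), which is 0
print(is.finite(log_counts))
```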

I'm using the diamonds example that @dbemarest showed.

The general way to do this is to transform the coordinate system, not the scale (more on the difference later).

ggplot(diamonds, aes(x=clarity, fill=cut)) +
  geom_bar() +
  coord_trans(ytrans="log10")

But this gives an error

Error in if (length(from) == 1 || abs(from[1] - from[2]) < 1e-06) return(mean(to)) : 
  missing value where TRUE/FALSE needed

which arises from the negative infinity problem.

When you use a scale transformation, the transformation is applied to the data first, then the stats and arrangements (such as stacking) are computed on the transformed values, and then the scales are labeled using the inverse transformation (roughly). You can see what is happening by breaking out the calculations yourself.

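library(plyr)  # provides ddply() and summarise()
# Count the rows for each clarity/cut combination, then take log10 of each count
DF <- ddply(diamonds, .(clarity, cut), summarise, n=length(clarity))
DF$log10n <- log10(DF$n)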

which gives

> head(DF)
  clarity       cut   n   log10n
1      I1      Fair 210 2.322219
2      I1      Good  96 1.982271
3      I1 Very Good  84 1.924279
4      I1   Premium 205 2.311754
5      I1     Ideal 146 2.164353
6     SI2      Fair 466 2.668386

If we plot this in the normal way, we get the expected bar plot:

ggplot(DF, aes(x=clarity, y=n, fill=cut)) + 
  geom_bar(stat="identity")

Scaling the y axis of this plot gives the same problem as with the data that was not pre-summarized.

ggplot(DF, aes(x=clarity, y=n, fill=cut)) +
  geom_bar(stat="identity") +
  scale_y_log10()

We can see how the problem happens by plotting the log10() values of the counts.

ggplot(DF, aes(x=clarity, y=log10n, fill=cut)) +
  geom_bar(stat="identity")

This looks just like the one with the scale_y_log10, but the labels are 0, 5, 10, ... instead of 10^0, 10^5, 10^10, ...

So using scale_y_log10 makes the counts, converts them to logs, stacks those logs, and then displays the scale in the anti-log form. Stacking logs, however, is not a linear transformation, so what you have asked it to do does not make any sense.
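The arithmetic can be checked by hand with the I1 counts from the `head(DF)` output shown earlier. This is a sketch in base R; the variable names (`n_I1`, `linear_top`, `log_tops`, `reported_top`) are introduced here for illustration. Stacking adds the log10 values, and the axis then reports 10 raised to that sum, which is the product of the counts rather than their total.

```r
# Counts for clarity I1, copied from the head(DF) output
n_I1 <- c(Fair = 210, Good = 96, "Very Good" = 84, Premium = 205, Ideal = 146)

# Height of the stacked bar on a linear scale: the true total
linear_top <- sum(n_I1)

# Tops of the stacked segments when the log10 values are stacked
log_tops <- cumsum(log10(n_I1))

# What the log axis reports for the top of the stack: 10^(sum of logs)
reported_top <- 10^sum(log10(n_I1))

print(linear_top)    # 741
print(reported_top)  # roughly 5.07e10
print(prod(n_I1))    # matches reported_top up to floating-point rounding
```

This would be consistent with what the question describes: a bar built from several stacked fill groups can appear to reach 10^8 even though the dataset has only about 8000 rows, whereas with `fill = name` each bar has a single segment, nothing is stacked, and the scale looks right.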

The bottom line is that stacked bar charts on a log scale don't make much sense because they can't start at 0 (where the bottom of a bar should be), and comparing parts of the bar is not reasonable because their size depends on where they are in the stack. Consider instead something like:

ggplot(diamonds, aes(x=clarity, y=..count.., colour=cut)) + 
  geom_point(stat="bin") +
  scale_y_log10()

Or if you really want a total for the groups that stacking the bars usually would give you, you can do something like:

ggplot(diamonds, aes(x=clarity, y=..count..)) + 
  geom_point(aes(colour=cut), stat="bin") +
  geom_point(stat="bin", colour="black") +
  scale_y_log10()
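The two snippets above use the ggplot2 syntax that was current when this answer was written. As a hedged sketch for more recent releases (assuming ggplot2 3.3.0 or later, where `after_stat()` is available and a discrete x is counted with `stat = "count"` rather than `stat = "bin"`), the second plot might be written as follows. The object name `p_total` is introduced here for illustration.

```r
library(ggplot2)

# Per-cut counts as coloured points, plus the per-clarity total in black,
# on a log10 y axis. Nothing is stacked, so the log scale stays meaningful.
p_total <- ggplot(diamonds, aes(x = clarity, y = after_stat(count))) +
  geom_point(aes(colour = cut), stat = "count") +
  geom_point(stat = "count", colour = "black") +
  scale_y_log10()

print(p_total)
```

Dropping the black `geom_point()` layer gives the equivalent of the first snippet, with only the per-cut counts.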
