Visualising a big set of points with a third feature as color: a way to improve speed


Problem description


I have a pretty big dataset (around 5e5 rows) of (x, y) coordinates with an additional feature z. It looks something like this:

x <- rnorm(1e6, 0, 5)
y <- rnorm(1e6, 0, 10)
dist <- sqrt(x^2 + y^2)
z <- exp(-(dist / 8)^2)

I want to plot them with the z feature used as the color aesthetic. But a simple geom_point takes a while with such a big dataset:

library(tidyverse)  # provides %>% and ggplot()

data.frame(x, y, z) %>%
  ggplot() + geom_point(aes(x, y, color = z))

So I think I need some way to aggregate the points. One approach would be to divide the plane into small squares and average the z values of all points that fall in each square. But that can get cumbersome in the long run, and it is probably better to use a tool that already exists. So I thought of geom_hex as a geom that would look good in my case. However, its fill aesthetic defaults to the count of points. So my questions are:

  • Can the default fill value of geom_hex easily be changed to the average of the z feature?
  • If not, how can I create hexagons instead of squares, so that z can be averaged within each hexagon and then plotted?
  • Is there any other way to improve the speed of plotting such a dataset?
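
For reference, the square-binning idea described above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the cell size `cell` and the names `pts`, `x_bin`, `y_bin` and `binned` are made up for the example and do not come from the question.

```r
# Sketch only: average z inside square cells using base R.
set.seed(1)
n <- 1e5
x <- rnorm(n, 0, 5)
y <- rnorm(n, 0, 10)
z <- exp(-(sqrt(x^2 + y^2) / 8)^2)

cell <- 1  # hypothetical side length of each square, in data units

# Snap every point to the centre of the square that contains it
pts <- data.frame(
  x_bin = (floor(x / cell) + 0.5) * cell,
  y_bin = (floor(y / cell) + 0.5) * cell,
  z = z
)

# One row per non-empty square: mean of z and number of points
binned <- aggregate(z ~ x_bin + y_bin, data = pts,
                    FUN = function(v) c(mean = mean(v), n = length(v)))

# aggregate() returns a matrix column; flatten it into z.mean and z.n
binned <- do.call(data.frame, binned)
```

The much smaller `binned` data frame could then be drawn with something like `ggplot(binned, aes(x_bin, y_bin, fill = z.mean)) + geom_tile()`.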

Edit:

Comparison of proposed solutions:

library(tidyverse)
library(microbenchmark)

microbenchmark(
  'stat_summary_hex' = {
    data.frame(x, y, z) %>%
      ggplot(aes(x, y, z = z)) + stat_summary_hex(fun = function(x) mean(x))
  },
  'round_and_group' = {
    data.frame(x, y, z) %>%
      mutate(x = round(x, 0), y = round(y, 0)) %>%
      group_by(x, y) %>%
      summarize(z = mean(z)) %>%
      ggplot() + geom_hex(aes(x, y, fill = z), stat = "identity")
  }
)

Unit: milliseconds
             expr        min        lq       mean     median        uq        max neval
 stat_summary_hex   2.243791   2.38539   2.454039   2.426123   2.50871   2.963176   100
  round_and_group 183.785828 186.38851 188.296828 187.347476 189.10874 218.668487   100
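
One thing worth keeping in mind when reading these timings: a ggplot object is only a description of the plot, and the binning is carried out when the plot is built or printed. The benchmarked expressions never print the plot, so the stat_summary_hex row may mostly reflect object construction, while the round_and_group row includes the dplyr aggregation. A minimal sketch of how the two steps could be timed separately, assuming ggplot2 and hexbin are installed (the sample size and the names `df`, `p`, `hex_data` are illustrative only):

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(1e5, 0, 5), y = rnorm(1e5, 0, 10))
df$z <- exp(-(sqrt(df$x^2 + df$y^2) / 8)^2)

# Step 1: only constructs the plot object, no binning happens yet
t_construct <- system.time(
  p <- ggplot(df, aes(x, y, z = z)) + stat_summary_hex(fun = mean)
)

# Step 2: layer_data() builds the plot, which runs the hex binning
t_build <- system.time(
  hex_data <- layer_data(p, 1)
)

print(t_construct)
print(t_build)
```

Actually drawing the plot (for example with `print(p)`) adds rendering time on top of the build step.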

Solution

Maybe stat_summary_hex() or stat_summary_2d() could help.

They are similar to stat_summary(): the data are divided into bins by x and y, and z is then summarised within each bin using the function passed to stat_summary_hex() (or stat_summary_2d()).

library(tidyverse)

data.frame(x, y, z) %>%
  # fun is the function applied to the z values that fall in each hexagon
  ggplot(aes(x, y, z = z)) + stat_summary_hex(fun = function(x) mean(x))
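
If square cells are acceptable, stat_summary_2d() can be used the same way. A small sketch, assuming ggplot2 is installed and using a reduced sample so it runs quickly (the names `df`, `p_sq` and `tiles` are illustrative):

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(1e4, 0, 5), y = rnorm(1e4, 0, 10))
df$z <- exp(-(sqrt(df$x^2 + df$y^2) / 8)^2)

# Same idea as stat_summary_hex(), but with rectangular bins
p_sq <- ggplot(df, aes(x, y, z = z)) +
  stat_summary_2d(fun = mean, bins = 30)

# One row per non-empty bin after the summary has been computed
tiles <- layer_data(p_sq, 1)
nrow(tiles)
```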

This answers your second question (hexagons) and, judging by the timings you posted, your third one (performance seems fine). It is used in place of geom_hex(), so there seems to be a trade-off between geom_hex() and speed.

EDIT

Looking at your question, I've microbenchmarked the function with different numbers of points:

Unit: milliseconds
  expr      min       lq     mean   median       uq       max neval
 3.5e5 205.0363 214.6925 236.8149 225.2286 238.6536  494.7897   100
   1e6 575.4861 597.4161 665.4396 620.9151 702.1622 1143.7011   100

You can also specify bins to get more or less "precise" hexes. The default value is 30, which means the points are plotted in an area of roughly 30 * 30 hexes:

data.frame(x, y, z) %>%                                                                                            
ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x), bins = 60)
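
To get a feel for what bins does, the number of hexagons actually drawn can be compared for two settings. This is a rough sketch, assuming ggplot2 and hexbin are installed. `count_hexes` is a hypothetical helper written for this example, not a ggplot2 function.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(1e4, 0, 5), y = rnorm(1e4, 0, 10))
df$z <- exp(-(sqrt(df$x^2 + df$y^2) / 8)^2)

# Hypothetical helper: build the plot and count the non-empty hexagons
count_hexes <- function(bins) {
  p <- ggplot(df, aes(x, y, z = z)) +
    stat_summary_hex(fun = mean, bins = bins)
  nrow(layer_data(p, 1))
}

n30 <- count_hexes(30)
n60 <- count_hexes(60)
c(bins_30 = n30, bins_60 = n60)
```

More bins means more, smaller hexagons, so finer detail at the cost of more shapes to draw.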

As an example (multiplot() is a helper function that is not part of the tidyverse, so define or load it first if necessary):

set.seed(1)
x <- rnorm(1e4, 0, 5)                                                     
y <- rnorm(1e4, 0, 10)                                                    
dist <- sqrt(x^2 + y^2)                                                   
z <- exp(-(dist / 8)^2) 

library(tidyverse)

a1 <- data.frame(x, y, z) %>% 
      ggplot() + geom_point(aes(x, y, color = z)) 

b1 <- data.frame(x, y, z) %>%  
     ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x))

c1 <- data.frame(x, y, z) %>%  
      ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x), bins = 60)

multiplot(a1,b1,c1, cols = 3)

As you can see, the more hexes you add, the closer the plot gets to your original points.


