Inspecting and visualizing gaps/blanks and structure in large dataframes

Question

I have a large dataframe (400,000 x 50) that I want to inspect visually for structure and blanks/gaps.

Is there an existing library or ggplot2 function that can produce a picture like this:

where red might stand for "Dates", blue for "factors", green for "characters", and black for blanks/NAs.

Recommended answer

Have you tried dfviewr in lasagnar? The following reproduces the desired graphic for df.in, the 50 row x 10 column dataframe in the package:

library(devtools)
install_github("swihart/lasagnar")
library(lasagnar)   
dfviewr(df=df.in)
## also try:
##dfviewr(df=df.in, legend=FALSE)
##dfviewr(df=df.in, gridlines=FALSE)

To be fair, dfviewr didn't exist at the time of the question. To see some of the ideas that led to its development, and how to actually visualize 400,000 rows, see the for-loop at the very bottom. Don't be so foolhardy as to run the function directly on df2.in (400,000 x 50):

## Do not run:
## system.time(dfviewr(df=df2.in, gridlines=FALSE)) ## 10 minutes before useRaster=TRUE                                          
                                                    ##  2 minutes after

Also, tabplot::tableplot() doesn't seem to support dates or characters:

library(tabplot)
tableplot(df.in)

produces:

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'character' not implemented

and so we eliminate the character column (#9):

tableplot(df.in[,c(-9)])

produces:

Error in UseMethod("as.hi") : no applicable method for 'as.hi' applied to an object of class "c('POSIXct', 'POSIXt')"

so we eliminate the first column (Date) as well:

tableplot(df.in[,c(-1,-9)])

and get a tableplot of the remaining columns.

And for the 400,000 x 50 df2.in without the Date or character columns, the image rendered quite quickly (6 seconds):

system.time(tableplot(df2.in[,c(-(1+seq(0,40,10)), -(9+seq(0,40,10))) ]))
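The index arithmetic above drops columns 1, 11, 21, 31, 41 (the dates) and 9, 19, 29, 39, 49 (the characters). As a hedged alternative, the unsupported columns could be selected by class rather than by position. The sketch below uses base R only. `drop_unsupported()` is a hypothetical helper name, not part of tabplot, and it assumes that date/date-time and character columns are the only ones that need to be removed.

```r
# Hypothetical helper: keep only columns that are neither character
# nor date/date-time (Date, POSIXct, POSIXlt).
drop_unsupported <- function(df) {
  unsupported <- vapply(
    df,
    function(x) is.character(x) || inherits(x, c("Date", "POSIXt")),
    logical(1)
  )
  df[, !unsupported, drop = FALSE]
}

# Small demonstration on a toy data.frame
toy <- data.frame(
  when = as.POSIXct("2012-04-07", tz = "UTC") + 0:2,
  x    = c(1.5, NA, 3),
  f    = factor(c("A", "B", "A")),
  s    = c("cherlene", "randy", NA),
  stringsAsFactors = FALSE
)
kept <- drop_unsupported(toy)
names(kept)

# The positional indices used above, for comparison
dropped_idx <- sort(c(1 + seq(0, 40, 10), 9 + seq(0, 40, 10)))
dropped_idx
```

With such a helper the call would read tableplot(drop_unsupported(df2.in)), which keeps working if the column order changes.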

I first present a baby example on 50 rows, then an example on the 400,000 rows.

For what it's worth, I second the comment by @cmbarbu: looking at 400K rows on a single plot is limited by a screen that has at best 2K pixels of height, so some way of breaking the plot apart across pages might help prevent overplotting. I include an attempt at this by making a PDF document of 1000 plots/pages with 400 rows each.
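The paging arithmetic can be sketched on its own, independent of any plotting. This is a minimal base R illustration. `page_rows()` is a hypothetical helper, and the page size of 400 is simply the value used in this answer. Unlike a fixed 1:1000 loop, it also copes with a row count that is not a multiple of the page size.

```r
# Split n rows into consecutive pages of at most `page_size` rows.
# Returns a list of integer index vectors, one per page.
page_rows <- function(n, page_size = 400) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / page_size)))
}

pages <- page_rows(400000, 400)
length(pages)           # number of pages
range(pages[[1]])       # first and last row on page 1
range(pages[[1000]])    # first and last row on the final page

# A row count that is not a multiple of the page size leaves a short last page
short <- page_rows(1050, 400)
lengths(short)
```

Each element of `pages` can then be used to subset the rows of the matrix to be drawn on that page.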

I do not know of a function that renders the requested plot directly from a data.frame. My approach is to build a matrix mask of the data.frame and then use lasagna() from the lasagnar package on GitHub. lasagna() is a wrapper for image( t(X)[, (nrow(X):1)] ), where X is a matrix. This call reorders the rows so that they match the order of the data.frame, and the wrapper lets you toggle gridlines and add a legend (legend=TRUE will invoke image.plot( t(X)[, (nrow(X):1)] ); in the example below, however, I add a legend explicitly without using image.plot()).
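To make the orientation step concrete: base image() places element [i, j] of its matrix at x = i, y = j, with y increasing upward, so a matrix passed as-is appears rotated. The following base R sketch isolates the transpose-and-reverse step without requiring lasagnar. `flip_for_image()` is a hypothetical name used only for illustration.

```r
# Hypothetical helper reproducing the transpose-and-reverse step
flip_for_image <- function(X) t(X)[, nrow(X):1, drop = FALSE]

X <- matrix(1:6, nrow = 2, byrow = TRUE)   # rows: (1 2 3) and (4 5 6)
Z <- flip_for_image(X)

dim(Z)    # columns of X run along the x-axis, rows of X along the y-axis
Z[, 2]    # last column of Z is the first row of X, drawn at the top
Z[, 1]    # first column of Z is the last row of X, drawn at the bottom

# image(Z) would then show X the way it prints: row 1 on top, column 1 on the left
```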

library(fields)
library(colorspace)  
library(lubridate)
library(devtools)
install_github("swihart/lasagnar")
library(lasagnar)   

Create an example dataframe of 50 rows (the baby example before the 400K example)

df.in <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'), 
                    by = '1 week'),
           col1=rnorm(50),
           col2=rnorm(50),
           col3=rnorm(50),
           col4=rnorm(50),
           col5=as.factor(c("A","B")),
           col6=as.factor(c("MS","PHD")),
           col7=rnorm(50),
           col8=(c("cherlene","randy")),
           col9=rnorm(50),
           stringsAsFactors=FALSE)

Induce missingness

df.in[19:23  , 2:4  ] <- NA
df.in[c(7, 9),      ] <- NA
df.in[2:30   , 4    ] <- NA
df.in[10     , 7    ] <- NA
df.in[14     , 6:10 ] <- NA

Inspect the structure

str(df.in)

Prepare the mask matrix

mat.out <- matrix(NA, nrow=nrow(df.in), ncol=ncol(df.in))

Then flag the columns by type; apply is.na() last

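## red for dates (ymd() returns Date in recent lubridate versions, POSIXct in older ones)
mat.out[,sapply(df.in, function(x) is.Date(x) || is.POSIXct(x))] <- 1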
## blue for factors
mat.out[,sapply(df.in,is.factor)] <- 2
## green for characters
mat.out[,sapply(df.in,is.character)] <- 3
## white for numeric
mat.out[,sapply(df.in,is.numeric)] <- 4
## black for NA
mat.out[is.na(df.in)] <- 5
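The same masking steps could be wrapped in a reusable function. The sketch below is base R only and is not part of lasagnar. `type_mask()` is a hypothetical helper that uses the same codes as above (1 = date, 2 = factor, 3 = character, 4 = numeric, 5 = NA) and assumes that columns of any other class, such as logical, may simply stay uncoded.

```r
# Hypothetical helper: build a numeric type mask for any data.frame.
# Codes: 1 = date/date-time, 2 = factor, 3 = character, 4 = numeric, 5 = NA.
type_mask <- function(df) {
  code_of <- function(x) {
    if (inherits(x, c("Date", "POSIXt"))) 1
    else if (is.factor(x)) 2
    else if (is.character(x)) 3
    else if (is.numeric(x)) 4
    else NA_real_                      # any other class is left uncoded
  }
  codes <- vapply(df, code_of, numeric(1))
  # Repeat each column's code down all rows, then overwrite missing cells
  mask <- matrix(codes, nrow = nrow(df), ncol = ncol(df), byrow = TRUE)
  mask[is.na(df)] <- 5
  mask
}

toy <- data.frame(
  when = as.Date("2012-04-07") + c(0, 7, 14),
  f    = factor(c("A", NA, "B")),
  s    = c("cherlene", "randy", NA),
  x    = c(NA, 0.2, 0.3),
  stringsAsFactors = FALSE
)
m <- type_mask(toy)
m
```

Because the date check uses inherits(), it covers both Date and POSIXct columns without depending on lubridate.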

Row names may be handy for tracing back to the original data

row.names(mat.out) <- 1:nrow(df.in)

Render { lasagna(X) is image( t(X)[, (nrow(X):1)] ) }

lasagna(mat.out, col=c("red","blue","green","white","black"), 
        cex=0.67, main="")

lasagna(mat.out, col=c("red","blue","green","white","black"), 
        cex=.67, main="")
legend("bottom", fill=c("red","blue","green","white","black"),
        legend=c("dates", "factors", "characters", "numeric", "NA"), 
        horiz=TRUE, xpd=NA, inset=c(-.15), border="black")

lasagna(mat.out, col=c("red","blue","green","white","black"), 
        cex=.67, main="", gridlines=FALSE)
legend("bottom", fill=c("red","blue","green","white","black"),
        legend=c("dates", "factors", "characters", "numeric", "NA"), 
        horiz=TRUE, xpd=NA, inset=c(-.15), border="black")

df2.10 <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'), 
                    by = '1 week'),
           col1=rnorm(400000),
           col2=rnorm(400000),
           col3=rnorm(400000),
           col4=rnorm(400000),
           col5=as.factor(c("A","B")),
           col6=as.factor(c("MS","PHD")),
           col7=rnorm(400000),
           col8=(c("cherlene","randy")),
           col9=rnorm(400000),
           stringsAsFactors=FALSE)

Induce missingness

df2.10[c(19:23), c(2:4)  ] <- NA
df2.10[c(7,  9),         ] <- NA
df2.10[c(2:30), 4        ] <- NA
df2.10[10     , 7        ] <- NA
df2.10[14     , c(6:10)  ] <- NA    
df2.10[c(450:750), ] <- NA
df2.10[c(399990:399999), ] <- NA

cbind into a 50-column-wide df; inspect the structure

df2.in <- cbind(df2.10, df2.10, df2.10, df2.10, df2.10)
str(df2.in)

Prepare the mask matrix

mat.out <- matrix(NA, nrow=nrow(df2.in), ncol=ncol(df2.in))

Then flag the columns by type; apply is.na() last

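## red for dates (ymd() returns Date in recent lubridate versions, POSIXct in older ones)
mat.out[,sapply(df2.in, function(x) is.Date(x) || is.POSIXct(x))] <- 1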
## blue for factors
mat.out[,sapply(df2.in,is.factor)] <- 2
## green for characters
mat.out[,sapply(df2.in,is.character)] <- 3
## white for numeric
mat.out[,sapply(df2.in,is.numeric)] <- 4
## black for NA
mat.out[is.na(df2.in)] <- 5

Row names may be handy for tracing back to the original data

row.names(mat.out) <- 1:nrow(df2.in)

Render { lasagna_plain(X) has no gridlines or row names }

pdf("pages1000.pdf")
  system.time(
    for(i in 1:1000){
        lasagna_plain(mat.out[((i-1)*400+1):(400*i),],
                      col=c("red","blue","green","white","black"), cex=1, 
                      main=paste0("rows: ", (i-1)*400+1,  " - ",  (400*i)))
    }
  )
dev.off()

The for-loop completed in 40 seconds on my machine, and the PDF was finished very shortly thereafter. Now standardize the page size in the PDF viewer and page down, viewing pages/plots such as these:
