Fast way to replace all blanks with NA in R data.table


Question

I have a large data.table object (1M rows and 220 columns) and I want to replace all blanks ('') with NA. I found a solution in another post, but it is extremely slow on my data table (it has already been running for over 15 minutes). Example from the other post:

 data = data.frame(cats=rep(c('', ' ', 'meow'),1e6),
                   dogs=rep(c("woof", " ", NA),1e6))
 system.time(x<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))

Is there a faster, more data.table-idiomatic way to achieve this?

Indeed, the data provided above does not look much like my original data; it was only meant as an example. The following subset of my real data gives the charToDate(x) error:

DT <- data.table(ID=c(10),DEFAULT_DATE=as.Date("2012-07-31"),value='')
system.time(DT[DT=='']<-NA)

Recommended answer

Here is probably the generic data.table way of doing this. I'm also going to use your regex, which handles several types of blanks (I haven't seen other answers doing this). You probably shouldn't run this over all your columns, but only over the factor or character ones, because other classes won't accept blank values.

For factors

indx <- which(sapply(data, is.factor))
for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_) 

For characters

indx2 <- which(sapply(data, is.character)) 
for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)
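
As a minimal end-to-end sketch, the two loops above can be run on a small table that also contains a Date column, similar to the subset in the question. The table contents and the grp column are made up for illustration; only the regex and the set() calls come from the answer. Because only factor and character columns are selected, the Date column is never compared against '' and is left untouched.

```r
library(data.table)

# Small stand-in for the real table; values and the 'grp' column are illustrative only.
DT <- data.table(
  ID           = c(10, 11, 12),
  DEFAULT_DATE = as.Date(c("2012-07-31", "2012-08-01", "2012-08-02")),
  value        = c("", " ", "meow"),
  grp          = factor(c("woof", " ", NA))
)

# Factor columns: rows matching the blank regex are set to NA by reference.
fct_cols <- which(sapply(DT, is.factor))
for (j in fct_cols) {
  set(DT, i = grep("^$|^ $", DT[[j]]), j = j, value = NA_integer_)
}

# Character columns: same idea, with a character NA.
chr_cols <- which(sapply(DT, is.character))
for (j in chr_cols) {
  set(DT, i = grep("^$|^ $", DT[[j]]), j = j, value = NA_character_)
}

print(DT)
```

Note that the now-unused " " level stays in the factor's levels; only the values are set to NA.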
