Fast way to replace all blanks with NA in R data.table
Question
I have a large data.table object (1M rows and 220 columns) and I want to replace all blanks ('') with NA. I found a solution in another post, but it is extremely slow on my data table (it has already been running for over 15 minutes). Example from the other post:
data = data.frame(cats=rep(c('', ' ', 'meow'),1e6),
dogs=rep(c("woof", " ", NA),1e6))
system.time(x<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
Is there a faster, more data.table-idiomatic way to achieve this?
Admittedly, the data above does not look much like my original data; it was only meant as an example. The following subset of my real data gives the charToDate(x) error:
DT <- data.table(ID=c(10),DEFAULT_DATE=as.Date("2012-07-31"),value='')
system.time(DT[DT=='']<-NA)
Recommended answer
Here's probably the generic data.table way of doing this. I'm also going to use your regex, which handles several types of blanks (I haven't seen other answers doing this). You probably shouldn't run this over all your columns, but only over the factor or character ones, because other classes won't accept blank values.
For factors
indx <- which(sapply(data, is.factor))
for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_)
For characters
indx2 <- which(sapply(data, is.character))
for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)
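The two loops above can be tried end to end on a small table. The following is a minimal sketch, not the answerer's exact code: the toy table `DT`, the extra `animal` factor column, and the helper names `blank`, `chr_cols` and `fct_cols` are assumptions made for illustration. It reuses the question's regex and leaves the Date column alone, since only character and factor columns are scanned.

```r
library(data.table)

# Hypothetical toy table modelled on the question's subset,
# with an added factor column for illustration
DT <- data.table(
  ID = c(10, 11, 12),
  DEFAULT_DATE = as.Date("2012-07-31"),
  value = c("", " ", "meow"),
  animal = factor(c("woof", "", " "))
)

blank <- "^$|^ $"  # same regex as in the question: empty string or a single space

# Only character and factor columns are scanned; other classes are skipped
chr_cols <- names(DT)[sapply(DT, is.character)]
fct_cols <- names(DT)[sapply(DT, is.factor)]

# set() updates by reference, so no copy of the table is made
for (j in chr_cols) set(DT, i = grep(blank, DT[[j]]), j = j, value = NA_character_)
for (j in fct_cols) set(DT, i = grep(blank, DT[[j]]), j = j, value = NA_integer_)

print(DT)
```

Note that the factor column keeps '' and ' ' among its levels after this; if that matters, something like droplevels() could be applied afterwards.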