Is there a more efficient way to replace NULL with NA in a list?
Question
I quite often come across data that is structured something like this:
employees <- list(
  list(id = 1,
       dept = "IT",
       age = 29,
       sportsteam = "softball"),
  list(id = 2,
       dept = "IT",
       age = 30,
       sportsteam = NULL),
  list(id = 3,
       dept = "IT",
       age = 29,
       sportsteam = "hockey"),
  list(id = 4,
       dept = NULL,
       age = 29,
       sportsteam = "softball"))
In many cases such lists could be tens of millions of items long, so memory use and efficiency are always a concern.
I would like to turn the list into a dataframe but if I run:
library(data.table)
employee.df <- rbindlist(employees)
I get errors because of the NULL values. My normal strategy is to use a function like:
nullToNA <- function(x) {
  x[sapply(x, is.null)] <- NA
  return(x)
}
Then:
employees <- lapply(employees, nullToNA)
employee.df <- rbindlist(employees)
which returns:
id dept age sportsteam
1: 1 IT 29 softball
2: 2 IT 30 NA
3: 3 IT 29 hockey
4: 4 NA 29 softball
However, the nullToNA function is very slow when applied to 10 million cases, so a more efficient approach would be welcome.
One thing that seems to slow the process down is that the is.null function can only be applied to one item at a time (unlike is.na, which can scan a full list in one go).
Any advice on how to do this operation efficiently on a large dataset?
Recommended answer
Many efficiency problems in R are solved by first changing the original data into a form that makes the processes that follow as fast and easy as possible. Usually, this is matrix form.
If you bring all the data together with rbind, your nullToNA function no longer has to search through nested lists, and therefore sapply serves its purpose (looking through a matrix) more efficiently. In theory, this should make the process faster.
Here is how that looks:
> dat <- do.call(rbind, lapply(employees, rbind))
> dat
id dept age sportsteam
[1,] 1 "IT" 29 "softball"
[2,] 2 "IT" 30 NULL
[3,] 3 "IT" 29 "hockey"
[4,] 4 NULL 29 "softball"
> nullToNA(dat)
id dept age sportsteam
[1,] 1 "IT" 29 "softball"
[2,] 2 "IT" 30 NA
[3,] 3 "IT" 29 "hockey"
[4,] 4 NA 29 "softball"
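Note that the result above is still a list-matrix, not a data frame. As a rough sketch of how the same idea might be carried through, the example below swaps sapply(x, is.null) for base R's lengths(), which is vectorised over a list, and then flattens each column into an atomic vector. This is an assumption-laden illustration rather than part of the original answer: it assumes every non-NULL cell holds exactly one value, and the names cols and employee.df are chosen only for the example. It has not been benchmarked on 10 million rows.

```r
# Same sample data as in the question.
employees <- list(
  list(id = 1, dept = "IT", age = 29, sportsteam = "softball"),
  list(id = 2, dept = "IT", age = 30, sportsteam = NULL),
  list(id = 3, dept = "IT", age = 29, sportsteam = "hockey"),
  list(id = 4, dept = NULL, age = 29, sportsteam = "softball"))

# Build the list-matrix exactly as in the answer.
dat <- do.call(rbind, lapply(employees, rbind))

# lengths() inspects every cell in one call; NULL cells have length 0.
# (Assumption: no other zero-length values occur in the data.)
dat[lengths(dat) == 0] <- NA

# Flatten each column of the list-matrix into an atomic vector.
# (Assumption: each cell now holds exactly one value, so unlist()
# returns one entry per row.)
cols <- lapply(seq_len(ncol(dat)), function(j) unlist(dat[, j]))
names(cols) <- colnames(dat)

employee.df <- as.data.frame(cols, stringsAsFactors = FALSE)
```

The NULL cells must be replaced before unlist() is called, because unlist() silently drops NULL entries and would leave the affected columns shorter than the others.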