Efficiently replacing negative values in large datasets conditional on type
Question
In my dataset:
# A tibble: 240 x 1,415
matchcode S001 S002 S002EVS S003 S003A S004 S006 S007 S007_01 S008 S009 S009A S010 S010_01 S010_02 S010_03 S010_04 S011 S012 S013 S013B S014 S015 S016 S017 S017A
<fct> <dbl> <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl> <fct> <fct> <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
1 "JPN 198~ 2 1 -4 392 392 -4 324 324 3920120324 -4 JP JP -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 0.6789805 0.6789805
2 "MEX 198~ 2 1 -4 484 484 -4 933 2130 4840120926 -4 MX MX -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.1378840 1.1378840
3 "HUN 198~ 2 1 -4 348 348 -4 1280 4321 3480121280 -4 HU HU -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.0635516 1.0635516
4 "AUS 198~ 2 1 -4 36 36 -4 973 5478 360120973 -4 AU AU -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 0.9616138 0.9616138
5 "ARG 198~ 2 1 -4 32 32 -4 874 6607 320120874 -4 AR AR -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 0.9266260 0.9266260
6 "FIN 198~ 2 1 -4 246 246 -4 385 7123 2460120385 -4 FI FI -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.0000000 1.0000000
7 "KOR 198~ 2 1 -4 410 410 -4 3 7744 4100120003 -4 KR KR -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.0000000 1.0000000
8 "ZAF 198~ 2 1 -4 710 710 -4 5420 10260 7100121549 -4 ZA ZA -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.0000000 1.0000000
9 "ARG 199~ 2 2 -4 32 32 -4 856 11163 320240856 -4 AR AR 125 -4 -4 -4 -4 1210 -4 1 -4 -4 -4 -4 1.0000000 1.0000000
10 "BLR 199~ 2 2 -4 112 112 -4 106 11415 1120240106 -4 BY BY -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.0000000 1.0000000
To replace all negative values with NA, I used the following code:
df[df < 0] <- NA
However, I only want this operation carried out on columns that are not character columns (I want to get rid of the error messages without suppressing them). The variable charcol holds the names of the columns that should be skipped. I tried:
df[-charcol][df[-charcol] < 0] <- NA
Which gave me the error:
Error: cannot allocate vector of size 1.8 Gb
In addition: Warning messages:
1: In Ops.factor(left, right) : ‘<’ not meaningful for factors
Although I probably got the syntax wrong, I am wondering what the most efficient solution to this kind of problem is for large datasets. I have been looking at the data.table vignette for a while, but I cannot really figure out the syntax.
Any suggestions?
str(WVSsample)
Classes ‘data.table’ and 'data.frame': 240 obs. of 1415 variables:
$ matchcode : Factor w/ 240 levels "ALB 1998 ","ALB 2002 ",..: 108 134 88 12 4 73 117 232 5 25 ...
$ S001 :Class 'labelled' atomic [1:240] 2 2 2 2 2 2 2 2 2 2 ...
.. ..- attr(*, "label")= chr "Study"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
.. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
$ S002 :Class 'labelled' atomic [1:240] 1 1 1 1 1 1 1 1 2 2 ...
.. ..- attr(*, "label")= chr "Wave"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:11] -5 -4 -3 -2 -1 1 2 3 4 5 ...
.. .. ..- attr(*, "names")= chr [1:11] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
$ S002EVS :Class 'labelled' atomic [1:240] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
.. ..- attr(*, "label")= chr "EVS-wave"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:9] -5 -4 -3 -2 -1 1 2 3 4
.. .. ..- attr(*, "names")= chr [1:9] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
$ S003 :Class 'labelled' atomic [1:240] 392 484 348 36 32 246 410 710 32 112 ...
.. ..- attr(*, "label")= chr "Country/region"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:199] -5 -4 -3 -2 -1 4 8 12 16 20 ...
.. .. ..- attr(*, "names")= chr [1:199] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
$ S003A :Class 'labelled' atomic [1:240] 392 484 348 36 32 246 410 710 32 112 ...
.. ..- attr(*, "label")= chr "Country/regions [with split ups]"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:199] -5 -4 -3 -2 -1 4 8 12 16 20 ...
.. .. ..- attr(*, "names")= chr [1:199] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
$ S004 :Class 'labelled' atomic [1:240] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
.. ..- attr(*, "label")= chr "Set"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
.. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
EDIT: @chinsoon12 mentioned using the following piece of code:
f_dowle3 = function(DT) {
for (j in seq_len(ncol(DT)))
set(DT,which(is.na(DT[[j]])),j,0)
}
This code however does not do two things:
- It replaces NAs with zeros, whereas I want to replace negative values with NA. I need to change the which(is.na(DT[[j]])) part to something like DT[[j]] < 0.
I changed the code to:
f_dowle3 = function(DT) {
# or by number (slightly faster than by name) :
for (j in seq_len(ncol(DT)))
set(DT,which(DT[[j]]<0),j,NA)
}
But this makes the dataset NULL. Could anyone help me adapt the code properly?
Recommended answer
Since this is a duplicate, I will delete this answer shortly; I am posting it here because it does not fit in a comment.
setDT(df)
cols <- names(df)[sapply(df, is.numeric)]
for (x in cols) {
set(df, which(df[[x]] < 0), x, NA_real_)
}
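As a rough illustration of the loop above, here is a minimal sketch on a toy table. The data values and the helper name replace_negatives are made up for this example and are not from the original post. It assumes the data.table package is installed. Because set() modifies the table by reference, no full-size logical copy of the data is built, unlike df[df < 0].

```r
library(data.table)

# Toy data (hypothetical values): two numeric columns, a factor and a character column.
DT <- data.table(
  matchcode = factor(c("JPN 1981", "MEX 1981", "ARG 1991")),
  S002EVS   = c(-4, -4, 2),
  S017      = c(0.68, -1, 1),
  S009      = c("JP", "MX", "AR")
)

# Hypothetical helper wrapping the loop from the answer.
replace_negatives <- function(DT) {
  # Only numeric columns are touched; factor and character columns are skipped.
  num_cols <- names(DT)[sapply(DT, is.numeric)]
  for (col in num_cols) {
    # which() drops NA comparisons, so values that are already NA are left alone.
    set(DT, i = which(DT[[col]] < 0), j = col, value = NA_real_)
  }
  # A function whose last expression is a for loop evaluates to NULL,
  # so return the table explicitly (invisibly, to avoid printing).
  invisible(DT)
}

replace_negatives(DT)
```

One possible explanation for the NULL result in the question (the call itself is not shown, so this is a guess): if the output of f_dowle3 was assigned back, as in df <- f_dowle3(df), df would become NULL, because the value of that function is the value of its for loop. Calling the function purely for its side effect, or returning the table as above, avoids this.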