根据类型有效地替换大型数据集中的负值 [英] Efficiently replacing negative values in large datasets conditional on type

查看:58
本文介绍了根据类型有效地替换大型数据集中的负值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在我的数据集中:

# A tibble: 240 x 1,415
   matchcode S001  S002  S002EVS S003  S003A S004  S006  S007  S007_01    S008  S009  S009A S010  S010_01 S010_02 S010_03 S010_04 S011  S012  S013  S013B S014  S015  S016  S017      S017A    
   <fct>     <dbl> <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl>  <dbl> <fct> <fct> <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
 1 "JPN 198~ 2     1     -4      392   392   -4     324    324 3920120324 -4    JP    JP     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    0.6789805 0.6789805
 2 "MEX 198~ 2     1     -4      484   484   -4     933   2130 4840120926 -4    MX    MX     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.1378840 1.1378840
 3 "HUN 198~ 2     1     -4      348   348   -4    1280   4321 3480121280 -4    HU    HU     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0635516 1.0635516
 4 "AUS 198~ 2     1     -4       36    36   -4     973   5478  360120973 -4    AU    AU     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    0.9616138 0.9616138
 5 "ARG 198~ 2     1     -4       32    32   -4     874   6607  320120874 -4    AR    AR     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    0.9266260 0.9266260
 6 "FIN 198~ 2     1     -4      246   246   -4     385   7123 2460120385 -4    FI    FI     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000
 7 "KOR 198~ 2     1     -4      410   410   -4       3   7744 4100120003 -4    KR    KR     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000
 8 "ZAF 198~ 2     1     -4      710   710   -4    5420  10260 7100121549 -4    ZA    ZA     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000
 9 "ARG 199~ 2     2     -4       32    32   -4     856  11163  320240856 -4    AR    AR    125   -4      -4      -4      -4      1210  -4     1    -4    -4    -4    -4    1.0000000 1.0000000
10 "BLR 199~ 2     2     -4      112   112   -4     106  11415 1120240106 -4    BY    BY     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000

用NA替换所有负值,我使用了以下代码:

to replace all negative values with NA's, I used the following code:

df [ df < 0 ] <- NA

但是我只想在非以下列上执行此操作字符(我想摆脱错误消息,而又不抑制它们)。变量 charcol 包含应跳过的列的名称。我尝试过:

I however only want to have this operation carried out on columns that are not characters (I want to get rid of the error messages, without suppressing them). The variable charcol holds the names of the columns that should be skipped. I tried:

df [-charcol] df [-charcol] < 0] <- NA

哪个给了我错误:

Error: cannot allocate vector of size 1.8 Gb

除了仍然给我警告:

In addition: Warning messages:
1: In Ops.factor(left, right) : ‘<’ not meaningful for factors

因子没有意义,尽管我可能知道语法错误,我想知道对于大型数据集,此类问题最有效的解决方案是什么。我一直在查看 data.table插图一段时间,但我无法真正弄清楚语法的用法。

Although I probably got the syntax wrong, I am wondering what would be the most efficient solution for such problems for large datasets. I have been looking at the data.table vignette for a while, but I cannot really figure out how to do the syntax.

有任何建议吗?

str(WVSsample)
Classes ‘data.table’ and 'data.frame':  240 obs. of  1415 variables:
 $ matchcode  : Factor w/ 240 levels "ALB 1998 ","ALB 2002 ",..: 108 134 88 12 4 73 117 232 5 25 ...
 $ S001       :Class 'labelled'  atomic [1:240] 2 2 2 2 2 2 2 2 2 2 ...
  .. ..- attr(*, "label")= chr "Study"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
  .. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S002       :Class 'labelled'  atomic [1:240] 1 1 1 1 1 1 1 1 2 2 ...
  .. ..- attr(*, "label")= chr "Wave"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:11] -5 -4 -3 -2 -1 1 2 3 4 5 ...
  .. .. ..- attr(*, "names")= chr [1:11] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S002EVS    :Class 'labelled'  atomic [1:240] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
  .. ..- attr(*, "label")= chr "EVS-wave"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:9] -5 -4 -3 -2 -1 1 2 3 4
  .. .. ..- attr(*, "names")= chr [1:9] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S003       :Class 'labelled'  atomic [1:240] 392 484 348 36 32 246 410 710 32 112 ...
  .. ..- attr(*, "label")= chr "Country/region"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:199] -5 -4 -3 -2 -1 4 8 12 16 20 ...
  .. .. ..- attr(*, "names")= chr [1:199] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S003A      :Class 'labelled'  atomic [1:240] 392 484 348 36 32 246 410 710 32 112 ...
  .. ..- attr(*, "label")= chr "Country/regions [with split ups]"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:199] -5 -4 -3 -2 -1 4 8 12 16 20 ...
  .. .. ..- attr(*, "names")= chr [1:199] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S004       :Class 'labelled'  atomic [1:240] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
  .. ..- attr(*, "label")= chr "Set"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
  .. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...

编辑: @ chinsoon12使用以下代码段提及:

@chinsoon12 mentioned using the following piece of code:

f_dowle3 = function(DT) {
  for (j in seq_len(ncol(DT)))
    set(DT,which(is.na(DT[[j]])),j,0)
}

但是此代码没有做两件事:

This code however does not do two things:


  1. 它用零代替NA,而我想用NA代替负值。我需要更改 which(is.na(DT [[j]])部分,例如 DT [[j]])< 0

我将代码更改为:

f_dowle3 = function(DT) {
  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT,which(DT[[j]]<0),j,NA)
}

但是这会使数据集为NULL,谁能帮助我

But this makes the dataset NULL. Could anyone help me with adapting the code properly?

推荐答案

由于这是重复项,因此将很快删除,因为不能容纳注释。

Since this is a dupe, will delete shortly as cannot fit in comments.

setDT(df)
cols <- names(df)[sapply(df, is.numeric)]
for (x in cols) {
    set(df, which(df[[x]] < 0), x, NA_real_)
}

这篇关于根据类型有效地替换大型数据集中的负值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆