New dataframe column as function (digest) of another one is not working for me


Question

I want to create a new computed column (the digest of the text in another column). So that you can reproduce the problem, I created a df as a reproducible example:

df <- data.frame(name = replicate(1000, paste(sample(LETTERS, 20, replace=TRUE), collapse="")),stringsAsFactors=FALSE)

> head(df,3)
              name
1 ZKBOZVFKNJBRSDWTUEYR
2 RQPHUECABPQZLKZPTFLG
3 FTBVBEQTRLLUGUVHDKAY

Now I want a second column containing the digest of the 'name' column for each row. This works correctly but is slow (each md5 is different and is the digest of the corresponding name):

> df$md5 <- sapply(df$name, digest)   
> head(df, 3)
              name                              md5
1 ZKBOZVFKNJBRSDWTUEYR b8d93a9fe6cefb7a856e79f54bac01f2
2 RQPHUECABPQZLKZPTFLG 52f6acbd939df27e92232904ce094053
3 FTBVBEQTRLLUGUVHDKAY a401a8bc18f0cb367435b77afd353078

But this (using dplyr) does not work and I don't see why: the md5 is the same for every row! In fact it is the digest of the complete df$name, including all the rows. Can someone explain this to me?

> df <- mutate(df, md5=digest(name))
> head(df, 3)
                  name                              md5
1 ZKBOZVFKNJBRSDWTUEYR 10aa31791d0b9288e819763d9a41efd8
2 RQPHUECABPQZLKZPTFLG 10aa31791d0b9288e819763d9a41efd8
3 FTBVBEQTRLLUGUVHDKAY 10aa31791d0b9288e819763d9a41efd8

Likewise, if I go the data.table way, the standard syntax for adding a new variable does not work either:

> dt <- data.table(df)
> dt[, md5:=digest(name)]  
> head(dt,3)
                   name                              md5
1: ZKBOZVFKNJBRSDWTUEYR 10aa31791d0b9288e819763d9a41efd8
2: RQPHUECABPQZLKZPTFLG 10aa31791d0b9288e819763d9a41efd8
3: FTBVBEQTRLLUGUVHDKAY 10aa31791d0b9288e819763d9a41efd8

If I force grouping, it works again (but it is slow):

> dt[,md5:=digest(name), by=name]   
> head(dt, 3)
                   name                              md5
1: ZKBOZVFKNJBRSDWTUEYR b8d93a9fe6cefb7a856e79f54bac01f2
2: RQPHUECABPQZLKZPTFLG 52f6acbd939df27e92232904ce094053
3: FTBVBEQTRLLUGUVHDKAY a401a8bc18f0cb367435b77afd353078

I have also tested tapply and it works (creating a factor), but my real data has millions of rows and it is very slow.

So, first, can someone explain why dplyr's mutate does not use the value of each row to compute the digest, and why the same thing happens with the data.table notation (unless I group)?

And second, is there a faster way to calculate this digest for all the rows?

Recommended answer

Since you have a very large dataset, it is better to test the different approaches on a somewhat larger dataset (for this example I use 100000 rows; bigger datasets take ages on my system):

df <- data.frame(name = replicate(1e5, paste(sample(LETTERS, 20, replace=TRUE), collapse="")), stringsAsFactors=FALSE)

First, let's consider the approaches available:

library(digest)
library(data.table)
library(dplyr)

# data.table copy of df, used by the data.table approaches below
dt <- data.table(df)

# base R
df$md5 <- sapply(df$name, digest)

# data.table (grouping by name, based on the assumption that all names are unique)
dt[, md5:=digest(name), name]

# data.table with a unique identifier for each row
dt[,indx:=.I][, md5:=digest(name), indx]

# dplyr (grouping by name, based on the assumption that all names are unique)
df %>% group_by(name) %>% mutate(md5=digest(name))

# dplyr with rowwise (from the other answer)
df %>% rowwise() %>% mutate(md5=digest(name))
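
On the "why" part of the question: digest() is not vectorized. It hashes the whole object it is given and returns a single string, so an ungrouped mutate() or `:=` passes the entire name column in one call and the one resulting hash is recycled down every row. The sketch below illustrates this in base R only; the three names are taken from the question's sample output, and transform() stands in for mutate() purely to show the recycling.

```r
library(digest)

# Three names copied from the question's head(df, 3) output
x <- c("ZKBOZVFKNJBRSDWTUEYR", "RQPHUECABPQZLKZPTFLG", "FTBVBEQTRLLUGUVHDKAY")

# One call on the whole vector: a single hash of the entire object
whole <- digest(x)
length(whole)

# One call per element: one hash per name
per_element <- vapply(x, digest, character(1), USE.NAMES = FALSE)
length(per_element)

# The length-1 result is recycled to every row, which is what the
# ungrouped mutate() and data.table calls in the question show
recycled <- transform(data.frame(name = x, stringsAsFactors = FALSE),
                      md5 = digest(name))
recycled
```

Grouping by a unique key, rowwise(), and sapply() all work for the same reason: each of them ends up calling digest() once per row.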

Second, test which approach is the fastest:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          baseR = df$md5 <- sapply(df$name, digest),
          dtbl1 = dt[, md5:=digest(name), name],
          dtbl2 = dt[,indx:=.I][, md5:=digest(name), indx],
          dplyr = df %>% group_by(name) %>% mutate(md5=digest(name)),
          rowwi = df %>% rowwise() %>% mutate(md5=digest(name)))

This gives:

   test elapsed relative
2 dtbl1  77.878    1.000
3 dtbl2  78.343    1.006
1 baseR  81.399    1.045
5 rowwi 118.799    1.525
4 dplyr 129.748    1.666

So sticking to a base R solution isn't a bad choice at all: it is within about 5% of the fastest data.table variant. I suspect that what makes it slow on your real dataset is the digest function itself, not some misbehavior of a particular package or function.
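
If the digest() call itself is the bottleneck, one thing that might help is doing less work per call. The sketch below is not from the answer and has not been benchmarked here: it assumes the column holds plain strings, hashes each distinct string only once, and passes serialize = FALSE so that digest() hashes the raw string instead of the serialized R object. md5_strings is a hypothetical helper name, and the tiny data frame is illustrative. Note that serialize = FALSE produces different hash values from those shown in the question, which used the default serialize = TRUE.

```r
library(digest)

# Hypothetical helper (not part of the digest package):
# hash each distinct string once, then map the hashes back to every row.
md5_strings <- function(x) {
  ux <- unique(x)
  hashes <- vapply(ux, digest, character(1),
                   algo = "md5", serialize = FALSE, USE.NAMES = FALSE)
  hashes[match(x, ux)]
}

# Illustrative data with one duplicated value
df_small <- data.frame(name = c("abc", "hello", "abc"), stringsAsFactors = FALSE)
df_small$md5 <- md5_strings(df_small$name)
df_small
```

The deduplication step only pays off when names repeat; with all-unique names, as the grouped approaches above assume, any gain would have to come from skipping serialization, which is worth timing on a sample of the real data before relying on it.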
