Efficiently checking value of other row in data.table

Question

Note: This is a question that I originally posted to the data.table help group. Matt Dowle asked for a more detailed example and I posted this one, but I had trouble with the formatting in e-mail. I already know how to format on SO, so I thought I would post it here instead.

What I am basically trying to do is subset rows from a data.table based on a value in that row as well as a value in a preceding or following row. Right now, I am creating new columns for future and past rows and then keying the data.table on these columns, but this is resource-intensive and onerous.

The example below illustrates the approach I am using now. It uses words in documents (I use numeric indices for both). I want to subset for a particular word, but only if it is preceded or followed by another word or set of words:

I first create a dummy dataset with ten documents containing one million words. There are three unique words in the set.

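library(data.table)
set.seed(1000)
# one million words drawn from three word indices, spread over ten documents
DT<-data.table(wordindex=sample(1:3,1000000,replace=TRUE),docindex=sample(1:10,1000000,replace=TRUE))
setkey(DT,docindex)
# running position of each word within its document
DT[,position:=seq_len(.N),by=docindex]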


          wordindex docindex position
      1:         1        1        1
      2:         1        1        2
      3:         3        1        3
      4:         3        1        4
      5:         1        1        5
    ---                            
 999996:         2       10    99811
 999997:         2       10    99812
 999998:         3       10    99813
 999999:         1       10    99814
1000000:         3       10    99815

Note that simply counting the occurrences of the first unique word across all documents is easy and beautiful.

setkey(DT,wordindex)
count<-DT[J(1),list(count.1=.N),by=docindex]
count

    docindex count.1
 1:        1   33533
 2:        2   33067
 3:        3   33538
 4:        4   33053
 5:        5   33231
 6:        6   33002
 7:        7   33369
 8:        8   33353
 9:        9   33485
10:       10   33225

It gets messier when taking the position ahead into account. This is a query to count the occurrences of the first unique word across all documents unless it is followed by the second unique word. First I create a new column containing the word one position ahead and then key on both words.

setkey(DT,docindex,position)
DT[,lead_wordindex:=DT[list(docindex,position+1)][,wordindex]]

         wordindex docindex position lead_wordindex
      1:         1        1        1              1
      2:         1        1        2              3
      3:         3        1        3              3
      4:         3        1        4              1
      5:         1        1        5              2
     ---                                           
 999996:         2       10    99811              2
 999997:         2       10    99812              3
 999998:         3       10    99813              1
 999999:         1       10    99814              3
1000000:         3       10    99815             NA

setkey(DT,wordindex,lead_wordindex)
countr2<-DT[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
countr2

    docindex count.1
 1:        1   22301
 2:        2   21835
 3:        3   22490
 4:        4   21830
 5:        5   22218
 6:        6   21914
 7:        7   22370
 8:        8   22265
 9:        9   22211
10:       10   22190
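
For comparison, here is a minimal sketch of how the same "word 1 followed by word 1 or 3" count could be done without the self-join. It assumes data.table 1.9.6 or later, where `shift()` is available; the toy table `toy` and the column name `lead_word` are made up for illustration and are not part of the original question.

```r
library(data.table)

# Toy data (hypothetical): two documents, words coded as integers,
# rows already in reading order within each document.
toy <- data.table(
  docindex  = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  wordindex = c(1L, 2L, 1L, 3L, 1L, 1L, 3L)
)

# Word one position ahead, computed per document.
# The last word of each document has no successor and gets NA.
toy[, lead_word := shift(wordindex, n = 1L, type = "lead"), by = docindex]

# Count word 1 when the next word is 1 or 3
# (so not word 2, and not the end of a document).
res <- toy[wordindex == 1L & lead_word %in% c(1L, 3L), .N, by = docindex]
print(res)
```

This still adds a full-length column, so it does not by itself address the memory concern raised below; it only replaces the self-join used to build the column.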

I have a very large dataset for which the above query fails with a memory allocation error. As an alternative, we can create this new column for only the relevant subset of data by filtering the original dataset and then joining it back on the desired position:

setkey(DT,wordindex)
filter<-DT[J(1),list(wordindex,docindex,position)]
filter[,lead_position:=position+1]

        wordindex wordindex docindex position lead_position
     1:         1         1        2    99717         99718
     2:         1         1        3    99807         99808
     3:         1         1        4   100243        100244
     4:         1         1        1        1             2
     5:         1         1        1       42            43
    ---                                                    
332852:         1         1       10    99785         99786
332853:         1         1       10    99787         99788
332854:         1         1       10    99798         99799
332855:         1         1       10    99804         99805
332856:         1         1       10    99814         99815

setkey(DT,docindex,position)
filter[,lead_wordindex:=DT[J(filter[,list(docindex,lead_position)])][,wordindex]]

        wordindex wordindex docindex position lead_position lead_wordindex
     1:         1         1        2    99717         99718             NA
     2:         1         1        3    99807         99808             NA
     3:         1         1        4   100243        100244             NA
     4:         1         1        1        1             2              1
     5:         1         1        1       42            43              1
    ---                                                                   
332852:         1         1       10    99785         99786              3
332853:         1         1       10    99787         99788              3
332854:         1         1       10    99798         99799              3
332855:         1         1       10    99804         99805              3
332856:         1         1       10    99814         99815              3

setkey(filter,wordindex,lead_wordindex)
countr2.1<-filter[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
countr2.1

    docindex count.1
 1:        1   22301
 2:        2   21835
 3:        3   22490
 4:        4   21830
 5:        5   22218
 6:        6   21914
 7:        7   22370
 8:        8   22265
 9:        9   22211
10:       10   22190

Pretty ugly, I think. In addition, I may want to look more than one word ahead, necessitating the creation of yet another column. The easy but costly way is:

setkey(DT,docindex,position)
DT[,lead_lead_wordindex:=DT[list(docindex,position+2)][,wordindex]]

         wordindex docindex position lead_wordindex lead_lead_wordindex
      1:         1        1        1              1                   3
      2:         1        1        2              3                   3
      3:         3        1        3              3                   1
      4:         3        1        4              1                   2
      5:         1        1        5              2                   3
     ---                                                               
 999996:         2       10    99811              2                   3
 999997:         2       10    99812              3                   1
 999998:         3       10    99813              1                   3
 999999:         1       10    99814              3                  NA
1000000:         3       10    99815             NA                  NA

setkey(DT,wordindex,lead_wordindex,lead_lead_wordindex)
countr23<-DT[J(1,2,3),list(count.1=.N),by=docindex]
countr23

    docindex count.1
 1:        1    3684
 2:        2    3746
 3:        3    3717
 4:        4    3727
 5:        5    3700
 6:        6    3779
 7:        7    3702
 8:        8    3756
 9:        9    3702
10:       10    3744
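
As a rough sketch of the same two-positions-ahead count on toy data: in data.table 1.9.6 or later, `shift()` accepts a vector of offsets and returns one column per offset, so both look-ahead columns can be built in one call. The table `toy2` and the column names `lead1` and `lead2` are hypothetical.

```r
library(data.table)

# Toy data (hypothetical): document 1 has five words, document 2 has six.
toy2 <- data.table(
  docindex  = c(rep(1L, 5), rep(2L, 6)),
  wordindex = c(1L, 2L, 3L, 1L, 2L,
                1L, 2L, 3L, 1L, 2L, 3L)
)

# shift() with n = 1:2 returns a list of two vectors:
# the word one position ahead and the word two positions ahead.
toy2[, c("lead1", "lead2") := shift(wordindex, n = 1:2, type = "lead"),
     by = docindex]

# Count occurrences of the sequence 1, 2, 3 per document.
# Rows where a lead is NA (end of document) are dropped by data.table.
res2 <- toy2[wordindex == 1L & lead1 == 2L & lead2 == 3L, .N, by = docindex]
print(res2)
```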

However, I currently have to use the ugly filter-and-join way because of size.

So the question is, is there an easier and more beautiful way?

UPDATE

Thanks to Arun and eddi for clean and simple code that solves the problem. On my ~200M row data, this solution works on a simple combination of words in about 10 seconds, which is quite good!

I do have an added issue, however, that makes the vector scan approach less than optimal. Although in the example I am looking for only one word combination, in practice I may have a vector of words to look for in each position. When I change the "==" statements to "%in%" for that purpose (vectors of 100 words or more), the query takes much longer to run. So I would still be interested in a binary search solution if one exists. However, if Arun doesn't know of one, it might not, and I would happily accept his answer.

Recommended answer

Here's another idea that just sprang to mind. It requires creating just one more column and uses binary search for the subset.

On the DT you've generated from your data, first we'll add the extra column:

# the extra column:
DT[, I := .I]

We need this because we'll setkey on docindex and wordindex, which reorders the rows. This is the only way I could think of to subset without creating extra columns. So, we'll need a way to extract the "original" positions later to check for your condition (hence the I).

After adding the extra column, let's set the key on the two columns mentioned above:

setkey(DT, docindex, wordindex)

Great! The idea from here is to extract the positions where your desired word matches - here that value is 1L. Then, extract all the other words you may (or may not) want to come after this word at the right position. Then, we simply keep (or remove) those indices that satisfy the condition.

Here's a function that'll take care of this. It is by no means complete, but should give you an idea.

foo <- function(DT, doc_key, word_key, rest_key=NULL, match=FALSE) {
    ## note that I'm using 1.9.3, where this results in a vector
    ## if you're using 1.9.2, you'll have to change the joins accordingly
    idx1 = DT[J(doc_key, word_key), I]
    for (i in seq_along(rest_key)) {
        this_key = rest_key[i]
        idx2 = DT[J(doc_key, this_key), I]
        if (match) idx1 = idx1[which((idx1+i) %in% idx2)]
        else idx1 = idx1[which(!(idx1+i) %in% idx2)]
    }
    DT[idx1, .N, by=c(key(DT)[1L])]
}

Here, DT is the data.table to which the I column has been added, and on which setkey has then been called with the two columns mentioned before.

doc_key contains all the unique values in docindex, here 1:10. word_key is the word to look for, here 1L. rest_key holds the values to check for: its ith element is tested against the word at the ith position after word_key, and match decides whether it must occur there (TRUE) or must not occur there (FALSE).

First we extract I for all matches of 1L into idx1 (straightforward). Next, we loop through each value of rest_key, extract its positions into idx2, and check whether idx1+i occurs in idx2. Based on whether you'd like to extract matching or non-matching entries, we keep (or remove) those entries of idx1.
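
To make the index arithmetic concrete, here is a small base R walk-through of a single pass of that loop (i = 1) on one made-up sequence of eight words. The vector `words` and the names `keep_match` and `keep_nomatch` are illustrative only; in foo the positions come from the I column rather than from `which()`.

```r
# Hypothetical word sequence; the element position plays the role of I.
words <- c(1L, 2L, 3L, 1L, 3L, 1L, 2L, 1L)

idx1 <- which(words == 1L)  # positions of the target word: 1 4 6 8
idx2 <- which(words == 2L)  # positions of the word to test for: 2 7

# idx1 + 1 is the position right after each target word: 2 5 7 9
hit <- (idx1 + 1L) %in% idx2

keep_match   <- idx1[hit]    # match = TRUE: target followed by word 2
keep_nomatch <- idx1[!hit]   # match = FALSE: target not followed by word 2

print(keep_match)
print(keep_nomatch)
```

Note that position 8 is kept under match = FALSE because no row follows it. In foo, I numbers rows across all documents, so idx1+i for a word at the end of one document points at the start of the next document; whether that matters depends on your data.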

And at the end of this loop, idx1 should have only the desired entries. Hope this helps. Shown below is a demonstration of the cases already discussed in the other answer.

Let's consider your first scenario. Count of all entries, for each group in docindex where ith position is 1L and i+1th is not 2L. This is basically:

system.time(ans1 <- foo(DT, 1:10, 1L, 2L, FALSE))

#  user  system elapsed 
# 0.066   0.019   0.085 

# old method took 0.12 seconds

#     docindex     N
#  1:        1 22301
#  2:        2 21836
#  3:        3 22491
#  4:        4 21831
#  5:        5 22218
#  6:        6 21914
#  7:        7 22370
#  8:        8 22265
#  9:        9 22211
# 10:       10 22190

What about the second scenario? Here, we'd like the i+1th and i+2th positions to be 2L and 3L, as opposed to the not-equal scenario in the earlier case. So, we set match=TRUE here.

system.time(ans2 <- foo(DT, 1:10, 1L, 2:3,TRUE))
#  user  system elapsed 
# 0.080   0.011   0.090 

# old method took 0.22 seconds

#     docindex    N
#  1:        1 3684
#  2:        2 3746
#  3:        3 3717
#  4:        4 3727
#  5:        5 3700
#  6:        6 3779
#  7:        7 3702
#  8:        8 3756
#  9:        9 3702
# 10:       10 3744

It's easy to expand this function. For example, if you'd like the i+1th word to be equal to 2L but the i+2th not equal to 3L, you can change match to be a logical vector of the same length as rest_key, specifying the corresponding logical values.
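
One possible way to sketch that extension is shown below. The name `foo_vec` is hypothetical, and `nomatch = 0L` is an addition that is not in the original function; it only drops documents in which a word never occurs. The sketch assumes, as above, that I was assigned while the rows were ordered by docindex and then by position.

```r
library(data.table)

# Hypothetical variant of foo(): `match` may be a logical vector with
# one entry per element of rest_key (a single value is recycled).
foo_vec <- function(DT, doc_key, word_key, rest_key = NULL, match = FALSE) {
  match <- rep_len(match, length(rest_key))
  idx1 <- DT[J(doc_key, word_key), I, nomatch = 0L]
  for (i in seq_along(rest_key)) {
    idx2 <- DT[J(doc_key, rest_key[i]), I, nomatch = 0L]
    hit <- (idx1 + i) %in% idx2
    # keep hits where a match is required, non-hits where it is forbidden
    idx1 <- idx1[if (match[i]) hit else !hit]
  }
  # As in foo(): rows only moved within their docindex block when the key
  # was set, so row idx1 still belongs to the right document.
  DT[idx1, .N, by = c(key(DT)[1L])]
}

# Toy data (hypothetical), in reading order within each document.
toy <- data.table(
  docindex  = rep(1:2, each = 5L),
  wordindex = c(1L, 2L, 3L, 1L, 2L,
                1L, 2L, 1L, 2L, 3L)
)
toy[, I := .I]
setkey(toy, docindex, wordindex)

# Word 1, followed by 2, and then NOT followed by 3.
res <- foo_vec(toy, 1:2, 1L, 2:3, c(TRUE, FALSE))
print(res)
```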

I hope this is fast for your actual case - at least faster than the other version.

HTH
