R-多条件搜索 [英] R - multiple criteria search

查看:74
本文介绍了R-多条件搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我遇到以下问题,希望您能对我有所帮助:
我有一个庞大的数据库(我无法透露),但是它的结构如下:

I have the following issue and I hope you may help me: I have a huge database (which I cannot disclose) but is it is structured as follows:


  • 500万个观测值

  • 在这种情况下,其中三个相关变量为7个:

  • 代码ID购买

  • 代码ID出售

  • 日期

  • 5 million observations
  • 7 variables of which three of interest in this case:
  • Code ID Buy
  • Code ID Sell
  • Date

I想要另一个名为 new 的变量,该变量在 i 0 c $ c>如果:

I would like another variable called new, which takes the value 0 in line i if:


  • 存在观测值 k ,且观测值 Code_IB_Buy_ [i] = Code_IB_Buy_ [k] Code_IB_Sell_ [i] = Code_IB_Sell_ [k] Date [i] Date [k]
    之后,如果没有,我想 new [i] = 1

  • there exists an observation k, with Code_IB_Buy_[i]=Code_IB_Buy_[k] and Code_IB_Sell_[i]=Code_IB_Sell_[k] and Date[i] is after Date[k] if not, I would like new[i]=1.

基本上,如果买卖双方之前没有交易, i 行中的第一个匹配项,因此变量 new 取值 1

Basically if there was no transaction before between the buyer and the seller, this is the first occurrence in line i so the variable new takes the value 1.

到目前为止,我一直在尝试按日期和买方对数据库(作为data.table)进行排序。然后,我使用一个For Loop,它可以在其他语言中使用,但是在这里由于某种原因它不能使用。

What I have tried until now, is to sort the database (as a data.table) by date and by buyer. Then I use a For Loop, which would work in other languages but here it does not for some reason.

编辑:精简的样本如下所示:

A reduced sample would look like this:

    library(data.table)
set.seed(1)
Data <- data.frame(
  Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5),
  Amount <- rnorm(21,mean=100,sd=20),
  Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
  Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V","98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A"),
  Distancewithlastr1 = c(NA,NA,NA,NA,NA,NA,NA,1,1,1,2,2,2,2,1,1,1,1,0,1,2),
LastRelationshipseller = c("98C","98C","98C","98C","98C","98C","98C","96V","96V","96V","96V","96V","96V","96V","94D","94D","94D","94D",NA,"25A","25A")
    )

Data$new<-0

setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
  r <- rowid(Code_ID_Buy, Code_ID_Sell)
  +(r==1L)
}]

Data[Month==1L, new:=0L]

    View(Data)

自从我使用以来,数据已经被排序了(并且可以正常工作):
Data <-Data [order(month),]
Data <-Data [order(Code_ID_Buy),]

The data is already ordered since I used (and it works fine) : Data<-Data[order(month),] Data<-Data[order(Code_ID_Buy),]

但是,我现在想做的是引入一个新列 new,当且仅当R找不到ID_Buy和ID_Sell相同的观察值时,该列的值为1给定日期(新客户关系)

However, what I would like to do now is to introduce the new column "new" that takes value 1 if and if only R does not find any observation where ID_Buy and ID_Sell are the same before the given date (new customer relationship)

在上面的示例中,第7、14和19行应标记为1,但其他都不标记。

In the example above, line 7,14 and 19 should be marked with 1 but none of the others.

在第二步中,我想计算第一和第二个关系之间的月数(列距离,其取值为月的新关系-最近一次新关系的月份,否则位于第14行,它将取值为 2,5-3,但这已经超越了我自己。)

In a second step, I would like to calculate the number of months between the first and second relationship (a column "distance", which would take the value "month of new relationship"-"month of last new relationship" else, that is in line 14, it would take the value "2",5-3, but this is getting ahead of myself).

这是足够的信息吗?

编辑:@ chinsoon12非常感谢,您真的很好!这似乎像那样工作(当new = 0时我实际上不想要NA,所以我通过删除new == 1L来更改了公式,但它不起作用,因为new = 0的所有行的距离都是1,因为它计算时间不是最后一个新的== 1L,而是最后一个新的== 0L)

EDIT : @chinsoon12 thank you a lot, you are really good ! this seem to work like that(I actually didn't want NA when new = 0, so I changed your formula by removing new==1L and it does not work because the distance for all lines with new=0 is 1 because it calculates the time not with the last new == 1L but the distance with the last new==0L)

但是我还有一个小问题:我用

I have also however a small question : I used

Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
    by=.EACHI, i.Month - x.Month]$V1], but is it possible to know which was the Code_ID_Sell for the line which serves as basis (in line "x" from what I understand)  for each i?


推荐答案

一个选项是按月份, Code_ID_Buy Code_ID_Sell ,然后在每组 Code_ID_Buy Code_ID_Sell 到1:

An option is to order by Month, Code_ID_Buy and Code_ID_Sell and then set the first row in each group of Code_ID_Buy and Code_ID_Sell to 1:

library(data.table)
setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), 
    new := +(rowid(Code_ID_Buy, Code_ID_Sell) == 1L)]

您的需求基本上转化为找到Code_ID_Buy和Code_ID_Sell的第一个组合,这是使用<$ c的另一种选择$ c>重复:

Your requirement basically translates into finding the first combi of Code_ID_Buy and Code_ID_Sell, here is another option using duplicated:

setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell),
    new := !duplicated(.SD), .SDcols=c("Code_ID_Buy", "Code_ID_Sell")]

输出:

    Month    Amount Code_ID_Buy Code_ID_Sell new
 1:     1  87.47092        100D          98C   1
 2:     1 103.67287        100D          99C   1
 3:     2  83.28743        100D          98C   0
 4:     2 131.90562        100D          99C   0
 5:     3 106.59016        100D          98C   0
 6:     3  83.59063        100D          99C   0
 7:     3 109.74858        100D          96V   1
 8:     4 114.76649        100D          98C   0
 9:     4 111.51563        100D          99C   0
10:     4  93.89223        100D          96V   0
11:     5 130.23562        100D          98C   0
12:     5 107.79686        100D          99C   0
13:     5  87.57519        100D          96V   0
14:     5  55.70600        100D          94D   1
15:     6 122.49862        100D          98C   0
16:     6  99.10133        100D          99C   0
17:     6  99.67619        100D          96V   0
18:     6 118.87672        100D          94D   0
19:     3 116.42442        102D          25A   1
20:     4 111.87803        102D          25A   0
21:     5 118.37955        102D          25A   0
    Month    Amount Code_ID_Buy Code_ID_Sell new

此解决方案的前几行有1个同样,因为在它们之前没有行具有相同的Code_ID_Buy或Code_ID_Sell。如果有必要删除它们,则可以使用 Data [Month == 1L,new:= 0L]

This solution has 1 for the first few rows as well since there are no rows prior to them with the same Code_ID_Buy or Code_ID_Sell. If it is necessary to remove them, you can use Data[Month==1L, new := 0L]

对于第二个问题,您可以使用非等号联接在同一Code_ID_Buy上找到Month在当前一个月之前和new = 1L之前的行:

For the 2nd question, you can use a non-equi join to find rows where Month is before current one and new=1L for the same Code_ID_Buy:

Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
    by=.EACHI, i.Month - x.Month]$V1]

输出:

    Month    Amount Code_ID_Buy Code_ID_Sell new distance
 1:     1  87.47092        100D          98C   1       NA
 2:     1 103.67287        100D          99C   1       NA
 3:     2  83.28743        100D          98C   0       NA
 4:     2 131.90562        100D          99C   0       NA
 5:     3 106.59016        100D          98C   0       NA
 6:     3  83.59063        100D          99C   0       NA
 7:     3 109.74858        100D          96V   1        2
 8:     4 114.76649        100D          98C   0       NA
 9:     4 111.51563        100D          99C   0       NA
10:     4  93.89223        100D          96V   0       NA
11:     5 130.23562        100D          98C   0       NA
12:     5 107.79686        100D          99C   0       NA
13:     5  87.57519        100D          96V   0       NA
14:     5  55.70600        100D          94D   1        2
15:     6 122.49862        100D          98C   0       NA
16:     6  99.10133        100D          99C   0       NA
17:     6  99.67619        100D          96V   0       NA
18:     6 118.87672        100D          94D   0       NA
19:     3 116.42442        102D          25A   1       NA
20:     4 111.87803        102D          25A   0       NA
21:     5 118.37955        102D          25A   0       NA
    Month    Amount Code_ID_Buy Code_ID_Sell new distance






对于您的第三个qn,您可以尝试:


for your 3rd qn, you can try:

Data[, dlr := if (k>0L) rleid(Month) - 1L, .(Code_ID_Buy, k=cumsum(new))] 

输出:

    Month    Amount Code_ID_Buy Code_ID_Sell Distancewithlastr1 LastRelationshipseller new distance dlr
 1:     1  87.47092        100D          98C                 NA                    98C   0       NA  NA
 2:     1 103.67287        100D          99C                 NA                    98C   0       NA  NA
 3:     2  83.28743        100D          98C                 NA                    98C   0       NA  NA
 4:     2 131.90562        100D          99C                 NA                    98C   0       NA  NA
 5:     3 106.59016        100D          98C                 NA                    98C   0       NA  NA
 6:     3  83.59063        100D          99C                 NA                    98C   0       NA  NA
 7:     3 109.74858        100D          96V                  0                    98C   1       NA   0
 8:     4 114.76649        100D          98C                  1                    96V   0       NA   1
 9:     4 111.51563        100D          99C                  1                    96V   0       NA   1
10:     4  93.89223        100D          96V                  1                    96V   0       NA   1
11:     5 130.23562        100D          98C                  2                    96V   0       NA   2
12:     5 107.79686        100D          99C                  2                    96V   0       NA   2
13:     5  87.57519        100D          96V                  2                    96V   0       NA   2
14:     5  55.70600        100D          94D                  2                    96V   1        2   0
15:     6 122.49862        100D          98C                  1                    94D   0       NA   1
16:     6  99.10133        100D          99C                  1                    94D   0       NA   1
17:     6  99.67619        100D          96V                  1                    94D   0       NA   1
18:     6 118.87672        100D          94D                  1                    94D   0       NA   1
19:     3 116.42442        102D          25A                  0                   <NA>   1       NA   0
20:     4 111.87803        102D          25A                  1                    25A   0       NA   1
21:     5 118.37955        102D          25A                  2                    25A   0       NA   2
    Month    Amount Code_ID_Buy Code_ID_Sell Distancewithlastr1 LastRelationshipseller new distance dlr

ps:欢迎来到SO。只是让您知道,发布一个新的问题而不是在先前的问题上添加一个新的问题可能是不错的选择,此外,您还可以让其他人也回答。

这篇关于R-多条件搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆