Vectorize data.table like, grepl, or similar for big data string comparison

Problem description

I need to check if a string in one column contains a corresponding (numeric) value from the same row of another column, for all rows.

If I were only checking the string for a single pattern this would be straightforward using data.table's like or grepl. However, my pattern value is different for every row.

There's a somewhat related question here, but unlike that question I need to create a logical flag indicating if the pattern was present.

Let's say this is my dataset:

library(data.table)

DT <- structure(list(category = c("administration", "nurse practitioner",
                                  "trucking", "administration", "warehousing",
                                  "warehousing", "trucking",
                                  "nurse practitioner", "nurse practitioner"),
                     industry = c("admin", "truck", "truck", "admin", "nurse",
                                  "admin", "truck", "nurse", "truck")),
                .Names = c("category", "industry"), class = "data.frame",
                row.names = c(NA, -9L))
setDT(DT)
> DT
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck

My desired result would be a vector like this:

> DT
   matches
1: TRUE
2: FALSE
3: TRUE
4: TRUE
5: FALSE
6: FALSE
7: TRUE
8: TRUE
9: FALSE

Of course, 1's and 0's would be just as good as TRUE and FALSE.

Here are some things I tried that didn't work:

> apply(DT,1,grepl, pattern = DT[,2], x = DT[,1])
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> apply(DT,1,grepl, pattern = DT[,1], x = DT[,2])
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> grepl(DT[,2], DT[,1])
[1] FALSE

> DT[Vectorize(grepl)(industry, category, fixed = TRUE)]
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

> DT[stringi::stri_detect_fixed(category, industry)]
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

> for(i in 1:nrow(DT)){print(grepl(DT[i,2], DT[i,1]))}
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE

> for(i in 1:nrow(DT)){print(grepl(DT[i,2], DT[i,1], fixed = T))}
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE

> DT[category %like% industry]
         category industry
1: administration    admin
2: administration    admin
Warning message:
In grepl(pattern, vector) :
  argument 'pattern' has length > 1 and only the first element will be used

Solution

The OP's code omits the leading comma, so data.table evaluates the expression in i and returns the subset of rows for which it is TRUE.

If we add the comma, the expression is evaluated in j instead, and we get the logical vector as the result:

DT[, stringi::stri_detect_fixed(category, industry)]
#[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
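
The difference between the two call forms can be seen side by side in a minimal sketch. The three-row table below is a hypothetical stand-in for the question's DT, and the variable names rows and flags are made up for illustration; it assumes the data.table and stringi packages are installed.

```r
library(data.table)

# Small stand-in table (hypothetical data, same shape as the question's DT).
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking"),
  industry = c("admin", "truck", "truck")
)

# Without the comma the logical vector lands in i, so rows are filtered.
rows <- DT[stringi::stri_detect_fixed(category, industry)]

# With the leading comma the expression is evaluated in j and returned as is.
flags <- DT[, stringi::stri_detect_fixed(category, industry)]

print(nrow(rows))
print(flags)
```

Both calls compute the same row-wise comparison; only the position inside the brackets decides whether it is used as a filter or returned as a value.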

If we wrap it in a list, we get a data.table with a single column named match:

DT[, list(match = stringi::stri_detect_fixed(category, industry))]
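
If the flag should live next to the original columns rather than in a new table, one option is to assign it with data.table's := operator. The sketch below is not part of the original answer: the four-row table and the column names matches, matches_base and matches_int are assumptions chosen for illustration, and the mapply line is only a base R fallback for when stringi is not available.

```r
library(data.table)

# Hypothetical sample rows in the same shape as the question's DT.
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking", "warehousing"),
  industry = c("admin", "truck", "truck", "nurse")
)

# := adds the column by reference, without copying the table.
DT[, matches := stringi::stri_detect_fixed(category, industry)]

# Base R fallback: mapply calls grepl once per (pattern, string) pair.
DT[, matches_base := mapply(grepl, industry, category,
                            MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)]

# as.integer() gives the 1/0 form mentioned in the question.
DT[, matches_int := as.integer(matches)]

print(DT)
```

The stringi call is vectorised over both arguments in compiled code, whereas mapply loops in R, so the former is the likelier choice for large tables.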
