String matching to data.frames of different sizes
Question
I have two data.frames of different sizes and I'm looking for the most efficient way to match strings from one data.frame to another, and extract some relevant information.
Here is an example:
Two initial data.frames, a and b, and the desired result:
a = data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
               age = c(30, 24, 52, 44, 73, 44, 33, 12),
               visits = c(5, 1, 3, 2, 8, 5, 19, 3))
b = data.frame(string = c("the red ball went over the fence",
                          "sorry to see that your tent fell down",
                          "the ball fell into the red salad",
                          "serious people eat peanuts on Sundays"))
desired_result = data.frame(string = b$string,
                            num_matches = c(2, 1, 3, 0),
                            avg_age = c(37, 73, 32.66667, NA),
                            avg_visits = c(3.5, 8, 2.66667, NA))
Here is the data:
> a
   term age visits
1   red  30      5
2 salad  24      1
3  rope  52      3
4  ball  44      2
5  tent  73      8
6 plane  44      5
7  gift  33     19
8  meat  12      3
> b
                                 string
1      the red ball went over the fence
2 sorry to see that your tent fell down
3      the ball fell into the red salad
4 serious people eat peanuts on Sundays
> desired_result
                                 string num_matches  avg_age avg_visits
1      the red ball went over the fence           2 37.00000    3.50000
2 sorry to see that your tent fell down           1 73.00000    8.00000
3      the ball fell into the red salad           3 32.66667    2.66667
4 serious people eat peanuts on Sundays           0       NA         NA
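- num_matches is the number of terms from a found in the string
- avg_age is the average age of the terms found in the string
- avg_visits is the average number of visits of the terms found in the string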
Any ideas on how to implement this in an efficient way?
Thanks.
Recommended answer
Use data.table and process each row with by = string. Save the match results in a list column, then aggregate by the match results.
Note that the matches column is a list column: each cell holds a list. You need to wrap the match results with .(), which is actually another list(), because data.table expects a list for normal columns.
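The wrapping described above can be seen in isolation with a small sketch. The table dt and the column flags are hypothetical names used only for illustration; they are not part of the answer's data.

```r
library(data.table)

# Hypothetical toy table, not part of the original answer
dt <- data.table(id = c("x", "y"))

# .() is an alias for list(): the outer list means "one column",
# the inner list() makes the whole logical vector a single cell
dt[, flags := .(list(c(TRUE, FALSE, TRUE))), by = id]

# flags is a list column; each cell holds a length-3 logical vector
str(dt$flags)
sum(dt$flags[[1]])
```

Because each cell is itself a list element, later steps need unlist() or [[ ]] to get back to the plain logical vector.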
library(data.table)
library(stringr)

a = data.table(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
               age = c(30, 24, 52, 44, 73, 44, 33, 12),
               visits = c(5, 1, 3, 2, 8, 5, 19, 3))
b = data.table(string = c("the red ball went over the fence",
                          "sorry to see that your tent fell down",
                          "the ball fell into the red salad",
                          "serious people eat peanuts on Sundays"))

# Initialise an empty list column, one cell per row
b[, matches := vector("list", .N)]
# For each string, store a logical vector flagging which terms of a it contains
b[, matches := .(list(str_detect(string, a[, term]))), by = string]
# Aggregate over the matched terms; with no match, mean() of an empty vector gives NaN
b[, num_matches := sum(unlist(matches)), by = string]
b[, avg_age := mean(a[unlist(matches), age]), by = string]
b[, avg_visits := mean(a[unlist(matches), visits]), by = string]
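For comparison, the same three quantities can be sketched in base R with a single logical match matrix instead of a list column. This is not the answer's method, only an alternative under stated assumptions: terms are matched as literal substrings (fixed = TRUE), and the names m, n and avg are introduced here for illustration.

```r
a <- data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
                age = c(30, 24, 52, 44, 73, 44, 33, 12),
                visits = c(5, 1, 3, 2, 8, 5, 19, 3),
                stringsAsFactors = FALSE)
b <- data.frame(string = c("the red ball went over the fence",
                           "sorry to see that your tent fell down",
                           "the ball fell into the red salad",
                           "serious people eat peanuts on Sundays"),
                stringsAsFactors = FALSE)

# Logical matrix: one row per string, one column per term.
# Assumption: literal substring matching, so "red" would also match "hundred".
m <- vapply(a$term,
            function(term) grepl(term, b$string, fixed = TRUE),
            logical(nrow(b)))
m <- matrix(m, nrow = nrow(b))  # keep a matrix even if b has a single row

# Number of matched terms per string
n <- rowSums(m)

# Average of a numeric column over the matched terms, NA when nothing matched
avg <- function(values) {
  totals <- as.vector((m + 0) %*% values)
  ifelse(n > 0, totals / n, NA_real_)
}

result <- data.frame(string = b$string,
                     num_matches = n,
                     avg_age = avg(a$age),
                     avg_visits = avg(a$visits))
print(result)
```

Unlike the data.table version, this sketch returns NA rather than NaN for strings with no matching term, which is closer to desired_result in the question.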