String matching to data.frames of different sizes

Views: 61 | Published: 2020/10/17 0:29:54 | Tags: r, string, dataframe


Question

I have two data.frames of different sizes, and I'm looking for the most efficient way to match strings from one data.frame against the other and extract some related information.

Here is an example with two initial data.frames, a and b, and the desired result:

a = data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
               age = c(30, 24, 52, 44, 73, 44, 33, 12),
               visits = c(5, 1, 3, 2, 8, 5, 19, 3))

b = data.frame(string = c("the red ball went over the fence",
                          "sorry to see that your tent fell down",
                          "the ball fell into the red salad",
                          "serious people eat peanuts on Sundays"))

desired_result = data.frame(string = b$string,
                            num_matches = c(2, 1, 3, 0),
                            avg_age = c(37, 73, 32.66667, NA),
                            avg_visits = c(3.5, 8, 2.66667, NA))

Here is the data:

> a
   term age visits
1   red  30      5
2 salad  24      1
3  rope  52      3
4  ball  44      2
5  tent  73      8
6 plane  44      5
7  gift  33     19
8  meat  12      3

> b
                                 string
1      the red ball went over the fence
2 sorry to see that your tent fell down
3      the ball fell into the red salad
4 serious people eat peanuts on Sundays

> desired_result
                                 string num_matches  avg_age avg_visits
1      the red ball went over the fence           2 37.00000    3.50000
2 sorry to see that your tent fell down           1 73.00000    8.00000
3      the ball fell into the red salad           3 32.66667    2.66667
4 serious people eat peanuts on Sundays           0       NA         NA

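num_matches is the number of terms from a that appear in the string.

avg_age is the mean age of the terms found in the string.

avg_visits is the mean number of visits of the terms found in the string.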
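As a worked check of these definitions, the values for the third string can be reproduced in base R. This is only a sketch: it assumes that a "match" means the term occurs anywhere in the string as a plain substring, and the names s and hit are introduced here for illustration.

```r
a <- data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
                age = c(30, 24, 52, 44, 73, 44, 33, 12),
                visits = c(5, 1, 3, 2, 8, 5, 19, 3),
                stringsAsFactors = FALSE)

# Third string from b in the question
s <- "the ball fell into the red salad"

# hit[j] is TRUE when term j occurs in s as a literal substring (assumption)
hit <- vapply(a$term, function(t) grepl(t, s, fixed = TRUE), logical(1))

num_matches <- sum(hit)            # red, salad and ball are found
avg_age     <- mean(a$age[hit])    # (30 + 24 + 44) / 3
avg_visits  <- mean(a$visits[hit]) # (5 + 1 + 2) / 3
```

These agree with row 3 of desired_result (3, 32.66667, 2.66667).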

Any ideas on how to implement this in an efficient way?

Thanks.

Recommended answer

Use data.table and process each row with by = string. Save the match results in a list column, then aggregate from those match results.

Note that matches is a list column: each cell holds the logical vector returned by str_detect for that string. The match result has to be wrapped as .(list(...)). The outer .() is an alias for list(), which data.table expects as the list of columns on the right-hand side; the inner list() makes the whole vector a single cell.
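A minimal sketch of that wrapping on its own, using a hypothetical two-row table (dt, id, flags and n_true are illustrative names, not part of the answer). It writes list() where the answer writes .(), on the assumption that the two are interchangeable here.

```r
library(data.table)

# Hypothetical table, used only to illustrate list columns
dt <- data.table(id = c("x", "y"))

# Outer list(): the list of columns data.table expects on the right-hand side.
# Inner list(): makes the single cell of each group hold a whole vector.
dt[, flags := list(list(c(TRUE, FALSE, TRUE))), by = id]

is.list(dt$flags)   # the column itself is a list
dt$flags[[1]]       # one cell: a logical vector of length 3

# unlist() recovers the vector inside a group, so it can be aggregated
dt[, n_true := sum(unlist(flags)), by = id]
```

The answer's code below follows the same pattern, with str_detect supplying the logical vector for each string.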

library(data.table)
library(stringr)
a = data.table(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
  age = c(30, 24, 52, 44, 73, 44, 33, 12),
  visits = c(5, 1, 3, 2, 8, 5, 19, 3))
b = data.table(string = c("the red ball went over the fence",
  "sorry to see that your tent fell down",
  "the ball fell into the red salad",
  "serious people eat peanuts on Sundays"))

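# Pre-allocate an empty list column with one cell per row
b[, matches := vector("list", .N)]
# For each string, store a logical vector marking which terms of a occur in it
b[, matches := .(list(str_detect(string, a[, term]))), by = string]
# Count the matched terms
b[, num_matches := sum(unlist(matches)), by = string]
# Average age and visits over the matched rows of a
# (a string with no matches yields NaN here, not NA)
b[, avg_age := mean(a[unlist(matches), age]), by = string]
b[, avg_visits := mean(a[unlist(matches), visits]), by = string]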
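
For comparison, here is a sketch of the same computation in base R, without data.table or stringr. It is not part of the answer. It assumes literal substring matching via grepl(fixed = TRUE) and that b has more than one row, so vapply returns a matrix. The names hits, avg_of and result are introduced here for illustration. Rows with no match are set to NA explicitly, as in desired_result.

```r
a <- data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
                age = c(30, 24, 52, 44, 73, 44, 33, 12),
                visits = c(5, 1, 3, 2, 8, 5, 19, 3),
                stringsAsFactors = FALSE)
b <- data.frame(string = c("the red ball went over the fence",
                           "sorry to see that your tent fell down",
                           "the ball fell into the red salad",
                           "serious people eat peanuts on Sundays"),
                stringsAsFactors = FALSE)

# hits[i, j] is TRUE when term j occurs in string i (one column per term)
hits <- vapply(a$term,
               function(t) grepl(t, b$string, fixed = TRUE),
               logical(nrow(b)))

# Mean of `values` over the matched terms of each string; NA when none match
avg_of <- function(values) {
  apply(hits, 1, function(h) if (any(h)) mean(values[h]) else NA_real_)
}

result <- data.frame(string = b$string,
                     num_matches = rowSums(hits),
                     avg_age = avg_of(a$age),
                     avg_visits = avg_of(a$visits),
                     stringsAsFactors = FALSE)
print(result)
```

Because the match matrix is built once and reused for every summary column, adding another average is a single extra call to avg_of.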
