String matching to data.frames of different sizes


Question

I have two data.frames of different sizes and I'm looking for the most efficient way to match strings from one data.frame to another, and extract some relevant information.

Here is an example:

Two initial data.frames, a and b, and the desired result:

a = data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
               age = c(30, 24, 52, 44, 73, 44, 33, 12),
               visits = c(5, 1, 3, 2, 8, 5, 19, 3))

b = data.frame(string = c("the red ball went over the fence",
                          "sorry to see that your tent fell down",
                          "the ball fell into the red salad",
                          "serious people eat peanuts on Sundays"))

desired_result = data.frame(string = b$string,
                            num_matches = c(2, 1, 3, 0),
                            avg_age = c(37, 73, 32.66667, NA),
                            avg_visits = c(3.5, 8, 2.66667, NA))

Here is the data as printed:

> a
   term age visits
1   red  30      5
2 salad  24      1
3  rope  52      3
4  ball  44      2
5  tent  73      8
6 plane  44      5
7  gift  33     19
8  meat  12      3

> b
                                 string
1      the red ball went over the fence
2 sorry to see that your tent fell down
3      the ball fell into the red salad
4 serious people eat peanuts on Sundays

> desired_result
                                 string num_matches  avg_age avg_visits
1      the red ball went over the fence           2 37.00000    3.50000
2 sorry to see that your tent fell down           1 73.00000    8.00000
3      the ball fell into the red salad           3 32.66667    2.66667
4 serious people eat peanuts on Sundays           0       NA         NA




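  • num_matches is the number of terms from a found in the string

  • avg_age is the average age of the terms found in the string

  • avg_visits is the average number of visits of the terms found in the string.

Any ideas on how to implement this in an efficient way?

Thanks.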

Recommended answer

Use data.table and process each row with by = string. Save the match results in a list column, then aggregate from those match results.

Note that matches is a list column: each cell holds the match results for one string. You need to wrap the match results with .(), which is just an alias for list(), because data.table expects a list for normal columns.

    library(data.table)
    library(stringr)
    a = data.table(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
      age = c(30, 24, 52, 44, 73, 44, 33, 12),
      visits = c(5, 1, 3, 2, 8, 5, 19, 3))
    b = data.table(string = c("the red ball went over the fence",
      "sorry to see that your tent fell down",
      "the ball fell into the red salad",
      "serious people eat peanuts on Sundays"))
    
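    # Pre-allocate an empty list column to hold the match results.
    b[, matches := vector("list", .N)]
    # For each string, store a logical vector: does each term in a occur in it?
    b[, matches := .(list(str_detect(string, a[, term]))), by = string]
    # Count the matches, then average age and visits over the matching rows of a.
    # A string with no matches gets NaN here, because mean() of an empty vector is NaN.
    b[, num_matches := sum(unlist(matches)), by = string]
    b[, avg_age := mean(a[unlist(matches), age]), by = string]
    b[, avg_visits := mean(a[unlist(matches), visits]), by = string]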
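
As a cross-check on the answer above, the same aggregation can be sketched in base R without extra packages. This is not part of the original answer. It assumes whole-word matching is wanted (so "red" would not match inside "bored") and that the terms contain no regex metacharacters. The helper name summarise_matches is hypothetical.

```r
# Sample data from the question.
a <- data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
                age = c(30, 24, 52, 44, 73, 44, 33, 12),
                visits = c(5, 1, 3, 2, 8, 5, 19, 3))
b <- data.frame(string = c("the red ball went over the fence",
                           "sorry to see that your tent fell down",
                           "the ball fell into the red salad",
                           "serious people eat peanuts on Sundays"))

# Hypothetical helper: summarise the terms of `terms_df` found in each string.
summarise_matches <- function(strings, terms_df) {
  strings <- as.character(strings)
  terms <- as.character(terms_df$term)
  # One grepl() call per term, vectorised over all strings.
  # Assumption: terms are plain words, so wrapping them in \b is safe.
  hits <- vapply(terms, function(t) {
    grepl(paste0("\\b", t, "\\b"), strings, perl = TRUE)
  }, logical(length(strings)))
  # Force a strings-by-terms matrix, even when there is only one string.
  hits <- matrix(hits, nrow = length(strings))
  n <- rowSums(hits)
  # Mean of x over the matching terms; NA (rather than NaN) when nothing matched.
  avg <- function(x) {
    vapply(seq_along(strings), function(i) {
      if (n[i] > 0) mean(x[hits[i, ]]) else NA_real_
    }, numeric(1))
  }
  data.frame(string = strings,
             num_matches = n,
             avg_age = avg(terms_df$age),
             avg_visits = avg(terms_df$visits))
}

result <- summarise_matches(b$string, a)
print(result)
```

On the sample data this should reproduce desired_result, including NA in the last row. Looping over terms rather than over strings keeps the number of grepl() calls equal to the number of terms, which may help when b is much larger than a.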
    
