Find similar strings and reconcile them within one dataframe

Question

Another beginner question. Consider this example:

n = c(2, 3, 5)
s = c("ABBA", "ABA", "STING")
b = c(TRUE, "STING", "STRING")
df = data.frame(n,s,b)

  n     s      b
1 2  ABBA   TRUE
2 3   ABA  STING
3 5 STING STRING

How can I search within this data frame for similar strings (ABBA and ABA, as well as STING and STRING) and make them the same, without having to know the variations in advance? It doesn't matter whether ABBA or ABA is kept; either is fine. My actual data.frame is very large, so I cannot list all the variations by hand.

I would like something like this returned:

> n = c(2, 3, 5)
> s = c("ABBA", "ABBA", "STING")
> b = c(TRUE, "STING", "STING")
> df = data.frame(n,s,b)

> print(df)
  n     s     b
1 2  ABBA  TRUE
2 3  ABBA STING
3 5 STING STING

I have looked at agrep and stringdist, but the examples I found either compare two data.frames or require naming the column, which I can't do because I have many columns. Does anyone have an idea? Many thanks! Best regards, Steffi

Recommended answer

This worked for me, but there might be a better solution.

The idea is to use a recursive function, special, built on agrepl, the logical version of approximate grep (https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/agrep). Note that you can specify the error tolerance used to group similar strings. Using agrepl, I split off the rows whose s is similar to the s of the first row into x, mutate the s column to the first-occurring string, and add a grouping variable grp. The remaining rows, which were not included in the ith group, are stored in y and passed recursively through the function until y is empty.
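To make the matching step concrete, here is a small sketch of what agrepl returns for the strings in the question, using the default tolerance. The variable names are illustrative only. It also shows one caveat worth knowing: agrepl looks for approximate matches within substrings, so a short pattern can match a longer string that merely contains it.

```r
strings <- c("ABBA", "ABA", "STING", "STRING")

# One TRUE/FALSE per element: is the pattern approximately contained in it?
abba_hits  <- agrepl("ABBA", strings)
sting_hits <- agrepl("STING", strings)
print(abba_hits)    # TRUE TRUE FALSE FALSE
print(sting_hits)   # FALSE FALSE TRUE TRUE

# Caveat: matching is against substrings, so "ABA" also matches a longer
# string that contains it exactly.
substring_hit <- agrepl("ABA", "ALIBABA")
print(substring_hit)  # TRUE
```

If whole-string similarity is what you need, this substring behaviour may group more strings than intended, so it is worth inspecting the groups on a sample of the real data.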

You need the dplyr package: install.packages("dplyr")

library(dplyr)

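desired <- NULL   # accumulates the processed rows
grp <- 1          # index of the first group

special <- function(x, y, grp) {
  if (nrow(y) < 1) {
    # nothing left to process: return the accumulated rows
    return(x)
  }
  # rows whose s approximately matches the s of the first remaining row
  similar <- agrepl(y$s[1], y$s)
  # set s to the first-occurring string, tag the group, and append to x
  x <- rbind(x, y[similar, ] %>% mutate(s = head(s, 1)) %>% mutate(grp = grp))
  # keep the unmatched rows; logical indexing leaves duplicate rows intact
  y <- y[!similar, , drop = FALSE]
  special(x, y, grp + 1)
}

desired <- special(desired, df, grp)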

This gives:

  n     s      b grp
1 2  ABBA   TRUE   1
2 3  ABBA  STING   1
3 5 STING STRING   2

To change how strict the string matching is, set max.distance, e.g. agrepl(x, y, max.distance = 0.5). The default is 0.1.
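As a rough illustration of the max.distance setting mentioned above: a value below 1 is read as a fraction of the pattern length and rounded up to a whole number of edits. The strings "STRING" and "STRONG" below are chosen only for the example.

```r
candidates <- c("STRING", "STRONG")

# Default max.distance = 0.1: at most ceiling(0.1 * 5) = 1 edit for "STING".
strict <- agrepl("STING", candidates)

# max.distance = 0.5 allows up to ceiling(0.5 * 5) = 3 edits.
loose <- agrepl("STING", candidates, max.distance = 0.5)

print(strict)  # TRUE FALSE
print(loose)   # TRUE TRUE
```

A larger tolerance merges more variants but also increases the risk of merging strings that are genuinely different, so the right value depends on the data.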

To remove the grouping variable:

withoutgrp <- desired %>% select(-grp)
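The function above only reconciles the s column. Since the question mentions many columns, one possible extension is sketched below. It is not part of the original answer: reconcile is a hypothetical helper that applies the same "first occurrence wins" idea to a single character vector, using a loop instead of recursion, and is then applied to every character column. It assumes the columns contain no empty strings and skips NA values.

```r
# Hypothetical helper: replace each string with the first similar string
# found earlier in the same vector (similarity as judged by agrepl).
reconcile <- function(x, max.distance = 0.1) {
  x <- as.character(x)
  out <- x
  remaining <- !is.na(x)            # positions not yet assigned to a group
  while (any(remaining)) {
    i <- which(remaining)[1]        # first unassigned string becomes the label
    similar <- remaining & agrepl(x[i], x, max.distance = max.distance)
    similar[i] <- TRUE              # make sure the loop always advances
    out[similar] <- x[i]
    remaining[similar] <- FALSE
  }
  out
}

df <- data.frame(n = c(2, 3, 5),
                 s = c("ABBA", "ABA", "STING"),
                 b = c(TRUE, "STING", "STRING"),
                 stringsAsFactors = FALSE)

# Apply the helper to every character column without naming the columns.
is_text <- vapply(df, is.character, logical(1))
df[is_text] <- lapply(df[is_text], reconcile)
print(df)
```

Each column is reconciled independently here, which matches the desired output in the question. If the columns are factors rather than character vectors, they would need to be converted first.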
