Find matching groups of strings in R

Problem description

I have a vector of about 8000 strings. Each element in the vector is a company name.

My objective

My objective is to cluster these company names into groups, so that each cluster contains company names that are similar to each other. For example, ROYAL DUTCH SHELL, SHELL USA, BMCC SHELL, etc. would belong to the same group/cluster, as they are all Shell-based companies, i.e. they have the word 'Shell' in their names.

With a vector of this size, finding groups of similar company names with the clustering technique I've employed seems to take forever. On smaller vectors, however, this method works well.

Let me demonstrate my approach using an example vector of company names, which is much smaller than the original one.

With a small vector of strings, this approach works very well.

The vector looks like this:

string=c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
     "GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")

My attempt

To tackle this problem, I used a hierarchical clustering method.

# load packages
pacman::p_load(stringdist, dplyr, tm, gplots)

But first, some prep work:

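# Function to clean strings: lower-case, strip punctuation,
# collapse repeated whitespace and trim leading/trailing whitespace
str_clean <- function(strings) {
  require(dplyr)
  require(tm)
  strings %>% tolower() %>% removePunctuation() %>% stripWhitespace() %>%
    trimws()  # base R; trim() is not provided by dplyr or tm
}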

# Clean company names
clean_names = str_clean(string)

n = length(clean_names)

Now to calculate the distances between the names, to be used for clustering:

# Distance methods
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3)  # size of q-gram

dist.methods <- list()

# create distance matrix for every pair of listings, for each method
for (m in 1:length(methods)) {
  dist = matrix(NA, ncol = n, nrow = n)  # initialize empty matrix
  # row.names(dist) = prods
  for (i in 1:n) {
    for (j in 1:n) {
      dist[i, j] <- stringdist(clean_names[i], clean_names[j], method = methods[m],
                               q = q[m])
    }
  }
  dist.methods[[m]] <- dist
}

Once the distance calculations are done, I choose a method and set an appropriate cut-off:

# hierarchical clustering on the cosine distances, with a cut-off height of 0.99
clusters <- hclust(as.dist(dist.methods[[3]]))
plot(clusters)
df=as.data.frame(cbind("Companies" = clean_names, "Cluster" = cutree(clusters, h = .99)))

The resulting data frame has all the company names categorized into clusters, just like I wanted.

df=df %>% group_by(Cluster)
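
To read off the members of each cluster, base R's split() is one option. The sketch below is self-contained and uses a hypothetical toy data frame (toy) with the same two columns as df; the cluster numbers in it are made up for illustration and are not the output of the clustering above.

```r
# Hypothetical toy result with the same two columns as df;
# the cluster numbers are invented for illustration only.
toy <- data.frame(
  Companies = c("royal dutch shell", "shell usa", "klm group", "klm hostel", "kpmg"),
  Cluster   = c(1, 1, 2, 2, 3),
  stringsAsFactors = FALSE
)

# One character vector of company names per cluster
members <- split(toy$Companies, toy$Cluster)

# Number of companies in each cluster
sizes <- lengths(members)

print(members)
print(sizes)
```

Applied to the real result, the same call would be split(df$Companies, df$Cluster).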

However, as I mentioned, when I use my original vector of 8000 company names, the distance calculations take too long and I cannot proceed.
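
A rough count of the work involved helps explain the slowdown. The sketch below only does the arithmetic for the nested loops shown earlier, assuming n = 8000 names and the three distance methods. The variable names are illustrative.

```r
n <- 8000                        # number of company names (from the question)
n_methods <- 3                   # "lcs", "osa", "cosine"

# The nested loops call stringdist() once for every ordered (i, j) pair
calls_per_method <- n^2
total_calls <- calls_per_method * n_methods

# These distances are symmetric, so only the unordered pairs are distinct
unique_pairs <- n * (n - 1) / 2

print(format(c(calls_per_method, total_calls, unique_pairs), big.mark = ","))
```

So each method costs 64 million separate stringdist() calls from R, about twice the number of distinct pairs, and each call carries its own function-call overhead.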

My question

Is there a workaround for this method when I am working with a larger vector of strings?

Or maybe clustering is not the right solution to this problem for larger vectors? In that case, what else could I do to achieve my result?

Any help would be much appreciated.

Recommended answer

Get rid of the two inner for loops, which are what is slowing you down, and use stringdistmatrix instead. Your vector is long but the strings are short; see the benchmark at the bottom.

library(stringdist)

strings <- c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
         "GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")
stringsBig <- rep(strings, 500)    
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3)  #size of q-gram    
dist.methods <- list()

# create distance matrix for every pair of listing, for each method
for (m in 1:length(methods)) {
  dist.methods[[m]] <- stringdistmatrix(stringsBig, method = methods[[m]], q = q[[m]])
}

# compare stringdistmatrix with the nested loops on the small vector
microbenchmark::microbenchmark(stringdistmatrix(strings),
                               for (i in 1:length(strings)) {
                                 for (j in 1:length(strings)) {
                                   stringdist(strings[i], strings[j])
                                 }
                               }, times = 100)

# Unit: microseconds
# expr                          min         lq       mean     median        uq       max neval cld
# stringdistmatrix(strings) 105.212   131.2805   241.9271   251.2235   279.634  2909.624   100  a 
# for loop                36147.878 38165.8480 40411.9772 39527.5500 42170.895 54151.457   100   b

microbenchmark::microbenchmark(stringdistmatrix(stringsBig), times=10)
# Unit: seconds
# expr    min       lq    mean   median       uq      max neval
# stringdistmatrix(stringsBig) 1.5324 1.585354 1.66592 1.655901 1.691157 1.825333    10
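
To connect this back to the clustering in the question, here is a minimal sketch of the whole pipeline with stringdistmatrix in place of the nested loops. It assumes the cosine distance on 3-grams and the cut-off height of 0.99 from the question, uses a shortened sample vector, and cleans the names with base R only. The names sample_names, d, hc, groups and result are illustrative.

```r
library(stringdist)

# Shortened sample of the company names from the question
sample_names <- c("ROYAL DUTCH SHELL", "SHELL USA", "KLM NEDERLAND", "KLM GROUP",
                  "SHELL AUSTRALIA", "BMCC SHELL", "klm hostel", "KPMG")

# Light cleaning with base R: lower-case and trim surrounding whitespace
clean_names <- trimws(tolower(sample_names))

# With a single vector argument, stringdistmatrix() returns a "dist" object
# (lower triangle only), which hclust() accepts directly
d <- stringdistmatrix(clean_names, method = "cosine", q = 3)

# Hierarchical clustering, cut at the height used in the question
hc <- hclust(d)
groups <- cutree(hc, h = 0.99)

result <- data.frame(Companies = clean_names, Cluster = groups,
                     stringsAsFactors = FALSE)
print(split(result$Companies, result$Cluster))
```

For 8000 names the dist object holds about 32 million distances, roughly 256 MB as doubles, so memory is worth keeping an eye on as the vector grows.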
