Cluster algorithm with Levenshtein distance and additional features/variables


Question

I have a dataset of text tokens (words, various kinds of identification numbers, and a few other types) that I want to group using an unsupervised clustering algorithm.

Using features that I extract from the text (number of characters, number of digits, number of letters, some regex matches, etc.), algorithms such as k-means (just an example; I am not bound to k-means) work fine. However, I would like to add more information, such as the Levenshtein distance, which I can use with hclust.

What I cannot find is a starting point for combining the two kinds of data: data tied to a pair of observations, such as a distance metric, and data tied to a single observation, such as the number of characters in each token.

Did I miss something simple? Is this even possible, or am I looking for the wrong algorithm?

Below is a small example dataset and the approaches I have tried so far.


# create some data
set.seed(123)
x <- sapply(1:20, function(i) {
 paste(c(
  sample(LETTERS, sample(1:10, 1), replace = TRUE),
  sample(1:9, sample(1:10, 1), replace = TRUE),
  sample(LETTERS[1:10], 2)
 ), collapse = "")
})
head(x)
#> [1] "UKW1595595761IC" "I9769675632JI"   "UAMTFIG44DB"     "GM814HB"        
#> [5] "FDTXJR4CH"       "VVULT7152464BC"

# apply the different algorithms
# 1. K-means
df <- data.frame(x)
df$nchars <- nchar(x)
df$n_nums <- nchar(gsub("[^[:digit:]]", "", x))
# etc.

kclust <- kmeans(df[, 2:3], centers = 2)
# plot the numeric columns only; pairs() rejects the character column x
pairs(df[, 2:3], col = c(2:3)[kclust$cluster])

# 2. Levenshtein distance and hclust
distance <- adist(x)
rownames(distance) <- x
hc <- hclust(as.dist(distance)) 
plot(hc)

# 3. Combination of adist(x) and the df-variables
# ???

Recommended answer

If you want to combine the Levenshtein distance with something like the Euclidean distance, you can combine the two distance matrices, since they have the same shape, and pass the result to hclust. Rescaling each matrix to [0, 1] first keeps either one from dominating the sum.

stats <- cbind(df$nchars, df$n_nums)

euc <- as.matrix(dist(stats))
rownames(euc) <- x

lev <- adist(x)
rownames(lev) <- x

scale01 <- function(x) {
    z <- (x - min(x))
    z / max(z)
}

combi <- scale01(euc) + scale01(lev)

hc.combi <- hclust(as.dist(combi))
plot(hc.combi)

Of course, you can weight the two matrices however you like.
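As a rough sketch of such a weighting, the snippet below wraps the rescale-and-add step in a helper with a single weight `w`. The helper name `combine_dist`, the parameter `w`, and the toy `tokens` vector are assumptions made for illustration; they are not part of the original answer.

```r
# Rescale a matrix to the range [0, 1] (same idea as scale01 above)
scale01 <- function(m) {
  z <- m - min(m)
  z / max(z)
}

# Hypothetical helper: weighted sum of two rescaled distance matrices.
# w is the weight of d1; d2 gets 1 - w.
combine_dist <- function(d1, d2, w = 0.5) {
  stopifnot(all(dim(d1) == dim(d2)), w >= 0, w <= 1)
  w * scale01(d1) + (1 - w) * scale01(d2)
}

# Made-up tokens, used only for this sketch
tokens <- c("AB12", "AB13", "XYZ999", "XYZ998", "Q7")
feats  <- cbind(nchar(tokens), nchar(gsub("[^[:digit:]]", "", tokens)))

euc <- as.matrix(dist(feats))  # per-token features turned into pairwise distances
lev <- adist(tokens)           # pairwise Levenshtein distances

# w = 0.25: the Levenshtein part counts three times as much as the feature part
combi_w <- combine_dist(euc, lev, w = 0.25)
dimnames(combi_w) <- list(tokens, tokens)

hc_w   <- hclust(as.dist(combi_w))
groups <- cutree(hc_w, k = 2)  # k = 2 is an arbitrary choice
```

Setting `w` closer to 1 makes the clustering follow the numeric features more closely, while values near 0 let string similarity dominate.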

If you want to combine k-means and hierarchical clustering, I know of one way to do that. Essentially, you perform hierarchical clustering on the data, cut the tree into k groups, calculate the mean of each group, and pass those means to k-means as the starting centroids.

hc2 <- hclust(dist(stats))
clusters <- cutree(hc2, k=3)

centers <- aggregate(stats, list(clusters), mean)[, -1]

hkclust <- kmeans(stats, centers)
# plot the numeric columns only; pairs() rejects the character column x
pairs(df[, 2:3], col = c(2:4)[hkclust$cluster])

If you want to combine k-means with Levenshtein, I'm afraid I don't know how to do that, as it doesn't make much sense to pass a distance matrix to k-means. Maybe k-medoids could work?
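One way to try the k-medoids idea is `pam()` from the `cluster` package, which accepts a dissimilarity object directly. The sketch below is only an illustration under assumptions: the toy `tokens` vector and `k = 3` are made up, and the combined matrix is built the same way as `combi` above.

```r
library(cluster)  # provides pam(); assumed to be installed

# Made-up tokens, used only for this sketch
tokens <- c("AB12", "AB13", "XYZ999", "XYZ998", "Q7", "Q8")
feats  <- cbind(nchar(tokens), nchar(gsub("[^[:digit:]]", "", tokens)))

# Rescale a matrix to the range [0, 1]
scale01 <- function(m) {
  z <- m - min(m)
  z / max(z)
}

# Feature distances plus Levenshtein distances, each rescaled first
combi <- scale01(as.matrix(dist(feats))) + scale01(adist(tokens))
dimnames(combi) <- list(tokens, tokens)

# Partitioning around medoids on the combined dissimilarities;
# k = 3 is an arbitrary choice for this toy data
pm <- pam(as.dist(combi), k = 3)

pm$clustering  # cluster index for each token
pm$id.med      # row indices of the tokens chosen as medoids
```

Because each medoid is an actual observation, no cluster mean is ever computed, which is why a pure distance matrix is enough here.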
