Clustering string data with ELKI
Question
I need to cluster a large number of strings with ELKI, based on the edit distance (Levenshtein distance). Because the data set is too large, I would like to avoid file-based precomputed distance matrices. How can I
(a) load string data into ELKI from a file (only "labels")?
(b) implement a distance function that accesses the labels (extending AbstractDBIDDistanceFunction, but how do I get the labels)?
Some code snippets or example input files would be helpful.
Recommended answer
It is actually quite simple:
A) Write a Parser that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably by subclassing AbstractStreamingParser, so that it produces a relation of the desired data type. You can probably just use String. If you want to be a bit more general, TokenSequence may be a more appropriate concept for these distances; strings are just the simplest case.
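As a rough illustration of the parsing step, here is a minimal sketch in plain Java that reads one string object per line. It deliberately does not use any ELKI classes: StringLineReader and readAll are hypothetical names, and skipping blank lines and lines starting with "#" is an assumption about the input format. In a real ELKI parser, the same loop would presumably live inside the AbstractStreamingParser subclass and emit each line as an object of the String relation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of ELKI): one string object per input line. */
final class StringLineReader {
  /** Reads all non-empty, non-comment lines from the given reader. */
  static List<String> readAll(Reader in) {
    List<String> objects = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(in)) {
      String line;
      while ((line = br.readLine()) != null) {
        line = line.trim();
        // Assumption: blank lines and lines starting with '#' carry no data.
        if (line.isEmpty() || line.startsWith("#")) {
          continue;
        }
        objects.add(line);
      }
    } catch (IOException e) {
      // Rethrow unchecked so callers do not need a throws clause.
      throw new UncheckedIOException(e);
    }
    return objects;
  }
}
```

With such a format, the example input file is simply a text file containing one string per line, and it could be read with StringLineReader.readAll(new java.io.FileReader(path)).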
B) Implement a DistanceFunction based on this data type instead of on DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.
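The core of such a distance function is the Levenshtein computation itself. The following is a minimal sketch of the standard two-row dynamic-programming algorithm in plain Java. LevenshteinSketch is a hypothetical class name, and the ELKI wrapper is intentionally left out because the exact method signature and return type of the distance method depend on the ELKI version in use.

```java
/** Hypothetical helper (not part of ELKI): character-level Levenshtein distance. */
final class LevenshteinSketch {
  /** Returns the minimum number of insertions, deletions and substitutions. */
  static int distance(String a, String b) {
    if (a.equals(b)) {
      return 0;
    }
    // prev[j] holds the distance between the first i-1 chars of a and first j chars of b.
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // transforming the empty prefix of a into b[0..j) takes j insertions
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // transforming a[0..i) into the empty string takes i deletions
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        int insert = curr[j - 1] + 1;
        int delete = prev[j] + 1;
        int substitute = prev[j - 1] + cost;
        curr[j] = Math.min(Math.min(insert, delete), substitute);
      }
      // Swap the rows so that prev always refers to the last completed row.
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }
}
```

The two-row variant needs only O(|b|) memory per call, which matters when the distance is evaluated many times on the fly instead of being read from a precomputed matrix.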
For performance reasons, you may also want to look into index structures that can efficiently retrieve, for example, the k most similar strings. I am not sure which index structures exist for string edit distance (Levenshtein distance).
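One structure commonly used with integer-valued metrics such as the Levenshtein distance is the BK-tree, which prunes subtrees using the triangle inequality. The sketch below is plain Java and makes no claim about what ELKI itself provides. BKTreeSketch and its methods are hypothetical names, and it answers range queries (all strings within a given radius) rather than k-nearest-neighbor queries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical BK-tree sketch (not part of ELKI) for edit-distance range queries. */
final class BKTreeSketch {
  private static final class Node {
    final String word;
    // Child subtrees keyed by their distance to this node's word.
    final Map<Integer, Node> children = new HashMap<>();
    Node(String word) { this.word = word; }
  }

  private Node root;

  /** Inserts a word; exact duplicates are ignored. */
  void add(String word) {
    if (root == null) { root = new Node(word); return; }
    Node cur = root;
    while (true) {
      int d = editDistance(word, cur.word);
      if (d == 0) { return; }
      Node child = cur.children.get(d);
      if (child == null) { cur.children.put(d, new Node(word)); return; }
      cur = child;
    }
  }

  /** Returns all stored words whose edit distance to the query is at most radius. */
  List<String> within(String query, int radius) {
    List<String> hits = new ArrayList<>();
    if (root != null) { search(root, query, radius, hits); }
    return hits;
  }

  private static void search(Node n, String q, int r, List<String> hits) {
    int d = editDistance(q, n.word);
    if (d <= r) { hits.add(n.word); }
    for (Map.Entry<Integer, Node> e : n.children.entrySet()) {
      int k = e.getKey();
      // Triangle inequality: only subtrees with |k - d| <= r can contain hits.
      if (k >= d - r && k <= d + r) { search(e.getValue(), q, r, hits); }
    }
  }

  /** Two-row dynamic-programming Levenshtein distance. */
  static int editDistance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) { prev[j] = j; }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
  }
}
```

How much this helps depends on the data: with small radii and short strings many subtrees are skipped, while large radii degrade toward a linear scan.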
A colleague has a student who apparently has some working token edit distances, but I have not seen or reviewed the code yet. Since he is processing log files, he will probably use a token-based approach rather than a character-based one.