Clustering string data with ELKI

Question

I need to cluster a large number of strings with ELKI, based on the edit distance (Levenshtein distance). Since the data set is too large, I'd like to avoid file-based precomputed distance matrices. How can I

(a) load string data in ELKI from a file (only "Labels")?

(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)

Some code snippets or example input files would be helpful.

Recommended answer

This is actually quite simple:

A) Write a Parser that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably by subclassing AbstractStreamingParser, that produces a relation of the desired data type. You can probably just use String; if you want to be a bit more general, TokenSequence may be a more appropriate concept for these distances, with strings being the simplest case.
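As a rough illustration of the parsing logic only, here is a minimal sketch in plain Java that turns each non-empty line of the input into one string object. The class name `StringLineReader`, the method `readAll`, and the rule of skipping blank lines and `#` comments are assumptions made for this example; it does not use or reproduce the ELKI parser API, so inside ELKI the same loop would have to be adapted to whatever AbstractStreamingParser requires in your version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of ELKI): one string object per input line.
final class StringLineReader {
  // Reads all lines, skipping empty lines and lines starting with '#'
  // (the comment convention is an assumption of this sketch).
  static List<String> readAll(Reader in) {
    List<String> objects = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.isEmpty() || line.startsWith("#")) {
          continue;
        }
        objects.add(line);
      }
    } catch (IOException e) {
      // Rethrow unchecked so callers do not need a throws clause.
      throw new UncheckedIOException(e);
    }
    return objects;
  }
}
```

Each returned string would then become one object of the String relation that the distance function works on.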

B) Implement a DistanceFunction based on this data type instead of DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.
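The core of such a distance function is just the Levenshtein computation on two strings. The following is a minimal sketch of that computation in plain Java, using the standard two-row dynamic program. The class name `LevenshteinSketch` is made up for this example, and the ELKI wrapper is deliberately left out because the exact method signatures depend on the ELKI version; the idea is that the `distance` method of your PrimitiveDistanceFunction<String> subclass would delegate to something like this.

```java
// Hypothetical stand-alone helper; not an ELKI class.
final class LevenshteinSketch {
  // Returns the minimum number of single-character insertions,
  // deletions and substitutions needed to turn a into b.
  static int distance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    // Distance from the empty prefix of a to each prefix of b.
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // delete all i characters of a's prefix
      for (int j = 1; j <= b.length(); j++) {
        int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        int insertion = curr[j - 1] + 1;
        int deletion = prev[j] + 1;
        curr[j] = Math.min(substitution, Math.min(insertion, deletion));
      }
      // Swap rows so that prev always holds the last completed row.
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }
}
```

For example, `LevenshteinSketch.distance("kitten", "sitting")` evaluates to 3. The cost is proportional to the product of the two string lengths, which matters when the clustering algorithm evaluates many pairs.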

For performance reasons, you may also want to look into indexing algorithms to retrieve e.g. the k most similar strings efficiently. I'm not sure which index structures exist for string edit distance and Levenshtein distance.

A colleague has a student who apparently has some working token edit distances, but I have not seen or reviewed the code yet. As he is processing log files, he will probably be using a token-based approach instead of characters.
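To make the token-based idea concrete, here is a minimal sketch of an edit distance over token lists instead of characters. It is not the student's code mentioned above, which I have not seen; the class name `TokenEditDistanceSketch` and the whitespace tokenization are assumptions chosen purely for illustration, and real log files may need a different tokenizer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not an ELKI class and not the code referred to above.
final class TokenEditDistanceSketch {
  // Splits a line on runs of whitespace (an assumed tokenization rule).
  static List<String> tokenize(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return Collections.emptyList();
    }
    return Arrays.asList(trimmed.split("\\s+"));
  }

  // Edit distance where one operation inserts, deletes or replaces a whole token.
  static int distance(List<String> a, List<String> b) {
    int[] prev = new int[b.size() + 1];
    int[] curr = new int[b.size() + 1];
    for (int j = 0; j <= b.size(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.size(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.size(); j++) {
        int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
        curr[j] = Math.min(prev[j - 1] + cost, Math.min(curr[j - 1] + 1, prev[j] + 1));
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.size()];
  }
}
```

With this sketch, two log lines such as "user alice logged in" and "user bob logged in" differ by a single token replacement, whereas a character-level distance would count every differing character in the user name.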
