Clustering string data with ELKI

Views: 121. Published: 2020/6/13 19:00:05. Tags: string, cluster-analysis, edit-distance, elki

Question

I need to cluster a large number of strings with ELKI, based on the edit distance (Levenshtein distance). Because the data set is so large, I'd like to avoid file-based precomputed distance matrices. How can I:

(a) load string data into ELKI from a file (only "labels")?

(b) implement a distance function that accesses the labels (by extending AbstractDBIDDistanceFunction, but how do I get the labels)?

Some code snippets or example input files would be helpful.

Recommended answer

It is actually quite simple:

A) Write a Parser that fits your input file format (why try to reuse a parser written for numerical vectors with labels?), probably by subclassing AbstractStreamingParser, that produces a relation of the desired data type. You can probably just use String. If you want to be a bit more general, TokenSequence may be a more appropriate concept for these distances; strings are just the simplest case.
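
As a rough illustration of step A, the sketch below reads one string object per line from a plain text file. It is plain Java and does not use ELKI at all: the class `LineStringReader`, the method `readStrings`, and the convention that blank lines and lines starting with `#` are skipped are all assumptions made for this example. A real parser would wrap the same loop in a subclass of AbstractStreamingParser, whose exact methods are not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an ELKI class): one string object per input line. */
final class LineStringReader {
  private LineStringReader() {
  }

  /** Reads every non-blank, non-comment line; each line becomes one string object. */
  static List<String> readStrings(Reader in) {
    List<String> result = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String trimmed = line.trim();
        // Assumption: blank lines and lines starting with '#' carry no data.
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
          continue;
        }
        result.add(trimmed);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }
}
```

An input file for this sketch would simply contain one string per line, for example `kitten` on the first line and `sitting` on the second; passing a `java.io.FileReader` for that file to `LineStringReader.readStrings` returns the two strings in file order.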

B) Implement a DistanceFunction based on this data type instead of on DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.
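
To make step B concrete, here is a minimal sketch of a Levenshtein distance that works directly on two strings. The interface `StringDistance` is a hypothetical stand-in for the idea of a primitive distance function on String; it is not ELKI's PrimitiveDistanceFunction, which may require additional methods that are not shown here. The class name `LevenshteinDistance` is likewise chosen only for this example.

```java
/** Hypothetical stand-in for a primitive distance on strings (not ELKI's interface). */
interface StringDistance {
  double distance(String a, String b);
}

/** Levenshtein distance: minimum number of single-character insertions, deletions, substitutions. */
final class LevenshteinDistance implements StringDistance {
  @Override
  public double distance(String a, String b) {
    // Two-row dynamic programming: prev[j] is the distance between
    // the first i-1 characters of a and the first j characters of b.
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // turning an empty prefix into j characters takes j insertions
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // turning i characters into an empty prefix takes i deletions
      for (int j = 1; j <= b.length(); j++) {
        int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        int insertion = curr[j - 1] + 1;
        int deletion = prev[j] + 1;
        curr[j] = Math.min(substitution, Math.min(insertion, deletion));
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }
}
```

Because the function takes the two objects themselves, there is no need to look up labels via DBIDs: `new LevenshteinDistance().distance("kitten", "sitting")` evaluates to 3.0. Memory use is proportional to the length of the second string, since only two rows of the table are kept.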

For performance reasons, you may also want to look into index structures that can efficiently retrieve, for example, the k most similar strings. I'm not sure which index structures exist for string edit distance and Levenshtein distance.
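
One structure that is commonly described for integer-valued metrics such as the Levenshtein distance is the BK-tree (Burkhard-Keller tree). The sketch below is a technique brought in for illustration, not something the answer names and not an ELKI class; `BkTree`, `add`, `rangeQuery`, and `editDistance` are names invented for this example. It supports a range query (all strings within a given radius), using the triangle inequality to skip subtrees; a k-nearest-neighbor search could be built on the same pruning rule.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical BK-tree over strings under Levenshtein distance (not an ELKI class). */
final class BkTree {
  private static final class Node {
    final String value;
    // Child reached via edge label d holds strings at distance d from this node.
    final Map<Integer, Node> children = new HashMap<>();

    Node(String value) {
      this.value = value;
    }
  }

  private Node root;

  /** Inserts a string; duplicates (distance 0 to a stored string) are ignored. */
  void add(String s) {
    if (root == null) {
      root = new Node(s);
      return;
    }
    Node node = root;
    while (true) {
      int d = editDistance(s, node.value);
      if (d == 0) {
        return;
      }
      Node child = node.children.get(d);
      if (child == null) {
        node.children.put(d, new Node(s));
        return;
      }
      node = child;
    }
  }

  /** Returns all stored strings whose distance to the query is at most radius. */
  List<String> rangeQuery(String query, int radius) {
    List<String> hits = new ArrayList<>();
    if (root != null) {
      collect(root, query, radius, hits);
    }
    return hits;
  }

  private static void collect(Node node, String query, int radius, List<String> hits) {
    int d = editDistance(query, node.value);
    if (d <= radius) {
      hits.add(node.value);
    }
    // Triangle inequality: a hit can only sit below an edge labeled in [d - radius, d + radius].
    for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
      if (e.getKey() >= d - radius && e.getKey() <= d + radius) {
        collect(e.getValue(), query, radius, hits);
      }
    }
  }

  /** Plain two-row Levenshtein distance on characters. */
  static int editDistance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(prev[j - 1] + cost, Math.min(curr[j - 1] + 1, prev[j] + 1));
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }
}
```

How much a structure like this saves depends on the data and the query radius: with small radii many subtrees are skipped, while with large radii the search degrades toward a linear scan.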

A colleague has a student who apparently has some working token edit distances, but I have not seen or reviewed the code yet. Since he is processing log files, he will probably use a token-based approach instead of a character-based one.
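
The student's code is not available, so the following is only a guess at what a token-based edit distance could look like: the same dynamic program as for characters, but applied to lists of tokens. The class `TokenEditDistance`, its methods, and the choice of splitting on whitespace are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical token-level edit distance (not the code mentioned above, not an ELKI class). */
final class TokenEditDistance {
  private TokenEditDistance() {
  }

  /** Assumption: tokens are separated by runs of whitespace. */
  static List<String> tokenize(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return Collections.emptyList();
    }
    return Arrays.asList(trimmed.split("\\s+"));
  }

  /** Minimum number of token insertions, deletions, and substitutions turning a into b. */
  static int distance(List<String> a, List<String> b) {
    int[] prev = new int[b.size() + 1];
    int[] curr = new int[b.size() + 1];
    for (int j = 0; j <= b.size(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.size(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.size(); j++) {
        // Whole tokens are compared, so "index.html" vs "about.html" counts as one substitution.
        int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
        curr[j] = Math.min(prev[j - 1] + cost, Math.min(curr[j - 1] + 1, prev[j] + 1));
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.size()];
  }
}
```

For log lines this treats a changed field as a single edit regardless of how many characters differ: the lines `GET /index.html 200` and `GET /about.html 200` are at token distance 1.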
