Clustering words based on their char set

Question

Say there is a set of words, and I would like to cluster them based on their char bag (multiset). For example,

{tea, eat, abba, aabb, hello}

will be clustered into

{{tea, eat}, {abba, aabb}, {hello}}.

abba and aabb are clustered together because they have the same char bag, i.e. two a's and two b's.

To make this efficient, a naive way I can think of is to convert each word into a char-count series. For example, abba and aabb would both be converted to a2b2, and tea/eat would be converted to a1e1t1. I can then build a dictionary and group words that share the same key.
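As a minimal sketch of the char-count key described above, assuming the words contain only lowercase letters: the helper names `char_count_key` and `cluster_by_char_bag` are made up for illustration and do not come from the original post.

```python
from collections import Counter, defaultdict


def char_count_key(word):
    """Build a key such as 'a2b2' from the character counts of word."""
    counts = Counter(word)
    # Sorting the (char, count) pairs makes the key independent of letter order.
    return "".join(f"{ch}{n}" for ch, n in sorted(counts.items()))


def cluster_by_char_bag(words):
    """Group words that share the same character multiset."""
    groups = defaultdict(list)
    for word in words:
        groups[char_count_key(word)].append(word)
    return list(groups.values())


clusters = cluster_by_char_bag(["tea", "eat", "abba", "aabb", "hello"])
print(clusters)
```

Only the distinct characters of each word are sorted here, not the whole word, but the key is still a string, which is the awkwardness the question goes on to raise.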

There are two issues here: first, I have to sort the chars to build the key; second, the string key looks awkward, and its performance is not as good as that of char/int keys.

Is there a more efficient way to solve the problem?

Recommended answer

For detecting anagrams, you can use a hashing scheme based on the product of prime numbers: a -> 2, b -> 3, c -> 5, etc. This gives "abba" == "aabb" == 36 (though a different letter-to-prime mapping would be better). See my answer here.
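The prime-product idea can be sketched as follows, assuming lowercase a to z only and using the straightforward a -> 2, b -> 3, c -> 5 mapping from the answer. The names `PRIMES`, `prime_product_key`, and `cluster_by_prime_product` are hypothetical, chosen for this example.

```python
from collections import defaultdict

# The first 26 primes, assigned to 'a'..'z' in order (a -> 2, b -> 3, c -> 5, ...).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]


def prime_product_key(word):
    """Multiply the primes assigned to each letter; letter order does not matter."""
    key = 1
    for ch in word:
        key *= PRIMES[ord(ch) - ord("a")]
    return key


def cluster_by_prime_product(words):
    """Group words whose prime products (and hence char bags) are equal."""
    groups = defaultdict(list)
    for word in words:
        groups[prime_product_key(word)].append(word)
    return list(groups.values())


print(prime_product_key("abba"), prime_product_key("aabb"))
print(cluster_by_prime_product(["tea", "eat", "abba", "aabb", "hello"]))
```

Because prime factorization is unique, two words get the same product exactly when they have the same char bag, so no sorting is needed and the key is an integer. Python integers do not overflow, but in a language with fixed-width integers the product of a long word can exceed the integer range. That may be one reason the answer suggests a different mapping, for example giving the smallest primes to the most frequent letters, although the answer itself does not say so.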
