Canopy clustering over strings: the input format and the Hadoop implementation
Problem description
I want to do canopy clustering over strings to reduce the number of distance computations, but I have no idea how to do canopy clustering over a set of strings.
When I searched, I found the Apache Hadoop implementation of text clustering. However, it says the input must be a sequence file of vectors, that is, the input has to be in a vector-readable format.
I have a single column of strings. How do I convert it into a sequence file and a vector file in Java, and how do I use Hadoop canopy clustering efficiently?
Example of the one-column word list:
quickly
need
close
this?
father-in-law
relatives
come
location?
little
specific
''where
exactly
chennai-bangalore
road?'',
far
road?
state
right
locality
in?
post
message
brahmma
weeks
max
Please help me, thanks.
Recommended answer
Next time, ask Google first: here [^]
I know, friend. But how do I do this in Java for my input?
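For a word list as small as the one in the question, one possible starting point is to run canopy clustering in memory, directly on the strings, before moving to Hadoop. The sketch below is only an illustration and is not the Hadoop implementation the question refers to: it assumes a length-normalized Levenshtein edit distance as the string metric, and the class name `StringCanopy`, the method names, and the thresholds `t1 = 0.6` and `t2 = 0.3` are all hypothetical choices. It avoids the vector conversion entirely, because the distance is computed on the raw strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class: in-memory canopy clustering over plain strings.
public class StringCanopy {

    // Levenshtein edit distance, computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Edit distance scaled to [0, 1] by the length of the longer string
    // (an assumed normalization, so thresholds do not depend on word length).
    static double distance(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 0.0 : (double) levenshtein(a, b) / max;
    }

    // Canopy clustering: t1 is the loose threshold, t2 the tight one (0 < t2 < t1).
    // Returns canopy center -> members, in the order the centers were picked.
    static Map<String, List<String>> canopies(List<String> words, double t1, double t2) {
        if (t2 <= 0 || t1 <= t2) {
            throw new IllegalArgumentException("need 0 < t2 < t1");
        }
        List<String> remaining = new ArrayList<>(words);
        Map<String, List<String>> result = new LinkedHashMap<>();
        while (!remaining.isEmpty()) {
            // Take the first remaining word as the next canopy center.
            String center = remaining.remove(0);
            List<String> members = new ArrayList<>();
            members.add(center);
            List<String> stillCandidates = new ArrayList<>();
            for (String w : remaining) {
                double d = distance(center, w);
                if (d < t1) {
                    members.add(w);          // loosely close: joins this canopy
                }
                if (d >= t2) {
                    stillCandidates.add(w);  // not tightly close: may join or start others
                }
            }
            remaining = stillCandidates;
            result.put(center, members);
        }
        return result;
    }

    public static void main(String[] args) {
        // A few words taken from the question's sample column.
        List<String> words = Arrays.asList(
                "road?", "road?'',", "relatives", "location?", "locality", "little", "right");
        canopies(words, 0.6, 0.3).forEach(
                (center, members) -> System.out.println(center + " -> " + members));
    }
}
```

Because only the loose threshold decides membership, a word can appear in more than one canopy; the more expensive clustering step then only needs to compare words that share a canopy. For data too large for one machine, the same column would still have to be converted to vectors and written to a sequence file as the question describes, which this sketch does not cover.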