How can I cluster a list of lists of (tag, probability) tuples? - Python
Problem description
I have a collection of texts that have been classified into categories: each document is tagged 0, 1 or 2, with a probability for each tag.
[ "this is a foo bar",
"bar bar black sheep",
"sheep is an animal",
"foo foo bar bar",
"bar bar sheep sheep" ]
The previous tool in the pipeline returns a list of lists of tuples, where each element of the outer list corresponds to one document. All I know is that each document is tagged 0, 1 or 2, along with the probability of each tag:
[ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]
I need to find which tag is most probable for each list of tuples and group the lists accordingly:
[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] ,
[[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] ,
[[(0,0.4), (1,0.4), (2,0.5)]] ]
Another example:
[in]:
[ [(0,0.7), (1,0.2), (2,0.4)],
[(0,0.5), (1,0.9), (2,0.3)],
[(0,0.3), (1,0.8), (2,0.4)],
[(0,0.8), (1,0.2), (2,0.2)],
[(0,0.1), (1,0.7), (2,0.5)] ]
[out]:
[[[(0,0.7), (1,0.2), (2,0.4)],
[(0,0.8), (1,0.2), (2,0.2)]] ,
[[(0,0.5), (1,0.9), (2,0.3)],
[(0,0.1), (1,0.7), (2,0.5)],
[(0,0.3), (1,0.8), (2,0.4)]] ,
[]]
NOTE: I do NOT have access to the original text by the time the data reaches my part of the pipeline.
How can I cluster a list of lists of tuples with tags and probabilities? Is there something in numpy, scipy, sklearn or any other Python ML suite that does this? Or even NLTK?
Assume that the number of clusters is fixed but the cluster sizes are not.
So far I have only tried finding the maximum value for each tag as a centroid, but that gives me just the top element of each cluster:
instream = [ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]
# Find the largest (tag, probability) tuple for each tag.
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]
# Find the index of the first document that holds that tuple.
c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0]
print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c3_centroid])
[out] (top element in each cluster):
[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.4), (1, 0.4), (2, 0.5)]
Recommended answer
If I understood correctly, this is what you want.
import numpy as np
N_TYPES = 3
instream = [ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]
instream = np.array(instream)
# drop the tags and keep only the probabilities (shape: n_docs x N_TYPES)
values = instream[:, :, 1]
# assign each document to the cluster whose tag has the maximum probability
belongs_to = np.argmax(values, axis=1)
# construct clusters of indices to your instream
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]
# apply the indices to obtain full output
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]
Output of out:
[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]],
[[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]],
[[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]],
[[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]],
[[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]
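Note that converting to a numpy array turns the integer tags into floats, as the output above shows. If the original tuples need to be preserved, the same grouping can be sketched in plain Python. This is only an illustration, not part of the original answer: the helper name group_by_top_tag is hypothetical, and it assumes every tag is an integer in range(n_types). Ties go to the tag that appears first in a document's list.

```python
def group_by_top_tag(docs, n_types):
    """Group documents by the tag with the highest probability."""
    clusters = [[] for _ in range(n_types)]
    for doc in docs:
        # max() returns the first maximal item, so ties go to the earliest tag
        top_tag, _ = max(doc, key=lambda pair: pair[1])
        clusters[top_tag].append(doc)
    return clusters


instream = [[(0, 0.3), (1, 0.5), (2, 0.1)],
            [(0, 0.5), (1, 0.3), (2, 0.3)],
            [(0, 0.4), (1, 0.4), (2, 0.5)],
            [(0, 0.3), (1, 0.7), (2, 0.2)],
            [(0, 0.2), (1, 0.6), (2, 0.1)]]

out = group_by_top_tag(instream, 3)
for cluster in out:
    print(cluster)
```

A cluster that receives no documents stays an empty list, which matches the empty third cluster in the second example of the question.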
I used numpy arrays because they make searching and indexing convenient. For example, the expression (belongs_to == 1).nonzero()[0] returns the indices of the elements of belongs_to whose value is 1. An example of indexing with such an index array is instream[cluster_indices[2]].
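The indexing idiom can be tried in isolation. The short sketch below uses a hand-written belongs_to array (the cluster assignments for the five example documents) and a made-up docs array of placeholder names, purely for illustration:

```python
import numpy as np

# cluster assignment of each of the five example documents
belongs_to = np.array([1, 0, 2, 1, 1])

# boolean mask: True where the document belongs to cluster 1
mask = belongs_to == 1

# nonzero() returns a tuple with one index array per dimension
indices = mask.nonzero()[0]

# placeholder document names, used only to show integer-array indexing
docs = np.array(["doc0", "doc1", "doc2", "doc3", "doc4"])
selected = docs[indices]

print(indices.tolist())   # [0, 3, 4]
print(selected.tolist())  # ['doc0', 'doc3', 'doc4']
```

Indexing with the boolean mask directly, docs[mask], selects the same elements; the explicit index array is useful when the positions themselves are needed.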