How can I cluster a list of lists of (tag, probability) tuples? - Python

Question

I have a bunch of texts that have been classified into categories: each document is tagged 0, 1 or 2, with a probability for each tag.

[ "this is a foo bar",
  "bar bar black sheep",
  "sheep is an animal",
  "foo foo bar bar",
  "bar bar sheep sheep" ]

The previous tool in the pipeline returns a list of lists of tuples, where each element of the outer list corresponds to one document. All I have to work with is that each document is tagged 0, 1 or 2, along with the probability of each tag:

[ [(0,0.3), (1,0.5), (2,0.1)],
  [(0,0.5), (1,0.3), (2,0.3)],
  [(0,0.4), (1,0.4), (2,0.5)],
  [(0,0.3), (1,0.7), (2,0.2)],
  [(0,0.2), (1,0.6), (2,0.1)] ]

I need to find the most probable tag for each list of tuples and group the lists by that tag, to get:

[ [[(0,0.5), (1,0.3), (2,0.3)]] ,
  [[(0,0.3), (1,0.5), (2,0.1)], [(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] ,
  [[(0,0.4), (1,0.4), (2,0.5)]] ]

Another example:

[in]:

[ [(0,0.7), (1,0.2), (2,0.4)],
  [(0,0.5), (1,0.9), (2,0.3)],
  [(0,0.3), (1,0.8), (2,0.4)],
  [(0,0.8), (1,0.2), (2,0.2)],
  [(0,0.1), (1,0.7), (2,0.5)] ]

[out]:

 [[[(0,0.7), (1,0.2), (2,0.4)],
 [(0,0.8), (1,0.2), (2,0.2)]] ,

 [[(0,0.5), (1,0.9), (2,0.3)],
 [(0,0.1), (1,0.7), (2,0.5)],
 [(0,0.3), (1,0.8), (2,0.4)]] ,

 []]

NOTE: I do NOT have access to the original text by the time the data reaches my part of the pipeline.

How can I cluster a list of lists of tuples with tags and probabilities? Is there something in numpy, scipy, sklearn or any other Python ML suite that does this? Or even NLTK?

Assume that the number of clusters is fixed but the cluster sizes are not.

So far I have only tried finding the maximum value for each centroid, but that gives me just the first element of each cluster:

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]

# Find the centroid value: the largest (tag, probability) pair for each tag.
c1_centroid_value = max(doc[0] for doc in instream)
c2_centroid_value = max(doc[1] for doc in instream)
c3_centroid_value = max(doc[2] for doc in instream)

# Find the index of the first document that holds each centroid value.
c1_centroid = [i for i, doc in enumerate(instream) if doc[0] == c1_centroid_value][0]
c2_centroid = [i for i, doc in enumerate(instream) if doc[1] == c2_centroid_value][0]
c3_centroid = [i for i, doc in enumerate(instream) if doc[2] == c3_centroid_value][0]

print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c3_centroid])

[out] (top element in each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.4), (1, 0.4), (2, 0.5)]

Recommended answer

If I understood correctly, this is what you want.

import numpy as np

N_TYPES = 3

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]
# shape (n_docs, N_TYPES, 2); the integer tags are converted to floats
instream = np.array(instream)

# this removes document tags because we only consider probabilities here
values = instream[:, :, 1]

# determine the cluster of each document by using maximum probability
# (assumes the tuples in each document are ordered by tag: 0, 1, 2)
belongs_to = np.argmax(values, axis=1)

# construct clusters of indices to your instream
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]

# apply the indices to obtain full output
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]

The output, out:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]],

 [[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]],
  [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]],
  [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]],

 [[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]
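
Note that converting instream to a NumPy array turns each tuple into a list and each integer tag into a float, as the output above shows. If the original tuples need to be kept intact, a plain-Python variant is one option. The following is only a sketch under assumptions: the function name cluster_by_top_tag is made up for illustration, and every tag is assumed to be an integer in range(n_types). It is run here on the second example from the question, which has an empty cluster.

```python
def cluster_by_top_tag(docs, n_types):
    """Group documents by the tag that has the highest probability."""
    clusters = [[] for _ in range(n_types)]
    for doc in docs:
        # max() returns the first maximal pair, so ties go to the earliest tag.
        top_tag, _ = max(doc, key=lambda pair: pair[1])
        clusters[top_tag].append(doc)
    return clusters


instream = [[(0, 0.7), (1, 0.2), (2, 0.4)],
            [(0, 0.5), (1, 0.9), (2, 0.3)],
            [(0, 0.3), (1, 0.8), (2, 0.4)],
            [(0, 0.8), (1, 0.2), (2, 0.2)],
            [(0, 0.1), (1, 0.7), (2, 0.5)]]

out = cluster_by_top_tag(instream, 3)
for cluster in out:
    print(cluster)
```

Documents keep their input order within each cluster, and a tag that is never the most probable one yields an empty list.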

I used NumPy arrays because they make searching and indexing convenient. For example, the expression (belongs_to == 1).nonzero()[0] returns the array of indices into belongs_to at which the value is 1. An example of indexing with such an array is instream[cluster_indices[2]].
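
To make the indexing step concrete, here is a small sketch of what the mask and the index array look like. The belongs_to values are the argmax results for the five documents in the first example; the docs array of string labels is a hypothetical stand-in for instream, used only to keep the printout short.

```python
import numpy as np

# Most probable tag for each of the five documents in the first example.
belongs_to = np.array([1, 0, 2, 1, 1])

# Comparing an array with a scalar gives a boolean mask of the same length.
mask = belongs_to == 1

# nonzero() returns a tuple with one index array per dimension; [0] takes the only one.
indices = mask.nonzero()[0]

# Hypothetical labels standing in for the rows of instream.
docs = np.array(["doc0", "doc1", "doc2", "doc3", "doc4"])

# Indexing with an integer array selects those rows in order.
selected = docs[indices]

print(mask)
print(indices)
print(selected)
```

For a one-dimensional array, np.flatnonzero(mask) gives the same index array as mask.nonzero()[0].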
