Create a frequency matrix for bigrams from a list of tuples, using numpy or pandas

Question

I am very new to Python. I have a list of tuples, each of which is a bigram.

This question is very close to my needs

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

Now I am trying to convert this into a frequency matrix.

The desired output is:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0

How can I do this using numpy or pandas? Unfortunately, the only approaches I can find use nltk.

Recommended answer

You can create a frequency DataFrame whose rows and columns are labeled by the words, then increment the matching cell for each bigram:

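import pandas as pd

# Every distinct word labels both a row and a column.
words = sorted({item for t in my_list for item in t})
df = pd.DataFrame(0, columns=words, index=words)
# Row = first word of the bigram, column = second word.
for first, second in my_list:
    df.at[first, second] += 1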

Output:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0
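
Since the question also mentions numpy, the same counting can be sketched with a plain integer array. This is not part of the original answer: the names `index`, `rows`, `cols` and `matrix` are illustrative, and `np.add.at` is used so that a bigram appearing more than once would be counted every time.

```python
import numpy as np

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# Map each distinct word to a row/column position (alphabetical order).
words = sorted({w for pair in my_list for w in pair})
index = {w: i for i, w in enumerate(words)}

# Row index comes from the first word, column index from the second.
rows = [index[first] for first, _ in my_list]
cols = [index[second] for _, second in my_list]

matrix = np.zeros((len(words), len(words)), dtype=int)
# Unbuffered in-place add: repeated (row, col) pairs accumulate correctly.
np.add.at(matrix, (rows, cols), 1)
```

If labels are needed afterwards, `pd.DataFrame(matrix, index=words, columns=words)` wraps the array in the same layout as the table above.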

Note that in this version the order of the words within each bigram matters. If you don't care about order, first sort the contents of each tuple:

my_list = [tuple(sorted(i)) for i in my_list]
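
To see what this normalization does, here is a small sketch using a hypothetical list in which the same pair appears in both orders (the reversed pair is not in the original data and is added only for illustration):

```python
# Hypothetical input: ('we', 'consider') and its reverse both occur.
my_list = [('we', 'consider'), ('consider', 'we'), ('use', 'the')]

# Sorting the words inside each tuple makes both orders identical.
normalized = [tuple(sorted(pair)) for pair in my_list]
print(normalized)
# [('consider', 'we'), ('consider', 'we'), ('the', 'use')]
```

After sorting, both orderings land in the same cell, so all counts end up on or above the diagonal of the matrix.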

Another way is to use Counter to do the counting, although I expect similar performance. (Again, if the order within bigrams matters, remove the sorted call when building frequency_list.)

from collections import Counter

import pandas as pd

# Count each bigram, ignoring word order within the pair.
frequency_list = Counter(tuple(sorted(i)) for i in my_list)
words = sorted({item for t in my_list for item in t})
df = pd.DataFrame(0, columns=words, index=words)
for (first, second), count in frequency_list.items():
    df.at[first, second] = count

Output:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   1     0      0
of               0   0    0   0    0   0     0      1
the              0   0    0   0    1   0     0      0
to               0   0    0   0    0   0     1      0
use              0   0    0   0    0   0     0      0
we               0   0    0   0    0   0     0      0
what             0   0    0   0    0   0     0      0
words            0   0    0   0    0   0     0      0
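
A loop-free alternative, which is not part of the original answer, is to let `pd.crosstab` do the counting and then reindex onto the full vocabulary so that words with no bigram still get an all-zero row and column. The column names `first` and `second` are assumptions chosen for this sketch, and it counts ordered bigrams as given.

```python
import pandas as pd

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# One row per bigram; 'first' and 'second' are illustrative column names.
pairs = pd.DataFrame(my_list, columns=['first', 'second'])
words = sorted(set(pairs['first']) | set(pairs['second']))

# crosstab only includes words that actually occur in each position,
# so reindex to a square matrix and fill the missing cells with 0.
df = pd.crosstab(pairs['first'], pairs['second']).reindex(
    index=words, columns=words, fill_value=0
)
```

To ignore word order here as well, sort each tuple in `my_list` before building `pairs`.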
