How to find set of most frequently occurring word-pairs in a file using python?


Question


I have a data set as follows:

"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"

and so on

I want to find out the most frequently occurring word-pairs e.g.

(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)

The two words could be in any order and at any distance from each other.

Can someone suggest a possible solution in python? This is a very large data set.

Any suggestion is highly appreciated

So this is what I tried after suggestions from @275365

@275365 I tried the following, with the input read from a file:

    from itertools import combinations
    from collections import Counter

    def collect_pairs(file):
        pair_counter = Counter()
        for line in open(file):
            unique_tokens = sorted(set(line))
            combos = combinations(unique_tokens, 2)
            pair_counter += Counter(combos)
            print(pair_counter)

    file = 'myfileComb.txt'
    p = collect_pairs(file)

The text file has the same number of lines as the original one, but each line contains only unique tokens. When I run this, it splits the words into individual letters rather than producing combinations of words, and I don't know where I am making a mistake.
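A minimal reproduction of what I'm seeing (using one of the sample lines above): iterating over a raw line string yields characters, not tokens, so `set(line)` collects letters:

```python
line = '"485","AlterNet","Statistics"'
# a string is a sequence of characters, so set(line) collects single letters
print(sorted(set(line))[:5])  # ['"', ',', '4', '5', '8']
```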

Solution

You might start with something like this, depending on how large your corpus is:

>>> from itertools import combinations
>>> from collections import Counter

>>> def collect_pairs(lines):
    pair_counter = Counter()
    for line in lines:
        unique_tokens = sorted(set(line))  # exclude duplicates in same line and sort to ensure one word is always before other
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    return pair_counter

The result:

>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]

Do you want numbers included in these combinations or not? Since you didn't specifically mention excluding them, I have included them here.
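If you would rather exclude them, one possible tweak is to filter out purely numeric tokens before forming pairs (the `isdigit` check is my assumption about what counts as a "number" here):

```python
from collections import Counter
from itertools import combinations

def collect_pairs_no_numbers(lines):
    pair_counter = Counter()
    for line in lines:
        # drop purely numeric tokens such as the leading IDs ('485', '717', ...)
        tokens = sorted(t for t in set(line) if not t.isdigit())
        pair_counter += Counter(combinations(tokens, 2))
    return pair_counter
```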

EDIT: Working with a file object

The function that you posted as your first attempt above is very close to working. The only thing you need to do is change each line (which is a string) into a tuple or list. Assuming your data looks exactly like the data you posted above (with quotation marks around each term and commas separating the terms), I would suggest a simple fix: you can use ast.literal_eval. (Otherwise, you might need to use a regular expression of some kind.) See below for a modified version with ast.literal_eval:

from itertools import combinations
from collections import Counter
import ast

def collect_pairs(file_name):
    pair_counter = Counter()
    for line in open(file_name):  # these lines are each simply one long string; you need a list or tuple
        unique_tokens = sorted(set(ast.literal_eval(line)))  # eval will convert each line into a tuple before converting the tuple to a set
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    return pair_counter  # return the actual Counter object
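As a quick sanity check of the literal_eval step, applied to one of the sample lines:

```python
import ast

line = '"717","I like Sheen", "Narnia", "Statistics", "Estimation"'
tokens = ast.literal_eval(line)  # a bare comma-separated literal parses as a tuple
print(len(tokens))  # 5
```

One caveat: a line containing only a single quoted term would parse as a plain string rather than a tuple, so `set()` would again fall back to individual characters; if such lines can occur in your file, you may need a guard for that case.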

Now you can test it like this:

file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print(p.most_common(10))  # for example
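Since you mention the data set is very large, an alternative sketch (assuming the same quoted, comma-separated format) streams the file through the csv module instead of evaluating each line:

```python
import csv
from collections import Counter
from itertools import combinations

def collect_pairs_csv(file_name):
    pair_counter = Counter()
    with open(file_name) as f:
        # skipinitialspace handles the blanks after some commas in the sample data
        for row in csv.reader(f, skipinitialspace=True):
            unique_tokens = sorted(set(row))
            pair_counter += Counter(combinations(unique_tokens, 2))
    return pair_counter
```

Note that the Counter still grows with the number of distinct pairs, so for a truly huge vocabulary you would eventually need some disk-backed counting scheme.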
