How can I add items to collections.Counter and then sort them in ascending order?


Question

At the moment I'm trying to process the lingspam dataset by counting the occurrences of words in 600 files (400 emails and 200 spam emails). I've already reduced each word to a common stem with the Porter stemming algorithm, and I would also like my results to be standardized across each file for further processing. But I'm unsure how to accomplish this.

In order to get the output below, I need to be able to add items that may not exist inside the file, and list them in a consistent ascending order.

printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0), ('univers', 0), ('sale', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2), ('univers', 0), ('sale', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0), ('univers', 0), ('sale', 1)]

Which I then plan on converting into vectors using numpy.

[0,0,0]
[2,0,0]
[0,0,1]

instead of the output I currently get:

printing from ./../lingspam_results/spmsgb165.txt.out
[]
printing from ./../lingspam_results/spmsgb166.txt.out
[('univers', 2)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('sale', 1)]

How can I standardize my results from the Counter module into ascending order, while also adding items from my search_list to the Counter result even when they don't appear in the file? I've already tried the code below, which simply reads each text file and builds a list based on the search_list.

import numpy as np, os          # numpy is planned for the vector conversion later
from collections import Counter

def parse_bag(directory, search_list):
    # walk the directory tree and count words in every file found
    for (dirpath, dirnames, filenames) in os.walk(directory):
        for f in filenames:
            path = os.path.join(dirpath, f)
            count_words(path, search_list)

def count_words(filename, search_list):
    textwords = open(filename, 'r').read().split()
    # keep only the words we are searching for
    filteredwords = [t for t in textwords if t in search_list]
    wordfreq = Counter(filteredwords).most_common(5)
    print "printing from " + filename
    print wordfreq

search_list = ['sale', 'univers', 'money']
parse_bag("./../lingspam_results", search_list)

Thanks

Solution

From your question, it sounds like your requirement is the same words in a consistent ordering across all files, together with their counts. This should do it for you:

def count_words(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0        # ensure exists
    wordfreq = sorted(counter.items())
    print "printing from " + filename
    print wordfreq

search_list = ['sale', 'univers', 'money']

Sample output:

printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0), ('sale', 1), ('univers', 0)]

I don't think you want to use most_common at all since you specifically don't want the contents of each file to affect the ordering or list length.
