Combining Lists of Word Frequency Data

Problem description

This seems like it should be an obvious question, but the tutorials and documentation on lists are not forthcoming. Many of these issues stem from the sheer size of my text files (hundreds of MB) and my attempts to boil them down to something manageable by my system. As a result, I'm doing my work in segments and am now trying to combine the results.

I have multiple word frequency lists (~40 of them). The lists can either be taken through Import[ ] or as variables generated in Mathematica. Each list appears as the following and has been generated using the Tally[ ] and Sort[ ] commands:

{{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 16850}, {"in", 16164}, {"de", 14930}, {"a", 14660}, {"to", 14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 5735}, <<51293>>, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

Here is an example of the second file:

{{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 16262}, {"and", 14488}, {"to", 12726}, {"a", 12635}, {"in", 11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 7748}, <<101032>>, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

I want to combine them so that the frequency data aggregates: i.e. if the second file has 30,419 occurrences of 'the' and is joined to the first file, it should return that there are 72,635 occurrences (and so on as I move through the entire collection).

Recommended answer

It sounds like you need GatherBy.

Suppose your two lists are named data1 and data2. Then use:

{#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join[data1, data2], First]
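To see what each part of that expression does, here is a small sketch that runs it in two steps. The toy lists below are invented for illustration and are not taken from the question's files; the names groups and combined are likewise only for this example.

```mathematica
(* Toy data, invented for illustration; not the question's real tallies *)
data1 = {{"the", 5}, {"of", 3}, {"la", 1}};
data2 = {{"the", 2}, {"et", 4}, {"of", 1}};

(* Step 1: concatenate the lists, then group the pairs that share a word *)
groups = GatherBy[Join[data1, data2], First];
(* {{{"the", 5}, {"the", 2}}, {{"of", 3}, {"of", 1}}, {{"la", 1}}, {{"et", 4}}} *)

(* Step 2: for each group, keep the word once and sum its counts *)
combined = {#[[1, 1]], Total[#[[All, 2]]]} & /@ groups
(* {{"the", 7}, {"of", 4}, {"la", 1}, {"et", 4}} *)
```

GatherBy keeps groups in order of first appearance, so the result is not sorted by frequency; apply a sort afterwards if you need that.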

This generalizes easily to any number of lists, not just two, because Join accepts any number of lists as arguments.
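As a rough sketch of that generalization, the same expression can be wrapped in a helper that takes a list of tallies. The names combineTallies and allLists are hypothetical, and the three toy lists are invented for illustration. The final SortBy is an optional extra, not part of the expression above, that restores a most-frequent-first order.

```mathematica
(* Toy data, invented for illustration *)
data1 = {{"the", 5}, {"of", 3}};
data2 = {{"the", 2}, {"et", 4}};
data3 = {{"of", 2}, {"the", 1}};

(* allLists is a hypothetical name for a list holding every tally *)
allLists = {data1, data2, data3};

(* Join @@ lists is Join[data1, data2, data3]: one flat list of {word, count} pairs *)
combineTallies[lists_List] :=
  {#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join @@ lists, First];

(* Optional: order the merged tally by descending count *)
merged = SortBy[combineTallies[allLists], -Last[#] &]
(* {{"the", 8}, {"of", 5}, {"et", 4}} *)
```

Tallies read in with Import can be placed in allLists in the same way as lists already held in variables.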
