How can I count unique terms in a plaintext file case-insensitively?
This can be in any high-level language that is likely to be available on a typical unix-like system (Python, Perl, awk, standard unix utils {sort, uniq}, etc). Hopefully it's fast enough to report the total number of unique terms for a 2MB text file.
I only need this for quick sanity-checking, so it doesn't need to be well-engineered.
Remember, case-insensitive.
Thank you guys very much.
Side note: If you use Python, please don't use version 3-only code. The system I'm running it on only has 2.4.4.
In Python 2.4 (it may work on earlier versions as well):
#! /usr/bin/python2.4
import sys

h = set()
for line in sys.stdin.xreadlines():
    for term in line.split():
        h.add(term.lower())  # lowercase so counting is case-insensitive
print len(h)
In Perl:
$ perl -ne 'for (split(" ", $_)) { $H{lc $_} = 1 } END { print scalar(keys %H), "\n" }' < file.txt
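Since the question also mentions the standard unix utils, a `tr`/`sort` pipeline works too. This is just a sketch: `file.txt` stands in for the real input, and terms are assumed to be whitespace-separated, as in the answers above.

```shell
# Create a small demo input (stand-in for the real 2MB file).
printf 'Apple apple BANANA banana Cherry\n' > file.txt

# Lowercase, split into one term per line, de-duplicate, count non-empty lines.
tr '[:upper:]' '[:lower:]' < file.txt \
  | tr -s '[:space:]' '\n' \
  | sort -u \
  | grep -c .
# → 3  (apple, banana, cherry)
```

On a 2MB file this should finish quickly; `sort -u` does the heavy lifting, and `grep -c .` avoids counting the empty line that `tr` can leave behind.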