How can I count unique terms in a plaintext file case-insensitively?
This can be in any high-level language that is likely to be available on a typical unix-like system (Python, Perl, awk, standard unix utils {sort, uniq}, etc). Hopefully it's fast enough to report the total number of unique terms for a 2MB text file.
I only need this for quick sanity-checking, so it doesn't need to be well-engineered.
Remember, case-insensitive.
Thank you guys very much.
Side note: If you use Python, please don't use version 3-only code. The system I'm running it on only has 2.4.4.
In Python 2.4 (it may work on earlier versions as well):
#! /usr/bin/python2.4
import sys

h = set()
for line in sys.stdin.xreadlines():
    for term in line.split():
        h.add(term.lower())  # lowercase so counting is case-insensitive
print len(h)
In Perl:
$ perl -ne 'for (split(" ", $_)) { $H{lc $_} = 1 } END { print scalar(keys %H), "\n" }' < file.txt
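Since the question also mentions the standard unix utils, a `tr`/`sort` pipeline works too. This is just a sketch: `file.txt` stands in for the real input, and terms are assumed to be whitespace-separated, as in the answers above.

```shell
# Create a small demo input (stand-in for the real 2MB file).
printf 'Apple apple BANANA banana Cherry\n' > file.txt

# Lowercase, split into one term per line, de-duplicate, count non-empty lines.
tr '[:upper:]' '[:lower:]' < file.txt \
  | tr -s '[:space:]' '\n' \
  | sort -u \
  | grep -c .
# → 3  (apple, banana, cherry)
```

On a 2MB file this should finish quickly; `sort -u` does the heavy lifting, and `grep -c .` avoids counting the empty line that `tr` can leave behind.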