Code Golf: Quickly Build List of Keywords from Text, Including # of Instances

Question

I've already worked out a solution to this for myself in PHP, but I'm curious how it could be done differently, or even better. The two languages I'm primarily interested in are PHP and JavaScript, but I'd also like to see how quickly this could be done in any other major language today (mostly C#, Java, etc.).

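  1. Return only words with an occurrence count greater than X
  2. Return only words with a length greater than Y
  3. Ignore common terms such as "and", "is", "the", etc.
  4. Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John")
  5. Return the results in a collection/array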

Extra Credit

  1. Keep quoted statements together (e.g. "They're obviously 'too good to be true'"),
    where 'too good to be true' would be the actual statement
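
One possible way to approach the quoted-statement requirement in the shell is sketched below. It is only a sketch under assumptions: the function name `quoted_phrases` is hypothetical, GNU grep and sed are assumed, and a single-quoted phrase is only recognised when its opening quote follows whitespace (so the apostrophe in "they're" is not mistaken for a quote). A phrase that itself contains an apostrophe would be cut short.

```shell
# Hypothetical helper (not from the original post): print every quoted
# phrase found on stdin, one per line, with the surrounding quotes removed.
quoted_phrases() {
  # Prefix each line with a space so a quote at the start of a line is
  # also preceded by whitespace.
  sed 's/^/ /' |
    # Match "double quoted" phrases, or 'single quoted' ones that follow whitespace.
    grep -oE "\"[^\"]+\"|[[:space:]]'[^']+'" |
    # Drop leading whitespace and the opening and closing quote characters.
    sed -E "s/^[[:space:]]*//; s/^[\"']//; s/[\"']\$//"
}
```

Piping the output through `sort | uniq -c | sort -nr` would then count each phrase as a single keyword.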

Extra-Extra Credit

  1. Can your script determine words that should be kept together based upon how frequently they are found together? This should be done without knowing the words beforehand. Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has led to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*

Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?
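
A simple way to get at this, sketched here under assumptions rather than offered as a definitive method, is to count adjacent word pairs (bigrams) and look at the most frequent ones. The function name `bigram_counts` is hypothetical. The sketch folds everything to lowercase, treats any non-letter as a separator, and does not stop at sentence boundaries, so a pair can span a full stop.

```shell
# Hypothetical helper (not from the original post): count adjacent word
# pairs read from stdin and print "count word1 word2", most frequent first.
bigram_counts() {
  tr '[:upper:]' '[:lower:]' |        # fold case so "The" and "the" match
    tr -cs '[:alpha:]' '\n' |         # one word per line, non-letters dropped
    awk 'NF { if (prev != "") print prev, $0; prev = $0 }' |  # emit each adjacent pair
    sort | uniq -c |                  # count identical pairs
    sort -k1,1nr -k2 |                # highest count first, ties alphabetical
    awk '{ print $1, $2, $3 }'        # normalise spacing
}
```

On the example paragraph, "fruit fly" and "the fruit" each occur three times. Dropping pairs that contain an ignored common term such as "the" would leave "fruit fly" as the top candidate.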

Source text: http://sampsonresume.com/labs/c.txt

Answer Format

  1. It would be great to see the results of your code and its output, in addition to how long the operation took.

Recommended Answer

GNU scripting

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr

Results:

  7 be
  6 to
[...]
  1 2.
  1 -

Occurrence greater than X (with the threshold held in the shell variable X):

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk -v X="$X" '$1>X'

Return only words with a length greater than Y (put Y+1 dots in the second grep):

sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c

Ignore common terms like "and, is, the, etc" (assuming that the common terms are in the file 'ignored', one per line; -x and -F make grep compare whole words literally, so "is" does not also remove "this"):

sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vxFf ignored | sort | uniq -c

Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John"):

sed -e "s/[,.:\"']//g;s/ /\n/g" | grep -v '^ *$' | sort | uniq -c

Return results in a collection/array: the output is already like an array for the shell: the first column is the count, the second is the word.
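
The individual steps above could be combined into one function, roughly as sketched below. This is an illustration under assumptions and not part of the original answer: the name `keywords` and its argument order are hypothetical, GNU sed is assumed for `\>`, words are folded to lowercase (which the pipelines above do not do), and all punctuation is stripped, including hyphens.

```shell
# Hypothetical helper combining the steps above.
# Usage: keywords MIN_COUNT MIN_LEN IGNORE_FILE < text
# Prints "count word" for words that occur more than MIN_COUNT times,
# are longer than MIN_LEN characters, and are not listed in IGNORE_FILE.
keywords() {
  tr '[:upper:]' '[:lower:]' |                      # fold case (an assumption)
    sed -e "s/'s\>//g" -e 's/[[:punct:]]//g' |      # "john's" -> "john", then drop punctuation
    tr -s '[:space:]' '\n' |                        # one word per line
    grep -vxFf "$3" |                               # drop ignored words (whole-line, literal match)
    awk -v n="$2" 'length($0) > n' |                # keep words longer than MIN_LEN
    sort | uniq -c |                                # count occurrences
    awk -v x="$1" '$1 > x { print $1, $2 }' |       # keep counts greater than MIN_COUNT
    sort -k1,1nr -k2                                # most frequent first
}
```

For example, `keywords 1 3 ignored < c.txt` would list words of more than three characters that appear more than once; prefixing the command with `time` reports how long the operation took.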
