Build an ASCII chart of the most commonly used words in a given text


Problem description

Build an ASCII chart of the most commonly used words in a given text.

Rules:

  • Only accept a-z and A-Z (alphabetic characters) as part of a word.
  • Ignore casing (She == she for our purpose).
  • Ignore the following words (quite arbitrary, I know): the, and, of, to, a, i, it, in, or, is
  • Clarification: don't is taken as 2 different 'words' made of characters in the ranges a-z and A-Z: don and t.
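
As a rough illustration of the rules above, here is a minimal Python sketch (readable, not golfed, and not one of the submitted answers). The names IGNORED and count_words are made up for this example.

```python
import re
from collections import Counter

# The ignore list from the rules (hypothetical constant name).
IGNORED = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}

def count_words(text):
    """Count case-insensitive words made only of a-z/A-Z, skipping IGNORED."""
    # Any non-letter (apostrophes, digits, punctuation) splits a word,
    # so "don't" yields the two words "don" and "t".
    words = (w.lower() for w in re.findall(r"[A-Za-z]+", text))
    return Counter(w for w in words if w not in IGNORED)
```

For instance, count_words("She said: don't!") counts she, said, don and t once each.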

Optionally (it's too late to be formally changing the specifications now) you may choose to drop all single-letter 'words' (this could potentially make for a shortening of the ignore list too).

Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build us a word frequency chart with the following characteristics:

  • Display the chart (also see the example below) for the 22 most common words (ordered by descending frequency).
  • The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.
  • Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possible differing bar and word lengths, e.g. the second most common word could be a lot longer than the first while not differing so much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).
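
One way to read the fitting rule, sketched in Python under a few assumptions that the spec does not state outright: the bar width is measured including both | characters, scaled widths are rounded down, and ties keep their input order. The function name render_chart and its parameters are hypothetical, so treat this as a sketch rather than the reference behaviour.

```python
from fractions import Fraction

def render_chart(freqs, top=22, max_width=80):
    """Turn (word, count) pairs into chart lines no wider than max_width."""
    rows = sorted(freqs, key=lambda wc: -wc[1])[:top]
    if not rows:
        return []
    # Each line is bar + space + word + space, so a bar may use at most
    # max_width - 2 - len(word) columns. The largest scale that works for
    # every row is the minimum of those limits divided by the counts.
    # Fraction keeps the arithmetic exact (no float rounding surprises).
    scale = min(Fraction(max_width - 2 - len(w), c) for w, c in rows)
    lines = []
    for i, (word, count) in enumerate(rows):
        width = int(count * scale)        # full bar width, rounded down
        inner = "_" * max(width - 2, 0)   # underscores between the pipes
        if i == 0:
            lines.append(" " + inner)     # line that closes the first bar on top
        lines.append("|" + inner + "| " + word + " ")
    return lines
```

Because the scale is taken as a minimum over all rows, a long word with a high count (not only the most frequent word) can be the one that limits the bar widths.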

Example:

The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).

This specific text would yield the following chart:


 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|____________________________________________________| alice 
|______________________________________________| was 
|__________________________________________| that 
|___________________________________| as 
|_______________________________| her 
|____________________________| with 
|____________________________| at 
|___________________________| s 
|___________________________| t 
|_________________________| on 
|_________________________| all 
|______________________| this 
|______________________| for 
|______________________| had 
|_____________________| but 
|____________________| be 
|____________________| not 
|___________________| they 
|__________________| so 


For your information: these are the frequencies the above chart is built upon:


[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358),
 ('that', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227),
 ('s', 219), ('t', 218), ('on', 204), ('all', 200), ('this', 181),
 ('for', 179), ('had', 178), ('but', 175), ('be', 167), ('not', 166),
 ('they', 155), ('so', 152)]

A second example (to check if you implemented the complete spec): Replace every occurrence of you in the linked Alice in Wonderland file with superlongstringstring:


 ________________________________________________________________
|________________________________________________________________| she 
|_______________________________________________________| superlongstringstring 
|_____________________________________________________| said 
|______________________________________________| alice 
|________________________________________| was 
|_____________________________________| that 
|______________________________| as 
|___________________________| her 
|_________________________| with 
|_________________________| at 
|________________________| s 
|________________________| t 
|______________________| on 
|_____________________| all 
|___________________| this 
|___________________| for 
|___________________| had 
|__________________| but 
|_________________| be 
|_________________| not 
|________________| they 
|________________| so 

The winner:

Shortest solution (by character count, per language). Have fun!

Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):


Language          Relaxed  Strict
=========         =======  ======
GolfScript          130     143
Perl                        185
Windows PowerShell  148     199
Mathematica                 199
Ruby                185     205
Unix Toolchain      194     228
Python              183     243
Clojure                     282
Scala                       311
Haskell                     333
Awk                         336
R                   298
Javascript          304     354
Groovy              321
Matlab                      404
C#                          422
Smalltalk           386
PHP                 450
F#                          452
TSQL                483     507

The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency etc). "Relaxed" means some liberties were taken to shorten the solution.

Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).

Recommended answer

LabVIEW 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.

The program flows from left to right.
