Build an ASCII chart of the most commonly used words in a given text
Problem description
Build an ASCII chart of the most commonly used words in a given text.
Rules:
- Only accept a-z and A-Z (alphabetic characters) as part of a word.
- Ignore casing (She == she for our purposes).
- Ignore the following words (quite arbitrary, I know): the, and, of, to, a, i, it, in, or, is

Clarification: consider don't. It is taken as 2 different 'words' in the ranges a-z and A-Z: don and t.
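The rules above can be sketched in a few lines of ungolfed Python. This is only an illustration of the word-splitting rules, not one of the submitted solutions; the names `IGNORED` and `count_words` are made up for this example.

```python
import re
from collections import Counter

# The ignore list from the rules (hypothetical constant name).
IGNORED = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}

def count_words(text):
    """Count words made only of a-z/A-Z, ignoring case and the ignore list."""
    # Lowercasing first makes "She" and "she" the same word.
    # Any non-letter (apostrophe, digit, whitespace) ends a word,
    # so "don't" yields the two words "don" and "t".
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in IGNORED)

counts = count_words("She said she don't know the way.")
print(counts.most_common(3))
```

Dropping single-letter words, as the optional relaxation allows, would amount to changing the pattern to `[a-z]{2,}`... but note that this would also turn a lone "t" from "don't" into nothing, which is the intended effect of that relaxation.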
Optionally (it's too late to formally change the specification now) you may choose to drop all single-letter 'words' (this could also allow the ignore list to be shortened).
Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build us a word frequency chart with the following characteristics:
- Display the chart (also see the example below) for the 22 most common words (ordered by descending frequency).
- The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.
- Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possibly differing bar and word lengths: e.g. the second most common word could be a lot longer than the first while not differing much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).
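One way to read the fitting rule is sketched below in ungolfed Python. It assumes that the two | characters count as part of the bar, so every line spends 4 characters besides the underscores and the word, and it rounds bar lengths down so a line can never overflow. Both points are assumptions; a different rounding rule may change individual bars by one character. The function name `render_chart` is hypothetical.

```python
from fractions import Fraction

def render_chart(freqs, top=22, max_width=80):
    """Return chart lines for (word, count) pairs, widest possible bars."""
    rows = sorted(freqs, key=lambda wc: wc[1], reverse=True)[:top]
    if not rows:
        return []
    # Assumed layout per line: "|" + underscores + "|" + " " + word + " ".
    overhead = 4
    # Every word limits the scale (underscores per occurrence) on its own;
    # the smallest limit is the largest scale that fits all lines.
    # Fractions keep the arithmetic exact, so the limiting word gets
    # exactly its maximum bar length.
    scale = min(Fraction(max_width - overhead - len(w), c) for w, c in rows)
    bars = [int(c * scale) for _, c in rows]  # rounded down (assumption)
    lines = [" " + "_" * bars[0]]             # closes the top of the first bar
    for (word, _), bar in zip(rows, bars):
        lines.append("|" + "_" * bar + "| " + word)
    return lines

print("\n".join(render_chart([("she", 553), ("you", 481), ("said", 462)])))
```

The scale is taken over all displayed words rather than only the most frequent one, which is what handles a long word with a high frequency further down the list.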
Example:
The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).
This specific text would yield the following chart:
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
For your information: these are the frequencies the above chart is built upon:
[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358),
 ('that', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227),
 ('s', 219), ('t', 218), ('on', 204), ('all', 200), ('this', 181),
 ('for', 179), ('had', 178), ('but', 175), ('be', 167), ('not', 166),
 ('they', 155), ('so', 152)]
A second example (to check if you implemented the complete spec):
Replace every occurrence of you in the linked Alice in Wonderland file with superlongstringstring:
________________________________________________________________
|________________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|________________________________________| was
|_____________________________________| that
|______________________________| as
|___________________________| her
|_________________________| with
|_________________________| at
|________________________| s
|________________________| t
|______________________| on
|_____________________| all
|___________________| this
|___________________| for
|___________________| had
|__________________| but
|_________________| be
|_________________| not
|________________| they
|________________| so
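Why the bars shrink in the second chart can be checked with a little arithmetic. The sketch below assumes, as the charts suggest, that each line spends 4 characters on the two | characters and the two spaces, leaving 76 minus the word length for underscores. The helper name `limiting_word` is made up for this illustration.

```python
from fractions import Fraction

def limiting_word(freqs, max_width=80):
    """Return (word, max underscores) for the word that caps the scale."""
    def limit(wc):
        word, count = wc
        # Underscores available to this word, per occurrence.
        return Fraction(max_width - 4 - len(word), count)
    word, _ = min(freqs, key=limit)
    return word, max_width - 4 - len(word)

first = [("she", 553), ("you", 481), ("said", 462), ("alice", 403)]
second = [("she", 553), ("superlongstringstring", 481),
          ("said", 462), ("alice", 403)]

print(limiting_word(first))   # the most frequent word sets the scale
print(limiting_word(second))  # the long word sets the scale instead
```

In the first chart she gets the full 76 - 3 = 73 underscores. In the second, superlongstringstring can have at most 76 - 21 = 55, and every other bar is scaled down by the same factor of 55/481 underscores per occurrence.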
The winner:
Shortest solution (by character count, per language). Have fun!
Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):
Language            Relaxed  Strict
=========           =======  ======
GolfScript            130     143
Perl                          185
Windows PowerShell    148     199
Mathematica                   199
Ruby                  185     205
Unix Toolchain        194     228
Python                183     243
Clojure                       282
Scala                         311
Haskell                       333
Awk                           336
R                     298
Javascript            304     354
Groovy                321
Matlab                        404
C#                            422
Smalltalk             386
PHP                   450
F#                            452
TSQL                  483     507
The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency, etc.). "Relaxed" means some liberties were taken to shorten the solution.
Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use a traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).
Recommended answer
LabVIEW 51 nodes, 5 structures, 10 diagrams
Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.
Program flows from left to right: