WordCount: how inefficient is McIlroy's solution?


Question

Long story short: in 1986 an interviewer asked Donald Knuth to write a program that takes a text and a number N as input, and lists the N most used words sorted by their frequencies. Knuth produced a 10-page Pascal program, to which Douglas McIlroy replied with the following 6-line shell script:

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/
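For reference, the pipeline can be saved as a script (hypothetical filename wordfreq.sh) whose first argument is N, and fed text on stdin:

```shell
# A usage sketch: wrap McIlroy's pipeline in a script; $1 is N.
cat > wordfreq.sh <<'EOF'
#!/bin/sh
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
EOF
chmod +x wordfreq.sh

# Top 2 words of a tiny sample: "the" occurs 3 times, "and" twice.
printf 'The cat and the dog and the bird\n' | ./wordfreq.sh 2
```

With a real text it would be invoked the same way, e.g. `./wordfreq.sh 10 < book.txt`.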

Of course they had very different goals: Knuth was showing his concepts of literate programming and built everything from scratch, while McIlroy used a few common UNIX utilities to achieve the shortest source code.

My question is: how bad is that?
(Purely from a runtime-speed point of view, since I'm pretty sure we all agree that 6 lines of code is easier to understand/maintain than 10 pages, literate programming or not.)

I can understand that sort -rn | sed ${1}q may not be the most efficient way to extract the common words, but what's wrong with tr -cs A-Za-z '\n' | tr A-Z a-z? It looks pretty good to me. About sort | uniq -c, is that a terribly slow way to determine the frequencies?

Some notes:

  • tr should be linear time (?)
  • sort I'm not sure of, but I'm assuming it's not that bad
  • uniq should be linear time too
  • spawning processes should be linear time (in the number of processes)

Answer

The Unix script has a few linear operations and two sorts, so overall it will be of computation order O(n log(n)).
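The n in that bound comes from the first sort, which sees every word occurrence; the second sort only sees the distinct lines produced by uniq -c, which for natural language is usually a much smaller number. A quick way to compare the two input sizes on a sample (a sketch; words.txt is a scratch file):

```shell
# Compare n (total word occurrences, the first sort's input) with
# d (distinct words, the second sort's input) for a tiny sample.
printf 'The cat and the dog and the bird\n' |
tr -cs A-Za-z '\n' | tr A-Z a-z > words.txt
n=$(grep -c . words.txt)            # total word occurrences
d=$(sort -u words.txt | grep -c .)  # distinct words
echo "n=$n d=$d"
```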

For Knuth's algorithm, which takes only the top N, see selection algorithms: http://en.wikipedia.org/wiki/Selection_algorithm. There you have a few options for the time and space complexity of the algorithm, and theoretically they can be faster for some typical examples with a large number of (distinct) words.
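As a toy illustration of that selection idea in shell terms (my own sketch, not from either original solution): scan the count/word lines once and keep only the current best N, which is O(d*N) work instead of sorting all d distinct lines:

```shell
# Keep a small sorted list of the top n lines by numeric count while
# scanning once; only the current top n entries are ever stored.
printf '3 the\n1 cat\n2 and\n1 dog\n1 bird\n' |
awk -v n=2 '
  {
    for (i = 1; i <= n; i++)
      if ($1 > cnt[i]) {
        for (j = n; j > i; j--) { cnt[j] = cnt[j-1]; line[j] = line[j-1] }
        cnt[i] = $1; line[i] = $0
        break
      }
  }
  END { for (i = 1; i <= n && line[i] != ""; i++) print line[i] }
'
```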

So Knuth could be faster, certainly because the English dictionary has limited size: that could turn the log(n) into some large constant, though perhaps while consuming a lot of memory.
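The counting part of that idea can be approximated in the shell with a single awk pass (a sketch of hash-based counting, not Knuth's actual trie-based program): count in an associative array, so only the distinct words reach the final sort.

```shell
# Hash-based counting: one linear pass builds the table; only distinct
# (count, word) pairs, not every word occurrence, go through sort -rn.
printf 'The cat and the dog and the bird\n' |
tr -cs A-Za-z '\n' |
tr A-Z a-z |
awk 'NF { count[$0]++ } END { for (w in count) print count[w], w }' |
sort -rn |
sed 2q
```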

This question would be a better fit for https://cstheory.stackexchange.com/
