查找一个文件N个最大行:你会如何加以改善? [英] Find N largest lines from a file: how would you make this better?

查看:165
本文介绍了查找一个文件N个最大行:你会如何加以改善?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是从一个潜在的雇主提交该code后,最近被拒绝。他们建议我是不是技术上有足够的能力。我想知道,如果有人可以阐明如何使这更好/更高效的光。

现在的问题是要找到从多行的文件N个最长的线路。这最终归结为一个排序问题,所以我建立了一个算法,从数字的列表作为这样找了N人数最多的:

 高清选择(数字,N):

    最大= []

    x的范围(0,n)的:

        maximum.append(数字[X])
        IND = X

        y的范围在(X,LEN(数字)):
            如果[Y]>最大[LEN(最大值)-1]:
                最大[LEN(最大值)-1] =号[Y]
                号码[IND],数字[Y] =号[Y],数字[IND]

    收益最大
 

这在运行的 O(N)的,除非N = N,在这种情况下,它运行在的为O(n ^ 2)的。我很惊讶地听到他们怀疑自己的技术能力,所以我想我把它给你。如何使这更好的?

编辑:感谢您的反馈。为了澄清:我填充了线由行字计数从文件的列表,并通过这个功能运行它

EDIT2:有些人提到的语法。我只是一直在做的Python大约一两天。我的老板建议我把它写在Python(和我说过,我不知道的Python),所以我认为小的语法错误和方法不会是这样一个问题。

EDIT3:原来我最初的推理是有缺陷的选择排序。我曾在我的头上,一个最小堆将是nlogn,但我忘了,我的code的平均复杂度为n ^ 2。感谢大家的帮助。

解决方案

 从heapq进口nlargest

高清longest_lines(N,文件名):
    开放(文件名)输入:
        返回nlargest(N,输入,键= LEN)
 

好吧,处理意见如下:

 高清longest_lines(N,文件名):
    堆= []
    开放(文件名)输入:
        为LN的文件名:
            推(堆,LN)
            如果len(堆)> N:
                POP(堆)
    回报堆
 

其中,流行都是好老分堆的插入和删除最小,可以发现算法在任何一本教科书(而且我从来没有得到正确的一步到位,所以我不张贴现在),通过它们的长度比较行。这将运行在O( N 的×LG( N 的))时间,其中的 N 的是文件中的行数,消耗O( N 的)临时空间。

请注意所产生的列表不按长度排序,但补充说,可以通过弹出的元素,直到堆是空的,扭转这个结果来实现。

I was recently rejected from a potential employer after submitting this code. They suggested I wasn't technically capable enough. I'm wondering if someone could shed light on to how to make this better/more efficient.

The question was to find the N longest lines from a file of multiple lines. This ultimately boiled down to a sorting problem, so I built an algorithm to find the N largest numbers from a list of numbers as so:

def selection(numbers, n):

    maximum = []

    for x in range (0, n):

        maximum.append(numbers[x])
        ind = x

        for y in range ( x, len(numbers) ):
            if numbers[y] > maximum[len(maximum)-1]:
                maximum[len(maximum)-1] = numbers[y]
                numbers[ind], numbers[y] = numbers[y], numbers[ind]

    return maximum

This runs in O(n), unless N = n, in which case it runs in O(n^2). I was surprised to hear them doubt my technical abilities, so I thought I'd bring it to you SO. How do I make this better?

EDIT: Thanks for the feedback. To clarify: I populated a list with the line-by-line word-counts from the file, and ran it through this function.

EDIT2: Some people mentioned syntax. I've only been doing Python for about a day or two. My employer suggested I write it in Python (and I mentioned that I didn't know Python), so I assumed small syntax errors and methods wouldn't be such an issue.

EDIT3: Turns out my initial reasoning was flawed with the selection sort. I had it in my head that a min-heap would be nlogn, but I forgot that the average complexity for my code is n^2. Thanks for the help everyone.

解决方案

from heapq import nlargest

def longest_lines(n, filename):
    with open(filename) as input:
        return nlargest(n, input, key=len)

Alright, addressing the comments below:

def longest_lines(n, filename):
    heap = []
    with open(filename) as input:
        for ln in filename:
            push(heap, ln)
            if len(heap) > n:
                pop(heap)
    return heap

where push and pop are the good old min-heap insert and delete-min algorithms that can be found in any textbook (and that I never get right in one go, so I'm not posting them now), comparing lines by their length. This runs in O(N×lg(n)) time where N is the number of lines in the file, consuming O(n) temporary space.

Note that the resulting list is not sorted by length, but adding that can be done by popping the elements until the heap is empty and reversing the result of that.

这篇关于查找一个文件N个最大行:你会如何加以改善?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆