Conversion to Logn Python 3.7

Problem Description

I have this code that works great and does what I want; however, it runs in linear form, which is way too slow for the size of my data files, so I want to convert it to Log. I tried this code and many others posted here, but still no luck getting it to work. I will post both sets of code and give examples of what I expect.

import fileinput

'''This code runs fine and does what I expect, removing duplicates from the
big file that are in the small file; however, it is a linear search.'''

with open('small.txt') as fin:
    exclude = set(line.rstrip() for line in fin)
    for line in fileinput.input('big.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line, end='')
        else:
            print('')

'''This code is my attempt at conversion to a log function.
It runs without error but does nothing (see the answer below).'''


def log_search(small, big):
    first = 0
    last = len(big.txt) - 1
    while first <= last:
        mid = (first + last) / 2
        if str(mid) == small.txt:
            return True
        elif small.txt < str(mid):
            last = mid - 1
        else:
            first = mid + 1
    with open('small.txt') as fin:
        exclude = set(line.rstrip() for line in fin)
        for line in fileinput.input('big.txt', inplace=True):
            if line.rstrip() not in exclude:
                print(line, end='')
            else:
                print('')
            return log_search(small, big)

  1. The big file has millions of lines of int data.
  2. The small file has hundreds of lines of int data.
  3. Compare the data and remove the duplicates in the big file, but leave the line numbers blank (illustrated below).
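
For illustration, here is a hypothetical end-to-end run of the first code block (the sample numbers are made up; the original question does not include example data):

import fileinput

# Build tiny sample inputs (hypothetical data, not from the question).
with open('big.txt', 'w') as f:
    f.writelines(f"{n}\n" for n in [1, 2, 3, 4, 5, 6])
with open('small.txt', 'w') as f:
    f.writelines(f"{n}\n" for n in [3, 5])

# Run the set-based removal from the first code block above.
with open('small.txt') as fin:
    exclude = set(line.rstrip() for line in fin)
for line in fileinput.input('big.txt', inplace=True):
    if line.rstrip() not in exclude:
        print(line, end='')
    else:
        print('')

# big.txt now contains: 1, 2, <blank>, 4, <blank>, 6 -- the duplicates
# are removed but their line positions are kept as blank lines.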

Running the first block of code works, but it takes too long to search through the big file. Maybe I am approaching the problem in the wrong way. My attempt at converting it to log runs without error but does nothing.

Recommended Answer

I don't think there is a better or faster way to do this than what you are currently doing in your first approach. (Update: There is, see below.) Storing the lines from small.txt in a set and iterating over the lines in big.txt, checking whether they are in that set, has a complexity of O(b), with b being the number of lines in big.txt.

What you seem to be trying is to reduce this to O(s*log b), with s being the number of lines in small.txt, by using binary search to check for each line in small.txt whether it is in big.txt, and then removing/overwriting it. With, say, b = 10,000,000 and s = 500, that would be roughly 500 * 24 ≈ 12,000 binary-search probes instead of 10,000,000 line comparisons.

This would work well if all the lines were in a list with random access to any element, but you have just the file, which does not allow random access to any line. It does, however, allow random access to any character with file.seek, which (at least in some cases?) seems to be O(1). But then you would still have to find the line break preceding that position before you can actually read the line. Also, you cannot just replace lines with empty lines; you have to overwrite the number with the same number of characters, e.g. spaces.

So, yes, theoretically it can be done in O(s*log b), if you do the following:

  • implement binary search, searching not on the lines, but on the characters of the big file
    • for each position, backtrack to the last line break, then read the line to get the number
    • try again in the lower/upper half as usual with binary search

On my system, reading and writing a file with 10 million lines of numbers only took 3 seconds each, or about 8 seconds with fileinput.input and print. Thus, IMHO, this is not really worth the effort, but of course this may depend on how often you have to do this operation.
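
For reference, a minimal sketch of how such a timing comparison could be reproduced (big.txt is the file from the question; big_copy.txt and the use of time.perf_counter are assumptions for illustration):

import time
import fileinput

# Time a plain line-by-line read/write pass.
start = time.perf_counter()
with open('big.txt') as src, open('big_copy.txt', 'w') as dst:
    for line in src:
        dst.write(line)
print('plain read+write:', time.perf_counter() - start, 'seconds')

# Time the same rewrite via fileinput.input(..., inplace=True) and print.
start = time.perf_counter()
with fileinput.input('big_copy.txt', inplace=True) as f:
    for line in f:
        print(line, end='')  # stdout is redirected into the file here
print('fileinput + print:', time.perf_counter() - start, 'seconds')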

Okay, so I got curious myself (and who needs a lunch break anyway?), so I tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with a matching number of - characters (not just a blank line; that is impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one errors, etc.

import os
import random

def getlineat(f, pos):
    # Move f back from byte offset pos to the start of the containing
    # line; returns that offset and leaves f positioned there.
    pos = f.seek(pos)
    while pos > 0 and f.read(1) != "\n":
        pos = f.seek(pos-1)
    return pos+1 if pos > 0 else 0

def bsearch(f, num):
    # Binary search over byte offsets in the sorted file f; returns
    # (start offset, length) of the line holding num, or (-1, -1).
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line: break # end of file
        val = int(line)
        if val == num:
            return (pos, len(line.strip()))
        elif num < val:
            upper = mid - 1
        elif num > val:
            lower = mid + 1
    return (-1, -1)

def overwrite(filename, to_remove):
    # First collect all positions, then overwrite in a second pass
    # (see the explanation below the code).
    with open(filename, "r+") as f:
        positions = [bsearch(f, n) for n in to_remove]
        for n, (pos, length) in sorted(zip(to_remove, positions)):
            print(n, pos)
            if pos != -1:
                f.seek(pos)
                f.write("-" * length)

to_remove = [random.randint(-500, 1500) for _ in range(10)]
overwrite("test.txt", to_remove)
      

This will first collect all the positions to be overwritten, and then do the actual overwriting in a second step; otherwise the binary search will have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in and out of bounds) to be removed, and it worked just fine.
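
A minimal sketch for generating such a test file, assuming the test.txt file name used in the code above:

with open("test.txt", "w") as f:
    # One number per line, in sorted order, as bsearch requires.
    for i in range(1001):  # 0 to 1,000 inclusive
        f.write(f"{i}\n")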

Update: I also tested it with a file holding random numbers from 0 to 100,000,000 in sorted order (944 MB) and overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*log b), at least on my system (the complexity of file.seek may depend on the file system, file type, etc.).

The bsearch function could also be generalized to accept another parameter value_function instead of hardcoding val = int(line). Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, CSV files, etc., as long as the lines are sorted by that same value function.
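
A minimal sketch of that generalization (the bsearch_by name and its key-extraction logic are assumptions based on the suggestion above, not code from the original answer; it reuses getlineat from the code block above):

import os

def bsearch_by(f, key, value_function=int):
    # Hypothetical generalized variant: value_function extracts the
    # sort key from a line; the file must be sorted by that same key.
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line:
            break  # past end of file
        val = value_function(line)
        if val == key:
            return (pos, len(line.strip()))
        elif key < val:
            upper = mid - 1
        else:
            lower = mid + 1
    return (-1, -1)

# e.g. searching a CSV file sorted by its first column:
# bsearch_by(f, "smith", value_function=lambda line: line.split(",")[0])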
