二进制搜索具有未知行长度的巨大文件 [英] Binary search over a huge file with unknown line length
问题描述
我正在使用庞大的数据CSV文件。每个文件包含数百万条记录,每条记录都有一个密钥。记录按其密钥排序。我不想在搜索certian数据时查看整个文件。
我见过这个解决方案:用Python阅读巨大的文件
I'm working with huge data CSV file. Each file contains milions of record , each record has a key. The records are sorted by thier key. I dont want to go over the whole file when searching for certian data. I've seen this solution : Reading Huge File in Python
但是它表明你在文件上使用相同长度的行 - 在我的情况下不支持。
But it suggests that you use the same length of lines on the file - which is not supported in my case.
我想在每行添加一个填充然后保持一个固定的行长度,但我想知道是否有更好的方法。
I thought about adding a padding to each line and then keeping a fixed line length , but I'd like to know if there is a better way to do it.
我正在使用python
I'm working with python
推荐答案
您不必拥有固定宽度记录,因为您不必执行此操作面向记录的搜索。相反,您可以只进行面向字节的搜索,并确保在进行搜索时重新对齐键。这是一个(可能是错误的)示例,说明如何修改链接到从面向记录到面向字节的解决方案:
You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:
bytes = 24935502 # number of entries
for i, search in enumerate(list): # list contains the list of search keys
left, right = 0, bytes - 1
key = None
while key != search and left <= right:
mid = (left + right) / 2
fin.seek(mid)
# now realign to a record
if mid:
fin.readline()
key, value = map(int, fin.readline().split())
if search > key:
left = mid + 1
else:
right = mid - 1
if key != search:
value = None # for when search key is not found
search.result = value # store the result of the search
这篇关于二进制搜索具有未知行长度的巨大文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!