What is the fastest way to search the csv file?


Problem description

Task: Check the availability of the series and passport number in the file.

My solution is as follows:

import csv
import datetime

def check_passport(filename, series: str, number: str) -> dict:
    """
    Find passport number and series
    :param filename:csv filename path
    :param series: passport series
    :param number: passport number
    :return:
    """
    print(f'series={series}, number={number}')
    find = False
    start = datetime.datetime.now()
    with open(filename, 'r', encoding='utf_8_sig') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        try:
            for row in reader:
                if row[0] == series and row[1] == number:
                    print(row[0])
                    print(row[1])
                    find = True
                    break
        except Exception as e:
            print(e)
    print(datetime.datetime.now() - start)
    if find:
        return {'result': True, 'message': 'Passport found'}
    else:
        return {'result': False, 'message': 'Passport not found in Database'}

This is part of the csv file:

PASSP_SERIES,PASSP_NUMBER
3604,015558
6003,711925
6004,461914
6001,789369

If the passport is not in the file, the timing is even worse, since you need to check all the lines. My best time is 53 seconds.

Answer

The CSV file format is a convenient and simple file format.

It is not intended for analysis or fast searching; that was never the goal. It is good for exchange between different applications and for tasks where you have to process all entries or where the number of entries is not very large.

If you want to speed things up, you should read the CSV file once, convert it to a database, e.g. sqlite, and then perform all searches in the database. If passport numbers are unique, you could even just use a simple dbm file or a Python shelve.
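For illustration, a minimal shelve-based sketch (the file names are invented for this example, and it assumes series and number together form a unique key):

import csv
import shelve

def build_shelf(csv_fname, shelf_fname='passports_shelf'):
    """One-time conversion: store 'series,number' keys in a shelve (dbm) file."""
    with shelve.open(shelf_fname) as db, \
            open(csv_fname, encoding='utf_8_sig') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the header line
        for series, number in reader:
            db[series + ',' + number] = True

def check_passport_shelf(shelf_fname, series, number):
    # a dbm lookup is a hashed key access instead of a full file scan
    with shelve.open(shelf_fname) as db:
        return (series + ',' + number) in db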

Database performance can be optimized by adding indexes to the fields that you search on.
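A minimal sqlite sketch of this idea (the table and file names are made up for the example, not the author's actual code):

import csv
import sqlite3

def build_db(csv_fname, db_fname='passports.db'):
    """One-time conversion of the CSV file into an indexed sqlite table."""
    con = sqlite3.connect(db_fname)
    con.execute('CREATE TABLE IF NOT EXISTS passports (series TEXT, number TEXT)')
    with open(csv_fname, encoding='utf_8_sig') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the header line
        con.executemany('INSERT INTO passports VALUES (?, ?)', reader)
    # the index on the searched fields is what makes single lookups fast
    con.execute('CREATE INDEX IF NOT EXISTS idx_sn ON passports (series, number)')
    con.commit()
    return con

def check_passport_db(con, series, number):
    cur = con.execute(
        'SELECT 1 FROM passports WHERE series = ? AND number = ? LIMIT 1',
        (series, number))
    return cur.fetchone() is not None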

It all depends on how often the CSV file changes and how often you perform searches, but this approach should generally yield better results.

I never really used pandas, but perhaps it is more performant for searching/filtering, though it will never beat searching in a real database.
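Untested, but a pandas version would presumably look like this (column names taken from the sample file; note that it still reads the whole file on every call):

import pandas as pd

def check_passport_pd(filename, series, number):
    # dtype=str preserves leading zeros in the number column
    df = pd.read_csv(filename, dtype=str, encoding='utf_8_sig')
    hits = df[(df['PASSP_SERIES'] == series) & (df['PASSP_NUMBER'] == number)]
    return not hits.empty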

If you want to go down the sqlite or dbm road, I can help with some code.

Addendum (searching in a sorted csv file with a bisect search before reading it with the csv reader):

If the first field in your csv file is the series, then there is another approach (or if you are willing to transform the csv file so that it can be sorted with gnu sort).

Just sort your file (easy to do with gnu sort on a linux system; it can sort huge files without 'exploding' the memory), and the sorting time should not be much higher than the search time that you have at the moment.
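Calling gnu sort from Python could look roughly like this (a sketch assuming a linux system with tail and sort on the PATH; the output filename is invented):

import subprocess

def sort_csv(src, dst='passports_sorted.csv'):
    with open(src, 'rb') as fin:
        header = fin.readline()  # keep the header line out of the sort
    with open(dst, 'wb') as fout:
        fout.write(header)
        fout.flush()
        # tail -n +2 skips the header; gnu sort handles huge files on disk
        tail = subprocess.Popen(['tail', '-n', '+2', src], stdout=subprocess.PIPE)
        subprocess.run(['sort', '-t,', '-k1,1', '-k2,2'],
                       stdin=tail.stdout, stdout=fout, check=True)
        tail.stdout.close()
        tail.wait()
    return dst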

Then use a bisect/seek search in your file for the first line with the right series, and use your existing function with a minor modification.

This will give you results within a few milliseconds. I tried with a randomly created csv file with 30 million entries and a size of about 1.5G.

If running on a linux system, you could even change your code so that it creates a sorted copy of the downloaded csv file whenever the csv file changes. (Sorting on my machine took about 1 to 2 minutes.) So already after 2 to 3 searches per week this would be worth the effort.
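To re-sort only when the downloaded file actually changed, a modification-time check is enough (a sketch; sort_csv is the hypothetical helper from the sketch above):

import os

def ensure_sorted(src, dst='passports_sorted.csv'):
    """Re-create the sorted copy only if the source file is newer."""
    if not os.path.exists(dst) or os.path.getmtime(dst) < os.path.getmtime(src):
        sort_csv(src, dst)  # source changed since the last sort
    return dst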

import csv
import datetime
import os

def get_line_at_pos(fin, pos):
    """ fetches first complete line at offset pos
        always skips header line
    """
    fin.seek(pos)
    skip = fin.readline()
    # next line for debugging only
    # print("Skip@%d: %r" % (pos, skip))
    npos = fin.tell()
    assert pos + len(skip) == npos
    line = fin.readline()
    return npos, line


def bisect_seek(fname, field_func, field_val):
    """ returns a file postion, which guarantees, that you will
        encounter all lines, that migth encounter field_val
        if the file is ordered by field_val.
        field_func is the function to extract field_val from a line
        The search is a bisect search, with a complexity of log(n)
    """
    size = os.path.getsize(fname)
    minpos, maxpos, cur = 0, size, int(size / 2)

    with open(fname) as fin:
        # next line just for debugging
        state = "?"
        prev_pos = -1
        while True:  # find first id smaller than the one we search
            # next line just for debugging
            pos_str = "%s %10d %10d %10d" % (state, minpos, cur, maxpos)
            realpos, line = get_line_at_pos(fin, cur)
            val = field_func(line)
            # next line just for debugging
            pos_str += "# got @%d: %r %r" % (realpos, val, line)
            if val >= field_val:
                state = ">"
                maxpos = cur
                cur = int((minpos + cur) / 2)
            else:
                state = "<"
                minpos = cur
                cur = int((cur + maxpos) / 2)
            # next line just for debugging
            # print(pos_str)
            if prev_pos == cur:
                break
            prev_pos = cur
    return realpos


def getser(line):
    return line.split(",")[0]


def check_passport(filename, series: str, number: str) -> dict:
    """
    Find passport number and series
    :param filename:csv filename path
    :param series: passport series
    :param number: passport number
    :return:
    """
    print(f'series={series}, number={number}')
    found = False
    row = None  # guard for the debug prints below when no row is read
    start = datetime.datetime.now()
    # find position from which we should start searching
    pos = bisect_seek(filename, getser, series)
    with open(filename, 'r', encoding='utf_8_sig') as csvfile:
        csvfile.seek(pos)
        reader = csv.reader(csvfile, delimiter=',')
        try:
            for row in reader:
                if row[0] == series and row[1] == number:
                    found = True
                    break
                elif row[0] > series:
                    # as file is sorted we know we can abort now
                    break
        except Exception as e:
            print(e)
    print(datetime.datetime.now() - start)
    if found:
        print("good row", row)
        return {'result': True, 'message': f'Passport found'}
    else:
        print("bad row", row)
        return {'result': False, 'message': f'Passport not found in Database'}
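Usage would then look like this (the sorted filename is assumed; series and number taken from the sample data):

result = check_passport('passports_sorted.csv', '6003', '711925')
print(result)  # -> {'result': True, 'message': 'Passport found'}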

Addendum 2019-11-30: Here is a script to split your huge file into smaller chunks and sort each chunk. (I didn't want to implement a full merge sort, as in this context searching in each of the chunks is already efficient enough. If interested in more, I suggest trying to implement a merge sort, or posting a question about sorting huge files under windows with python.)

split_n_sort_csv.py:

import itertools
import sys
import time

def main():
    args = sys.argv[1:]
    t = t0 = time.time()
    with open(args[0]) as fin:
        headline = next(fin)  # keep the header to replicate it in every chunk
        for idx in itertools.count():
            print(idx, "r")
            tprev = t
            # read the next chunk of up to 10 million lines
            lines = list(itertools.islice(fin, 10000000))
            if not lines:
                break  # no more data; avoids writing a header-only chunk
            t = time.time()
            t_read = t - tprev
            tprev = t
            print("s")
            lines.sort()  # sort the chunk in memory
            t = time.time()
            t_sort = t - tprev
            tprev = t
            print("w")
            with open("bla_%03d.csv" % idx, "w") as fout:
                fout.write(headline)
                for line in lines:
                    fout.write(line)
            t = time.time()
            t_write = t - tprev
            tprev = t

            print("%4.1f %4.1f %4.1f" % (t_read, t_sort, t_write))
    t = time.time()
    print("Total of %5.1fs" % (t-t0))

if __name__ == "__main__":
    main()

And here is a modified version that searches in all chunk files.

import csv
import datetime
import itertools
import os

ENCODING='utf_8_sig'

def get_line_at_pos(fin, pos, enc_encoding="utf_8"):
    """ fetches first complete line at offset pos
        always skips header line
    """
    while True:
        fin.seek(pos)
        try:
            skip = fin.readline()
            break
        except UnicodeDecodeError:
            pos += 1

    # print("Skip@%d: %r" % (pos, skip))
    npos = fin.tell()
    assert pos + len(skip.encode(enc_encoding)) == npos
    line = fin.readline()
    return npos, line

def bisect_seek(fname, field_func, field_val, encoding=ENCODING):
    size = os.path.getsize(fname)
    vmin, vmax, cur = 0, size, int(size / 2)
    if encoding.endswith("_sig"):
        enc_encoding = encoding[:-4]
    else:
        enc_encoding = encoding
    with open(fname, encoding=encoding) as fin:
        state = "?"
        prev_pos = -1
        while True:  # find first id smaller than the one we search
            # next line only for debugging
            pos_str = "%s %10d %10d %10d" % (state, vmin, cur, vmax)
            realpos, line = get_line_at_pos(fin, cur, enc_encoding=enc_encoding)
            val = field_func(line)
            # next line only for debugging
            pos_str += "# got @%d: %r %r" % (realpos, val, line)
            if val >= field_val:
                state = ">"
                vmax = cur
                cur = int((vmin + cur) / 2)
            else:
                state = "<"
                vmin = cur
                cur = int((cur + vmax) / 2)
            # next line only for debugging
            # print(pos_str)
            if prev_pos == cur:
                break
            prev_pos = cur
    return realpos

def getser(line):
    return line.split(",")[0]

def check_passport(filename, series: str, number: str,
        encoding=ENCODING) -> dict:
    """
    Find passport number and series
    :param filename:csv filename path
    :param series: passport series
    :param number: passport number
    :return:
    """
    print(f'series={series}, number={number}')
    found = False
    row = None  # guard for the debug prints below when no row is read
    start = datetime.datetime.now()
    for ctr in itertools.count():
        fname = filename % ctr
        if not os.path.exists(fname):
            break
        print(fname)
        pos = bisect_seek(fname, getser, series)
        with open(fname, 'r', encoding=encoding) as csvfile:
            csvfile.seek(pos)
            reader = csv.reader(csvfile, delimiter=',')
            try:
                for row in reader:
                    if row[0] == series and row[1] == number:
                        found = True
                        break
                    elif row[0] > series:
                        break
            except Exception as e:
                print(e)
        if found:
            break
    print(datetime.datetime.now() - start)
    if found:
        print("good row in %s: %d", (fname, row))
        return {'result': True, 'message': f'Passport found'}
    else:
        print("bad row", row)
        return {'result': False, 'message': f'Passport not found in Database'}

To test, call:

check_passport("bla_%03d.csv", series, number)
