Minimise search time for python in a large CSV file


Problem Description

I have a CSV file with about 700 rows and 8 columns; the last column, however, holds a very large block of text (enough for multiple long paragraphs in each).

I'd like to implement, in Python, a text-search function that gives me back all the rows whose 8th-column data contains matching text (meaning it would need to scan the whole thing).

What could possibly be the quickest way to approach this and minimise search-time?

Recommended Answer

You could dump your CSV file into an sqlite database and use sqlite's full-text search capabilities to do the search for you.

This example code shows how it could be done. There are a few things to be aware of:

  • It assumes that the CSV file has a header row, and that the header values will make legal column names in sqlite. If this isn't the case, you'll need to quote them (or just use generic names like "col1", "col2" etc.).
  • It searches all columns in the CSV; if that's undesirable, filter out the other columns (and header values) before creating the SQL statements.
  • If you want to be able to match the results to rows in the CSV file, you'll need to create a column that contains the line number.
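The last point can be sketched like this before the full example that follows: a minimal, self-contained snippet (with made-up data and a column name, `lineno`, chosen here for illustration) that carries an explicit line-number column through the FTS table so matches can be traced back to rows in the original CSV. Marking it UNINDEXED keeps it out of the full-text index.

```python
import csv
import io
import sqlite3

# Tiny in-memory stand-in for a CSV file (header row plus two data rows).
data = "title,body\nfirst,hello world\nsecond,goodbye world\n"

conn = sqlite3.connect(':memory:')
reader = csv.reader(io.StringIO(data))
headers = next(reader)
# Prepend a line-number column; UNINDEXED excludes it from text search.
cols = ', '.join(['lineno UNINDEXED'] + headers)
conn.execute(f"CREATE VIRTUAL TABLE mytable USING fts5({cols})")
# Data rows start on line 2 of the file (line 1 is the header).
rows = ((str(i), *row) for i, row in enumerate(reader, start=2))
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)

# The line number travels along with each match.
hits = conn.execute(
    "SELECT lineno, title FROM mytable WHERE mytable MATCH ?",
    ('hello',),
).fetchall()
conn.close()
```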
import csv
import sqlite3
import sys


def create_table(conn, headers, name='mytable'):
    cols = ', '.join([x.strip() for x in headers])
    stmt = f"""CREATE VIRTUAL TABLE {name} USING fts5({cols})"""
    with conn:
        conn.execute(stmt)
    return


def populate_table(conn, reader, ncols, name='mytable'):
    placeholders = ', '.join(['?'] * ncols)
    stmt = f"""INSERT INTO {name}
    VALUES ({placeholders})
    """
    with conn:
        conn.executemany(stmt, reader)
    return


def search(conn, term, headers, name='mytable'):
    cols = ', '.join([x.strip() for x in headers])
    stmt = f"""SELECT {cols}
    FROM {name}
    WHERE {name} MATCH ?
    """
    with conn:
        cursor = conn.cursor()
        cursor.execute(stmt, (term,))
        result = cursor.fetchall()
    return result


def main(path, term):
    result = 'NO RESULT SET'
    # Create an in-memory database. Connect outside the try block so
    # the finally clause never sees an unbound `conn`.
    conn = sqlite3.connect(':memory:')
    try:
        with open(path, 'r', newline='') as f:
            reader = csv.reader(f)
            # Assume headers are in the first row.
            headers = next(reader)
            create_table(conn, headers)
            ncols = len(headers)
            populate_table(conn, reader, ncols)
        result = search(conn, term, headers)
    finally:
        conn.close()
    return result


if __name__ == '__main__':
    print(main(*sys.argv[1:]))
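The term passed to search() is interpreted by sqlite's FTS5 query syntax, so prefix, phrase, and boolean queries work without extra code. A small standalone demonstration (with made-up data):

```python
import sqlite3

# Demo of the FTS5 MATCH syntax that search() passes straight to sqlite.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(body)")
conn.executemany("INSERT INTO notes VALUES (?)", [
    ('the quick brown fox',),
    ('a slow brown dog',),
    ('quick silver lining',),
])

def hits(term):
    return [r[0] for r in conn.execute(
        "SELECT body FROM notes WHERE notes MATCH ?", (term,))]

bare = hits('quick')               # plain term
prefix = hits('qui*')              # prefix query
phrase = hits('"brown fox"')       # exact phrase
boolean = hits('quick AND brown')  # both terms required
conn.close()
```

Relevance ranking is also available: appending ORDER BY rank to the query sorts matches best-first.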

