Fastest way to write database table to file in Python


Problem Description

I'm trying to extract a huge amount of data from a database and write it to a CSV file, and I'm trying to find the fastest way to do this. I found that running writerows on the result of a fetchall was 40% slower than the code below.

import csv

# `cursor` is assumed to come from an already-open DB-API connection.
with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    # Header row: column names from the cursor description.
    writer.writerow([i[0] for i in cursor.description])

    count = 0
    builder = []
    row = cursor.fetchone()
    # One '\t' between fields, '\n' after the last field of each row.
    DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
    while row:
        count += 1
        # Interleave the row's fields with the delimiters.
        builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
        if count == 1000:
            # Flush to the file every 1000 rows to keep the buffer small.
            count = 0
            f.write(''.join(builder))
            builder[:] = []
        row = cursor.fetchone()
    f.write(''.join(builder))

The database I'm using is unique to the small company I work for, so unfortunately I can't provide much information on that front. I'm using jpype to connect to the database, since the only way to connect is via a JDBC driver. I'm running CPython 2.7.5; I'd love to use PyPy, but it doesn't work with Pandas.
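
For readers in a similar situation: one common way to reach a JDBC-only database from CPython is the jaydebeapi package, which drives jpype under the hood. The sketch below is purely illustrative; the driver class, JDBC URL, credentials, and jar path are all hypothetical placeholders, since the question names none of them.

import jaydebeapi

# Every value here is a hypothetical placeholder; substitute the real
# driver class, URL, credentials, and jar for your database.
conn = jaydebeapi.connect(
    'com.example.jdbc.Driver',            # hypothetical JDBC driver class
    'jdbc:example://dbhost:1234/mydb',    # hypothetical JDBC URL
    ['user', 'password'],                 # driver_args: credentials
    '/path/to/driver.jar')                # jar containing the driver
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable')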

Since I'm extracting such a large number of rows, I'm hesitant to use fetchall for fear of running out of memory. row has comparable performance and is much easier on the eyes, so I think I'll use that. Thanks a bunch!

Recommended Answer

With the little you've given us to go on, it's hard to be more specific, but…

I've wrapped your code up as a function, and written three alternative versions:

def row():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        for row in cursor:
            writer.writerow(row)

def rows():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor)

def rowsall():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor.fetchall())

Note that the last one, rowsall, is the one you say you tried.

Now, I wrote this test driver:

import random
import sqlite3
import string
import timeit

def randomname():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(30))

# Build a 10,000-row in-memory database to benchmark against.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR)')
db.executemany('INSERT INTO mytable (name) VALUES (?)',
               [[randomname()] for _ in range(10000)])

filename = 'db.csv'

# Time a single run of each version.
for f in manual, row, rows, rowsall:
    t = timeit.timeit(f, number=1)
    print('{:<10} {}'.format(f.__name__, t))

The results are as follows:

manual     0.055549702141433954
row        0.03852885402739048
rows       0.03992213006131351
rowsall    0.02850699401460588

So, in my test, your code takes nearly twice as long as calling fetchall and writerows!

When I repeat a similar test against other databases, however, rowsall is anywhere from 20% faster to 15% slower than manual (never 40% slower, but as much as 15%)… but row or rows is always significantly faster than manual.

I think the explanation is that your custom code is significantly slower than csv.writerows, but that with some databases, using fetchall instead of fetchone (or just iterating the cursor) slows things down significantly. The reason this isn't true with an in-memory sqlite3 database is that there fetchone does all of the same work as fetchall and then feeds you the list one row at a time; with a remote database, fetchone may do anything from fetching all the rows, to fetching a buffer at a time, to fetching one row at a time, making it potentially much slower or much faster than fetchall, depending on your data.
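
Not part of the original answer, but if fetchall's memory footprint is the worry, the DB-API's fetchmany offers a middle ground: writerows speed with bounded-size batches. A sketch, reusing the same db and filename globals as above:

def rowsmany(batch_size=1000):
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        while True:
            # Fetch at most batch_size rows at a time, so memory stays bounded.
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            writer.writerows(batch)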

But for a really useful explanation, you'd have to tell us exactly which database and library you're using (and which Python version: CPython 3.3.2's csv module seems to be a lot faster than CPython 2.7.5's, and PyPy 2.1/2.7.2 seems to be faster than CPython 2.7.5 as well, but then either one might also run your code faster…) and so on.
