Merging two tables with millions of rows in Python


Problem description

I am using Python for some data analysis. I have two tables: the first (let's call it 'A') has 10 million rows and 10 columns, and the second ('B') has 73 million rows and 2 columns. The two tables share one column of common IDs, and I want to intersect them based on that column. In particular, I want the inner join of the tables.

I cannot load table B into memory as a pandas DataFrame to use the normal pandas merge function. I tried reading the file for table B in chunks, intersecting each chunk with A, and then concatenating these intersections (the output of the inner joins). This is OK on speed, but every now and then it spits out a segmentation fault, which is not so great. The error is difficult to reproduce, but it happens on two different machines (Mac OS X v10.6 (Snow Leopard) and UNIX, Red Hat Linux).
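
Roughly, the chunked attempt looks like the following sketch ('A.csv', 'B.csv', and the shared column name 'id' are placeholders for my actual data):

import pandas as pd

# A fits in memory; B is streamed in 1M-row chunks.
a = pd.read_csv('A.csv')

pieces = []
for chunk in pd.read_csv('B.csv', chunksize=1000000):
    # Inner join of each chunk of B against A on the shared id column.
    pieces.append(a.merge(chunk, on='id', how='inner'))

result = pd.concat(pieces, ignore_index=True)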

I finally tried a combination of pandas and PyTables, writing table B to disk and then iterating over table A and selecting the matching rows from table B. This last option works, but it is slow. Table B on PyTables has already been indexed by default.
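
That slow approach corresponds roughly to this sketch (again with placeholder names; it assumes B was appended to the store with data_columns=['id'] so the where clause can be evaluated by PyTables, and that the where string may reference the local variable `ids`):

import pandas as pd

store = pd.HDFStore('B.h5')

pieces = []
for chunk in pd.read_csv('A.csv', chunksize=100000):
    ids = chunk['id'].unique().tolist()
    # Select only the rows of B whose id appears in this chunk of A.
    b_rows = store.select('df', where='id in ids')
    pieces.append(chunk.merge(b_rows, on='id', how='inner'))

result = pd.concat(pieces, ignore_index=True)
store.close()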

How do I tackle this problem?

Answer

This is a little pseudo-codish, but I think it should be quite fast.

It is a straightforward disk-based merge, with all tables on disk. The key is that you are not doing selection per se, just indexing into the table via start/stop, which is quite fast.

Selecting the rows that meet a criterion in B (using A's IDs) won't be very fast, because I think it might be bringing the data into Python space rather than doing an in-kernel search (I am not sure, but you might want to investigate the in-kernel optimization section on pytables.org more; there is a way to tell whether a query will be in-kernel or not).
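
To illustrate the distinction, here is a hedged sketch against the raw PyTables API (tables.open_file and Table.where are real PyTables calls; the file name is made up, it assumes pandas stored the frame under key 'df' so the table node is at /df/table, and that 'id' was declared as a data column):

import tables

h5 = tables.open_file('B.h5', mode='r')
tbl = h5.root.df.table

# In-kernel: the condition is compiled by numexpr and evaluated inside
# the PyTables machinery; only matching rows cross into Python.
matches = [row['id'] for row in tbl.where('id > 1000')]

# Python space: every row is materialized and tested in Python -- much slower.
matches_slow = [row['id'] for row in tbl if row['id'] > 1000]

h5.close()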

Also, if you are up for it, this is a very parallel problem (just don't write the results to the same file from multiple processes; PyTables is not write-safe for that).
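
A minimal sketch of that parallelization, under the same assumptions as the code below (HDF5 files 'A.h5' and 'B.h5' with key 'df', and 'id' as the shared column): each worker handles one chunk of A and writes to its own output file, which sidesteps the write-safety issue; the per-worker files can be concatenated afterwards.

from multiprocessing import Pool
import pandas as pd

CHUNK = 1000000

def process_a_chunk(i):
    # Each worker reads its own slice of A and writes to its own file,
    # so no two processes ever write to the same HDF5 file.
    a = pd.read_hdf('A.h5', 'df', start=i * CHUNK, stop=(i + 1) * CHUNK)
    with pd.HDFStore('result_%d.h5' % i) as out, pd.HDFStore('B.h5') as B:
        nrows_b = B.get_storer('df').nrows
        for j in range(nrows_b // CHUNK + 1):
            b = B.select('df', start=j * CHUNK, stop=(j + 1) * CHUNK)
            m = a.merge(b, on='id', how='inner')  # 'id' is the assumed key column
            if len(m):
                out.append('df_result', m)

if __name__ == '__main__':
    n_a_chunks = 10  # 10M rows of A in 1M-row chunks
    with Pool(4) as pool:
        pool.map(process_a_chunk, range(n_a_chunks))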

See this answer for a comment on how doing a join operation will actually be an 'inner' join.

For your merge_a_b operation, I think you can use a standard pandas join, which is quite efficient when in-memory.

One other option (depending on how 'big' A is) might be to separate A into two pieces that are indexed the same, using a smaller table (maybe a single column) as the first piece; instead of storing the merge results per se, store the row index, and later pull out the data you need (kind of like using an indexer and take); a sketch follows the code block below. See http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries

from pandas import HDFStore

A = HDFStore('A.h5')
B = HDFStore('B.h5')
store = HDFStore('result.h5')  # this is your result store

nrows_a = A.get_storer('df').nrows
nrows_b = B.get_storer('df').nrows
a_chunk_size = 1000000
b_chunk_size = 1000000

def merge_a_b(a, b):
    # Returns an operation on the passed frames, a and b.
    # It could be a merge, join, concat, or other operation that
    # results in a single frame; e.g. a standard in-memory inner
    # join on the shared key ('id' is assumed here -- substitute
    # your actual column name).
    return a.merge(b, on='id', how='inner')

for i in range(nrows_a // a_chunk_size + 1):

    a_start_i = i * a_chunk_size
    a_stop_i = min((i + 1) * a_chunk_size, nrows_a)

    # Plain positional indexing into the table via start/stop -- fast,
    # no selection logic involved.
    a = A.select('df', start=a_start_i, stop=a_stop_i)

    for j in range(nrows_b // b_chunk_size + 1):

        b_start_i = j * b_chunk_size
        b_stop_i = min((j + 1) * b_chunk_size, nrows_b)

        b = B.select('df', start=b_start_i, stop=b_stop_i)

        m = merge_a_b(a, b)

        # Append non-empty merge results to the result store.
        if len(m):
            store.append('df_result', m)
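
And a hedged sketch of the row-index variant mentioned above: pandas exposes HDFStore.select_as_coordinates, which returns the row numbers matching a query rather than the data itself, so you can store just those coordinates and pull the full rows later. Here 'df_small' and 'df_wide' stand for the two identically-indexed pieces of A, 'id' is assumed to be a data column, and the ids are illustrative:

from pandas import HDFStore

store = HDFStore('A.h5')

ids_of_interest = [1, 2, 3]  # e.g. ids obtained from merging against B

# Query the small, single-column table and keep only the row coordinates.
coords = store.select_as_coordinates('df_small', 'id in ids_of_interest')

# Both pieces are indexed the same, so coordinates from the small table
# address the same records in the wide table.
rows = store.select('df_wide', where=coords)
store.close()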

