Merging two tables with millions of rows in Python

Question

I am using Python for some data analysis. I have two tables: the first (let's call it 'A') has 10 million rows and 10 columns, and the second ('B') has 73 million rows and 2 columns. They have one column of common IDs, and I want to intersect the two tables on that column. In particular, I want the inner join of the tables.

I could not load table B into memory as a pandas DataFrame in order to use the normal merge function in pandas. I tried reading the file for table B in chunks, intersecting each chunk with A, and concatenating these intersections (the output of the inner joins). This is OK on speed, but every now and then it gives me problems and spits out a segmentation fault... not so great. The error is difficult to reproduce, but it happens on two different machines (Mac OS X v10.6 (Snow Leopard) and UNIX, Red Hat Linux).
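
For reference, a minimal sketch of that chunked approach (the file names and the shared 'id' column are assumptions, not given in the question):

import pandas as pd

# Table A (10M x 10) is assumed to fit in memory
a = pd.read_csv('A.csv')

pieces = []
# Stream table B (73M x 2) in chunks instead of loading it whole
for b_chunk in pd.read_csv('B.csv', chunksize=1000000):
    # Inner-join each chunk against A on the shared id column
    pieces.append(a.merge(b_chunk, on='id', how='inner'))

result = pd.concat(pieces, ignore_index=True)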

I finally tried a combination of pandas and PyTables, writing table B to disk and then iterating over table A and selecting the matching rows from table B. This last option works, but it is slow. Table B in PyTables is already indexed by default.
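
That approach looks roughly like this (a sketch with made-up store and column names; 'id' would need to be stored as a data column for the where query to work):

import pandas as pd

A = pd.HDFStore('A.h5')
B = pd.HDFStore('B.h5')
out = pd.HDFStore('joined.h5')

# Walk A in chunks and query B's indexed id column for each chunk
for a in A.select('df', chunksize=1000000):
    ids = a['id'].unique()
    # Every where query goes back through B's index; repeating this
    # for each chunk of A is what makes this option slow (and very
    # large id lists can exceed what a single query accepts)
    b = B.select('df', where='id in ids')
    m = a.merge(b, on='id', how='inner')
    if len(m):
        out.append('df', m)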

How do I tackle this problem?

Answer

This is a little pseudo-codish, but I think it should be quite fast.

It is a straightforward disk-based merge, with all tables on disk. The key is that you are not doing selection per se, just indexing into the table via start/stop, which is quite fast.

Selecting the rows that meet a criterion in B (using A's IDs) won't be very fast, because I think it might be bringing the data into Python space rather than doing an in-kernel search (I am not sure, but you may want to investigate the in-kernel optimization section on pytables.org further; there is a way to tell whether a query will run in-kernel or not).

Also, if you are up for it, this is a very parallelizable problem (just don't write the results to the same file from multiple processes; PyTables is not write-safe for that).
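
One possible shape for that, sketched with multiprocessing (same made-up file names and assumed 'id' key as above; each worker writes its own result file precisely because of the write-safety caveat):

import pandas as pd
from multiprocessing import Pool

CHUNK = 1000000

def process_chunk(i):
    # Each worker opens its own read-only handles; open HDFStore
    # objects must not be shared across processes
    A = pd.HDFStore('A.h5', mode='r')
    B = pd.HDFStore('B.h5', mode='r')
    out = pd.HDFStore('result_%d.h5' % i, mode='w')
    a = A.select('df', start=i * CHUNK, stop=(i + 1) * CHUNK)
    for b in B.select('df', chunksize=CHUNK):
        m = a.merge(b, on='id', how='inner')  # assumed shared 'id' key
        if len(m):
            out.append('df_result', m)
    A.close(); B.close(); out.close()

if __name__ == '__main__':
    with pd.HDFStore('A.h5', mode='r') as A:
        nrows_a = A.get_storer('df').nrows
    with Pool(4) as p:
        p.map(process_chunk, range(nrows_a // CHUNK + 1))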

See this answer for a comment on how this join operation will actually be an 'inner' join.

For your merge_a_b operation, I think you can use a standard pandas join, which is quite efficient (when in memory).
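
For instance, a minimal body for the merge_a_b stub in the code below (the shared key column is assumed to be named 'id', which the question does not specify):

def merge_a_b(a, b):
    # Plain in-memory pandas inner join on the assumed 'id' column
    return a.merge(b, on='id', how='inner')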

One other option (depending on how 'big' A is) might be to separate A into two pieces (that are indexed the same), using the smaller piece (maybe a single column) in the first table; instead of storing the merge results per se, store the row index; later you can pull out the data you need (kind of like using an indexer and take). See http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries
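
A rough sketch of that index-only variant, under heavy assumptions: A is stored as two identically indexed tables, a narrow 'df_key' holding just the join column and a wide 'df_wide' holding the rest, and the key is again called 'id':

import pandas as pd

A = pd.HDFStore('A.h5')
B = pd.HDFStore('B.h5')

idx_pieces = []
for a in A.select('df_key', chunksize=1000000):
    for b in B.select('df', chunksize=1000000):
        # Carry A's row labels through the merge as a column,
        # and keep only the labels of the matches, not the data
        m = a.reset_index().merge(b, on='id', how='inner')
        idx_pieces.append(m['index'])

matched = pd.concat(idx_pieces).unique()

# Pull the full rows from the wide table only at the end
# (very large label lists can exceed what a where query accepts)
result = A.select('df_wide', where='index in matched')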

import pandas as pd

A = pd.HDFStore('A.h5')
B = pd.HDFStore('B.h5')

# This is your result store
store = pd.HDFStore('result.h5')

nrows_a = A.get_storer('df').nrows
nrows_b = B.get_storer('df').nrows
a_chunk_size = 1000000
b_chunk_size = 1000000

def merge_a_b(a, b):
    # Function that returns an operation on the passed
    # frames, a and b. It could be a merge, join, concat,
    # or other operation that results in a single frame
    # (see the sketch above for one possibility).
    ...

for i in range(nrows_a // a_chunk_size + 1):

    a_start_i = i * a_chunk_size
    a_stop_i = min((i + 1) * a_chunk_size, nrows_a)

    # Pull a chunk of A by row position (start/stop); no query involved
    a = A.select('df', start=a_start_i, stop=a_stop_i)

    for j in range(nrows_b // b_chunk_size + 1):

        b_start_i = j * b_chunk_size
        b_stop_i = min((j + 1) * b_chunk_size, nrows_b)

        # Pull the corresponding chunk of B the same way
        b = B.select('df', start=b_start_i, stop=b_stop_i)

        m = merge_a_b(a, b)

        if m is not None and len(m):
            store.append('df_result', m)
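
Note that the inner loop re-reads all of B once per chunk of A, so the fewer (larger) chunks you can afford for A, the fewer total passes over B you pay for; the chunk sizes above are just placeholders to tune against your available RAM.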
