NumPy: how to left join arrays with duplicates

Question

To use Cython, I need to convert df1.merge(df2, how='left') (using Pandas) to plain NumPy. However, I found that numpy.lib.recfunctions.join_by(key, r1, r2, jointype='leftouter') does not support duplicate values along key. Is there any way to work around this?

Recommended answer

Here's a stab at a pure NumPy left join that can handle duplicate keys:

import numpy as np

def join_by_left(key, r1, r2, mask=True):
    # figure out the dtype of the result array
    descr1 = r1.dtype.descr
    descr2 = [d for d in r2.dtype.descr if d[0] not in r1.dtype.names]
    descrm = descr1 + descr2 

    # figure out the fields we'll need from each array
    f1 = [d[0] for d in descr1]
    f2 = [d[0] for d in descr2]

    # cache the number of columns in f1
    ncol1 = len(f1)

    # get a dict of the rows of r2 grouped by key
    rows2 = {}
    for row2 in r2:
        rows2.setdefault(row2[key], []).append(row2)

    # figure out how many rows will be in the result
    nrowm = 0
    for k1 in r1[key]:
        if k1 in rows2:
            nrowm += len(rows2[k1])
        else:
            nrowm += 1

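    # allocate the return array, zero-filled so that unmatched rows hold 0
    # (np.recarray(nrowm, dtype=descrm) alone would leave the memory uninitialized)
    _ret = np.zeros(nrowm, dtype=descrm).view(np.recarray)
    if mask:
        ret = np.ma.array(_ret, mask=True)
    else:
        ret = _ret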

    # merge the data into the return array
    i = 0
    for row1 in r1:
        if row1[key] in rows2:
            for row2 in rows2[row1[key]]:
                ret[i] = tuple(row1[f1]) + tuple(row2[f2])
                i += 1
        else:
            for j in range(ncol1):
                ret[i][j] = row1[j]
            i += 1

    return ret

Basically, it uses a plain dict to do the actual join operation. Like numpy.lib.recfunctions.join_by, this function returns a masked array by default. When a key from the left array is missing from the right array, the fields that come from the right array are masked out in that row of the result. If you would prefer a record array instead (in which all of the missing data is set to 0), you can pass mask=False when calling join_by_left.
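The dict-based grouping described above can also be tried in isolation on a small example. The following is a minimal sketch, not the answer's function: it is an index-based variation that records which left and right rows pair up and then gathers one right-hand column into a masked array. The sample arrays and the field names id, a and b are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical sample data; field names "id", "a", "b" are illustrative only.
r1 = np.array([(1, 10.0), (2, 20.0), (3, 30.0)],
              dtype=[("id", "i8"), ("a", "f8")])
r2 = np.array([(1, 100.0), (1, 101.0), (3, 300.0)],
              dtype=[("id", "i8"), ("b", "f8")])

# Group the row indices of the right array by key value.
groups = {}
for idx, k in enumerate(r2["id"].tolist()):
    groups.setdefault(k, []).append(idx)

# For every left row, emit one (left, right) index pair per matching
# right row, or a single pair with right index -1 when nothing matches.
left_idx, right_idx = [], []
for i, k in enumerate(r1["id"].tolist()):
    for j in groups.get(k, [-1]):
        left_idx.append(i)
        right_idx.append(j)

left_idx = np.array(left_idx)
right_idx = np.array(right_idx)
matched = right_idx >= 0

# Gather the columns; the right-hand column stays masked where unmatched.
joined_id = r1["id"][left_idx]
joined_a = r1["a"][left_idx]
joined_b = np.ma.masked_all(len(left_idx), dtype="f8")
joined_b[matched] = r2["b"][right_idx[matched]]
```

With this data, key 1 appears twice in r2, so the first left row is emitted twice, while key 2 has no match and yields a single row whose b value is masked. Keeping index arrays instead of copying row by row is one possible design choice when the gather step should stay in NumPy.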
