NumPy: how to left join arrays with duplicates

Views: 33 | Published: 2021/4/28 18:35:20 | Tags: python, pandas, numpy, cython

Question

To use Cython, I need to convert df1.merge(df2, how='left') (pandas) to plain NumPy. However, I found that numpy.lib.recfunctions.join_by(key, r1, r2, jointype='leftouter') doesn't support duplicates along key. Is there a way around this?

Recommended answer

Here's a stab at a pure NumPy left join that can handle duplicate keys:

import numpy as np

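def join_by_left(key, r1, r2, mask=True):
    """Left-join structured arrays r1 and r2 on the field named `key`.

    Duplicate keys are allowed on both sides. Returns a masked array by
    default, or a plain record array when mask=False.
    """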
    # figure out the dtype of the result array
    descr1 = r1.dtype.descr
    descr2 = [d for d in r2.dtype.descr if d[0] not in r1.dtype.names]
    descrm = descr1 + descr2 

    # figure out the fields we'll need from each array
    f1 = [d[0] for d in descr1]
    f2 = [d[0] for d in descr2]

    # cache the number of columns in f1
    ncol1 = len(f1)

    # get a dict of the rows of r2 grouped by key
    rows2 = {}
    for row2 in r2:
        rows2.setdefault(row2[key], []).append(row2)

    # figure out how many rows will be in the result
    nrowm = 0
    for k1 in r1[key]:
        if k1 in rows2:
            nrowm += len(rows2[k1])
        else:
            nrowm += 1

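    # allocate the return array, zero-filled so that unmatched fields
    # hold 0 rather than uninitialized memory
    _ret = np.zeros(nrowm, dtype=descrm).view(np.recarray)
    if mask:
        # start fully masked; rows are unmasked as they are filled in
        ret = np.ma.array(_ret, mask=True)
    else:
        ret = _ret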

    # merge the data into the return array
    i = 0
    for row1 in r1:
        if row1[key] in rows2:
            for row2 in rows2[row1[key]]:
                ret[i] = tuple(row1[f1]) + tuple(row2[f2])
                i += 1
        else:
            for j in range(ncol1):
                ret[i][j] = row1[j]
            i += 1

    return ret
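
To see the grouping and row-counting steps of the function above in isolation, here is a minimal sketch on made-up data. The arrays r1 and r2 and the field names k, a and b are assumptions for illustration only, and the dict stores row indices rather than whole rows to keep the output readable.

```python
import numpy as np

# Hypothetical sample data: "k" is the join key and appears twice in r2.
r1 = np.array([(1, 10.0), (2, 20.0), (3, 30.0)],
              dtype=[("k", "i8"), ("a", "f8")])
r2 = np.array([(1, 100.0), (1, 101.0), (3, 300.0)],
              dtype=[("k", "i8"), ("b", "f8")])

# Group the row indices of r2 by key value.
groups = {}
for idx, k in enumerate(r2["k"].tolist()):
    groups.setdefault(k, []).append(idx)

# Each left row yields one output row per match, or one row if unmatched.
n_rows = sum(len(groups.get(k, [None])) for k in r1["k"].tolist())

print(groups)  # {1: [0, 1], 3: [2]}
print(n_rows)  # 4
```

Key 1 matches two rows of r2, key 2 matches none but still keeps its row, and key 3 matches one, which gives the four output rows.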

Basically, it uses a plain dict to do the actual join operation. Like numpy.lib.recfunctions.join_by, this function returns a masked array by default. When a key from the left array is missing from the right array, the right-hand fields of that row are masked out in the result. If you would prefer a plain record array instead (in which all of the missing data is set to 0), pass mask=False when calling join_by_left.
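
Since the goal is Cython, it may also help to avoid the Python-level dict and loops. The following is a sketch of a different technique, not the approach used in the answer: a sort-based join on a single 1-D key using np.argsort, np.searchsorted and np.repeat. The helper name left_join_indices and the use of -1 to mark unmatched rows are assumptions made for this example.

```python
import numpy as np

def left_join_indices(left_keys, right_keys):
    """Return (left_idx, right_idx) row pairs for a left join on 1-D keys.

    right_idx is -1 where a left key has no match (hypothetical convention).
    """
    left_keys = np.asarray(left_keys)
    right_keys = np.asarray(right_keys)

    # Sort the right keys once so every run of equal keys is contiguous.
    order = np.argsort(right_keys, kind="stable")
    sorted_right = right_keys[order]

    # For each left key, locate the [start, stop) run of equal right keys.
    start = np.searchsorted(sorted_right, left_keys, side="left")
    stop = np.searchsorted(sorted_right, left_keys, side="right")
    n_match = stop - start

    # Unmatched left rows still produce exactly one output row.
    n_out = np.maximum(n_match, 1)
    left_idx = np.repeat(np.arange(left_keys.size), n_out)

    # Offset of each output row within its own left row's group.
    first_out = np.cumsum(n_out) - n_out
    within = np.arange(n_out.sum()) - np.repeat(first_out, n_out)

    # Map back to original right-row positions; misses stay at -1.
    matched = np.repeat(n_match > 0, n_out)
    pos = np.repeat(start, n_out) + within
    right_idx = np.full(left_idx.size, -1, dtype=np.intp)
    right_idx[matched] = order[pos[matched]]
    return left_idx, right_idx

left_idx, right_idx = left_join_indices([1, 2, 3, 1], [1, 1, 3])
print(left_idx.tolist())   # [0, 0, 1, 2, 3, 3]
print(right_idx.tolist())  # [0, 1, -1, 2, 0, 1]
```

With these index arrays, the joined rows could be assembled by taking r1[left_idx] for the left fields and r2[right_idx] for the right fields, then masking or zeroing the right fields wherever right_idx is -1.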
