Merging dataframes on an index is more efficient in Pandas

Question

Why is merging dataframes in Pandas on an index more efficient (faster) than on a column?

import pandas as pd

# Dataframes share the ID column
df = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                   'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})

df2 = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                    'Level': [12, 15, 14, 20, 21, 11, 15], 
                    'Age': [33, 41, 42, 50, 45, 28, 32]})

df = df.set_index('ID')
df2 = df2.set_index('ID')

Merging on the index gives about a 3.5x speedup over merging on the ID column (using Pandas 0.23.0).
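The timing commands themselves are not shown here. As a rough sketch of how such a comparison might be run, assuming the two cases are `merge(on='ID')` on the plain frames versus `merge(left_index=True, right_index=True)` on the indexed frames (the names `left`, `right`, `left_idx`, `right_idx` and the repetition count are illustrative, not from the question):

```python
import timeit

import pandas as pd

# Same sample data as in the question, rebuilt so the sketch is self-contained.
left = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                     'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})
right = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                      'Level': [12, 15, 14, 20, 21, 11, 15],
                      'Age': [33, 41, 42, 50, 45, 28, 32]})

# Indexed copies, so both variants can be timed side by side.
left_idx = left.set_index('ID')
right_idx = right.set_index('ID')


def merge_on_column():
    # Inner join using the shared 'ID' column as the key.
    return left.merge(right, on='ID')


def merge_on_index():
    # Inner join using the 'ID' index of both frames as the key.
    return left_idx.merge(right_idx, left_index=True, right_index=True)


# Assumption: 200 repetitions is enough to smooth out noise on tiny frames.
t_col = timeit.timeit(merge_on_column, number=200)
t_idx = timeit.timeit(merge_on_index, number=200)
print(f"column merge: {t_col:.4f}s, index merge: {t_idx:.4f}s")
```

The ratio will vary with the Pandas version, the data size, and the machine, so the 3.5x figure should not be expected to reproduce exactly.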

The Pandas internals page says that an Index "Populates a dict of label to location in Cython to do O(1) lookups." Does this mean that operations on an index are more efficient than operations on columns? Is it a best practice to always use the index for operations such as merges?

I read through the documentation for joining and merging and it doesn't explicitly mention any benefits to using the index.

Recommended answer

The reason for this is that the DataFrame's index is backed by a hash table.

To merge two sets, we need to find, for each element of the first, the corresponding element in the second (if it exists). Searching is significantly faster when it is backed by a hash table, because searching an unsorted list is O(N), while a lookup in a hash table is ~O(1).
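To make the difference concrete, here is a minimal sketch comparing a linear scan with a hash-table lookup, alongside `pd.Index.get_loc`, which returns a label's position. The helper `find_in_list` and the dict `position_of` are illustrative names, and the plain Python dict only stands in for the idea; the real Pandas index engine is implemented in Cython and may pick other strategies internally.

```python
import pandas as pd

ids = [2, 3, 4, 5, 6, 7, 8]


def find_in_list(values, key):
    """Linear scan: checks each element in turn, so the cost grows as O(N)."""
    for pos, value in enumerate(values):
        if value == key:
            return pos
    return None


# Hash table from label to position: O(N) to build once,
# then each lookup is ~O(1) on average.
position_of = {value: pos for pos, value in enumerate(ids)}

# A Pandas Index exposes the same label-to-position lookup.
index = pd.Index(ids)

print(find_in_list(ids, 6))   # 4
print(position_of.get(6))     # 4
print(index.get_loc(6))       # 4
```

All three calls return the same position; only the amount of work done per lookup differs.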

One strategy that could make merging on columns faster would be to first build a hash table for the smaller of the two. Still, that means the merge will be slower by the time it takes to build this hash table.
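That strategy is essentially a hash join. A minimal pure-Python sketch of the idea, not Pandas' actual implementation, might look like this (the function `hash_join` and the row-as-dict representation are assumptions made for illustration):

```python
def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key`, hashing the smaller side."""
    # Build the hash table on whichever input has fewer rows.
    if len(left_rows) <= len(right_rows):
        build, probe, build_is_left = left_rows, right_rows, True
    else:
        build, probe, build_is_left = right_rows, left_rows, False

    # Build phase: O(len(build)). This is the extra cost that an
    # already-indexed frame does not have to pay at merge time.
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)

    # Probe phase: one ~O(1) lookup per row of the larger input.
    joined = []
    for row in probe:
        for match in table.get(row[key], []):
            left_row, right_row = (match, row) if build_is_left else (row, match)
            joined.append({**left_row, **right_row})
    return joined


jobs = [{'ID': 0, 'Job': 'teacher'}, {'ID': 1, 'Job': 'scientist'},
        {'ID': 2, 'Job': 'manager'}, {'ID': 3, 'Job': 'teacher'},
        {'ID': 4, 'Job': 'nurse'}]
levels = [{'ID': 2, 'Level': 12, 'Age': 33}, {'ID': 3, 'Level': 15, 'Age': 41},
          {'ID': 4, 'Level': 14, 'Age': 42}, {'ID': 5, 'Level': 20, 'Age': 50},
          {'ID': 6, 'Level': 21, 'Age': 45}, {'ID': 7, 'Level': 11, 'Age': 28},
          {'ID': 8, 'Level': 15, 'Age': 32}]

result = hash_join(jobs, levels, 'ID')
print([row['ID'] for row in result])  # [2, 3, 4]
```

Hashing the smaller input keeps the build phase cheap, but the build still has to happen on every merge unless the table already exists.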
