How to efficiently count the number of children for each element in a column?


Question

I have a dataframe df as follows.

    parent_id   name
0   t3_35jfjt   t1_cr4y72v
1   t3_35jfjt   t1_cr4y7m7
2   t3_35jfjt   t1_cr4y7p3
3   t1_cr4y72v  t1_cr4y92z
4   t3_35jfjt   t1_cr4y986
... ...         ...

All elements in the name column are unique. I would like to create a dictionary whose keys are the elements of the name column. The value for each key is the number of times that key appears in the parent_id column; if it never appears in parent_id, its value is 0.

My current approach is below, but it is not efficient because I have over 3 million rows. Could you suggest a more efficient method?

import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/WebMining/main/df.csv', header = 0)

# Create df2 to contain the counts
df2 = df.groupby(by = 'parent_id', as_index = False).size()

# Join df2 and df based on column "parent_id"
df3 = pd.merge(df, df2, how = 'left', left_on= 'name', right_on= 'parent_id')

# Replace NaN with 0
df4 = df3.fillna(0).rename(columns = {'size': 'num_siblings'})
df5 = df4[['name', 'num_siblings']]

# My expected dictionary
df5.set_index('name').T.to_dict('records')[0]

{'t1_cr4y72v': 27.0,
 't1_cr4y7m7': 26.0,
 't1_cr4y7p3': 148.0,
 't1_cr4y92z': 0.0,
 't1_cr4y986': 43.0,
 't1_cr4ya0g': 11.0,
 't1_cr4yai8': 1.0,
....

Recommended answer

Do you want something like this?

import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/WebMining/main/df.csv', header = 0)

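# Count the rows per parent_id; the result is a Series indexed by parent_id
df2 = df.groupby(by = 'parent_id').size()

# Look up the count for every name; names that are never a parent get 0
df2.reindex(df['name'], fill_value=0).to_dict()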
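
The same idea can be tried without downloading the CSV. The sketch below is only an illustration: the `sample` frame is a hypothetical stand-in built from the five rows shown in the question, and it swaps `groupby(...).size()` for `Series.value_counts()`, which also yields a Series of counts indexed by `parent_id`. The names `sample`, `child_counts` and `num_children` are assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample mirroring the five rows shown in the question.
sample = pd.DataFrame({
    "parent_id": ["t3_35jfjt", "t3_35jfjt", "t3_35jfjt", "t1_cr4y72v", "t3_35jfjt"],
    "name": ["t1_cr4y72v", "t1_cr4y7m7", "t1_cr4y7p3", "t1_cr4y92z", "t1_cr4y986"],
})

# Count how often each id occurs as a parent (Series indexed by parent_id).
child_counts = sample["parent_id"].value_counts()

# Align the counts to the names; names that never occur as a parent get 0.
num_children = child_counts.reindex(sample["name"], fill_value=0).to_dict()
print(num_children)
```

Because the missing names are filled with 0 during `reindex` rather than with NaN after a merge, the counts stay integers instead of becoming floats such as 27.0.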
