How to efficiently count the number of larger elements for every element in another column?


Problem description

I have the following df:

    name        created_utc
0   t1_cqug90j  1430438400
1   t1_cqug90k  1430438400
2   t1_cqug90z  1430438400
3   t1_cqug91c  1430438401
4   t1_cqug91e  1430438401
... ...         ...

in which all values in column name are unique. I would like to create a dictionary whose keys are the elements of column name. The value for each such key is the number of elements in column created_utc greater than that of the key. My expected result is something like

{'t1_cqug90j': 6, 't1_cqug90k': 0, 't1_cqug90z': 3, ...} 

In this case, there are 6 elements in column created_utc greater than 1430438400, which is the value corresponding to t1_cqug90j. I could write a loop to generate such a dictionary, but a loop is not efficient in my case with more than 3 million rows.
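For reference, a minimal sketch of the kind of loop meant here (assuming df as loaded below); it compares every row against all the others, which is what makes it too slow at this scale:

# naive loop (illustration only): for each row, count the strictly larger timestamps
# this is quadratic in the number of rows, hence too slow for ~3 million rows
result = {}
for name, utc in zip(df['name'], df['created_utc']):
    result[name] = int((df['created_utc'] > utc).sum())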

Could you please elaborate?

import pandas as pd
import numpy as np

# load the data and keep only the two relevant columns
df = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/WebMining/main/df1.csv', header=0)[['name', 'created_utc']]
df

Recommended answer

This is one possible approach. Let's first assume that your name column is unique-valued. Then we can count the created_utc values like this:

# count the rows for each created_utc value
count_utc = df.groupby('created_utc').size()
# cumulative number of rows with a strictly smaller created_utc
cumulative_counts = count_utc.shift(fill_value=0).cumsum()

# look up each row's cumulative count and key it by name
output = dict(zip(df['name'], df['created_utc'].map(cumulative_counts)))

The first few entries of the output then look like this:

{'t1_cqug90j': 0,
 't1_cqug90k': 0,
 't1_cqug90z': 0,
 't1_cqug91c': 3,
 't1_cqug91e': 3,
 't1_cqug920': 5
 ...
}
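If the goal is instead the count of strictly larger created_utc values, as in the expected output in the question, the same cumulative-count idea can be flipped; a minimal sketch (not part of the original answer, reusing count_utc from above):

# total rows minus the cumulative count up to and including each timestamp
# gives the number of rows with a strictly larger created_utc
greater_counts = len(df) - count_utc.cumsum()
output_greater = dict(zip(df['name'], df['created_utc'].map(greater_counts)))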

Now, if the names are not unique (which is unlikely given your expected output), we can take the cumulative_counts and subtract the size of each name group, adding one back, something like this:

# for duplicated names: subtract the size of each name group, then add one back
output = dict(zip(df['name'],
                  df['created_utc'].map(cumulative_counts)
                      .sub(df.groupby('name')['name'].transform('size'))
                      .add(1)))
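Either way, the heavy lifting is a single groupby plus a vectorized map over the column, so there is no Python-level loop over the 3 million rows.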

