How to efficiently count the number of larger elements for every element in another column?
Problem description
I have the following df
name created_utc
0 t1_cqug90j 1430438400
1 t1_cqug90k 1430438400
2 t1_cqug90z 1430438400
3 t1_cqug91c 1430438401
4 t1_cqug91e 1430438401
... ... ...
in which all values in the name column are unique. I would like to create a dictionary whose keys are the elements of the name column. The value for each key is the number of elements in the created_utc column that are greater than that key's own created_utc. My expected result is something like
{'t1_cqug90j': 6, 't1_cqug90k': 0, 't1_cqug90z': 3, ...}
In this case, there are 6 elements in the created_utc column greater than 1430438400, so 6 is the value for t1_cqug90j. I could build such a dictionary with a loop, but a loop is not efficient in my case, with more than 3 million rows.
Could you please elaborate on how to do this efficiently?
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/WebMining/main/df1.csv', header = 0)[['name', 'created_utc']]
df
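Recommended answer
This is one possible approach. Let's first assume that the name column holds unique values. We can then count, for each row, how many created_utc values are strictly smaller than its own, like this (for the number of larger values, use len(df) - count_utc.cumsum() as the lookup table instead):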
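# Number of rows per distinct timestamp (groupby sorts the index ascending)
count_utc = df.groupby('created_utc').size()
# Rows with a strictly smaller timestamp: shift by one group, then accumulate
cumulative_counts = count_utc.shift(fill_value=0).cumsum()
output = dict(zip(df['name'], df['created_utc'].map(cumulative_counts)))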
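The question asks for the number of strictly larger timestamps, which is the mirror image of the shifted cumulative sum. A minimal sketch of that variant follows, assuming a small made-up frame (the names and timestamps are hypothetical, not taken from the linked CSV) and the illustrative variable name larger_counts:

```python
import pandas as pd

# Toy data (hypothetical), shaped like the frame in the question
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "created_utc": [10, 10, 10, 11, 11, 12],
})

# Rows per distinct timestamp; groupby sorts the index ascending
count_utc = df.groupby("created_utc").size()

# cumsum() gives the number of rows with a timestamp <= each value,
# so subtracting it from the total leaves the strictly larger ones
larger_counts = len(df) - count_utc.cumsum()

# .tolist() converts the NumPy integers to plain Python ints
output = dict(zip(df["name"], df["created_utc"].map(larger_counts).tolist()))
```

Here the three rows with timestamp 10 each have three larger timestamps after them, and the single row with timestamp 12 has none.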
The first few entries of the output then look like this:
{'t1_cqug90j': 0,
't1_cqug90k': 0,
't1_cqug90z': 0,
't1_cqug91c': 3,
't1_cqug91e': 3,
't1_cqug920': 5
...
}
Now, if the names are not unique (which seems unlikely given your expected output), one option (I'm not sure this is what you want) is to take the cumulative_counts value, subtract the number of rows that share the same name, and add 1, something like this:
output = dict(zip(df['name'],
df['created_utc'].map(cumulative_counts)
.sub(df.groupby('name')['name'].transform('size'))
.add(1)
) )
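As an alternative to the groupby approach, the count of strictly larger values can also be read off a sorted copy of the column with a binary search. This is a different technique from the one in the answer, sketched here with NumPy's np.searchsorted on made-up data (the frame and the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), deliberately not sorted by timestamp
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "created_utc": [12, 10, 11, 10, 11, 10],
})

values = df["created_utc"].to_numpy()
sorted_values = np.sort(values)

# side="right" returns, for each value, the index of the first
# sorted element that is strictly greater than it
first_greater = np.searchsorted(sorted_values, values, side="right")

# Everything from that index to the end of the sorted array is larger
larger = len(values) - first_greater

output = dict(zip(df["name"], larger.tolist()))
```

Sorting costs O(n log n) and each lookup is a binary search, so this avoids the quadratic cost of comparing every row against every other row.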