Getting quartiles in pandas when data has low variance


Question

I'm not sure if my problem is solvable, but I thought I'd try; a search gave no results, at any rate.

The task: I have a large-ish dataset of approximately 40k elements. Each element is rated for familiarity by raters (i.e. if an item has a rating of 0.75, this means 75% of raters were familiar with it). I want to divide this data into 4 equally sized bins. The natural way to do this is with the pandas 'quantile' function to get the quartile boundaries.

The problem: 53% of my data is known to 100% of my participants. This means that two of my quantiles have the same value. As a result, feeding the results of the quantile function into my code leaves one of the bins empty, because the first bin (the top quartile) takes all of the tied values (see code below).
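A minimal sketch of the effect described above, using made-up toy data rather than the real dataset (the values and the variable names `known`, `cuts`, and `middle_upper` are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data: more than half of the items are known to everyone
known = pd.Series([1.0] * 6 + [0.8, 0.6, 0.4, 0.2])

# Quartile cut points, indexed by the requested quantile
cuts = known.quantile([0.25, 0.50, 0.75])
print(cuts)

# The 0.50 and 0.75 cut points coincide, so no value can satisfy
# q50 <= x < q75 and that bin stays empty
middle_upper = known[(known >= cuts.loc[0.50]) & (known < cuts.loc[0.75])]
print(len(middle_upper))
```

With this toy data the top bin absorbs every tied value, which is the same pattern as in the real data.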

Does anyone know of a way to split my data into four even groups, even if all the data in two groups has the same value? I'd like to re-use this code, so a kludge like specifying a particular index range to pick out a quarter of the data would make it too specific to this dataset.

Many thanks!

import pandas as pd

data3 = pd.read_csv('filepath.csv')

# Compute the quartile cut points once instead of on every iteration
known = data3['Percent_known']
q25, q50, q75 = known.quantile([0.25, 0.50, 0.75])

# Empty lists to take the binned words
well = []     # Well-known elements
medwell = []  # Medium-well-known elements
med = []      # Medium-known elements
low = []      # Rarely known elements

# Binning of data by familiarity
for word, pct in zip(data3['Word'], known):
    if pct >= q75:
        well.append(word)
    elif pct >= q50:
        medwell.append(word)
    elif pct >= q25:
        med.append(word)
    else:
        low.append(word)

Recommended answer

I would add a small, random jitter to Percent_known. This breaks the ties, so the items known to 100% of raters get sorted (randomly) into different quantiles.

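import numpy as np
import pandas as pd

# create data
df = pd.DataFrame([1, 1, 1, 1, 0.5, 0.5, 0, 0], columns=['known'])

# add uniform jitter in [-0.005, 0.005) to break ties
df['fudge'] = df.known + 0.01 * (np.random.rand(len(df)) - 0.5)

# select the rows whose jittered value falls in the top quartile
df.known[df.fudge > df.fudge.quantile(0.75)]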

The last line selects the top quarter of all items, drawn at random from among those known to 100% of raters.
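Because the jitter is random, the selected rows change from run to run. A possible variation, sketched here under the assumption that repeatable results are wanted, is to draw the jitter from a seeded NumPy generator (the seed value 0 and the names `rng` and `top` are arbitrary choices, not part of the original answer):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

df = pd.DataFrame({'known': [1, 1, 1, 1, 0.5, 0.5, 0, 0]})

# Uniform jitter in [-0.005, 0.005), small enough not to reorder distinct ratings
df['fudge'] = df['known'] + rng.uniform(-0.005, 0.005, size=len(df))

# Rows whose jittered value lies above the 0.75 quantile
top = df.loc[df['fudge'] > df['fudge'].quantile(0.75), 'known']
print(top)
```

The jitter width should stay well below the smallest gap between distinct ratings, otherwise it could move an item across a genuine boundary.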

Additionally, it would be much more efficient to assign quantiles in a vectorized fashion rather than looping over every row. For instance:

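df['quant'] = np.nan

# label each row with the lower bound of its quartile, top quartile first
for q in [0.75, 0.5, 0.25]:
    in_bin = (df.fudge > df.fudge.quantile(q)) & (df.fudge <= df.fudge.quantile(q + 0.25))
    df.loc[in_bin, 'quant'] = q

# rows not matched above fall in the bottom quartile
df['quant'] = df['quant'].fillna(0.0)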
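As a hedged alternative to the loop over cut points, pandas also provides `pd.qcut`, which bins a column into equal-sized quantile groups in one call. This is a sketch of a swapped-in technique applied to the jittered column, not the approach from the original answer; the seed and the column name `quartile` are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

df = pd.DataFrame({'known': [1, 1, 1, 1, 0.5, 0.5, 0, 0]})
df['fudge'] = df['known'] + rng.uniform(-0.005, 0.005, size=len(df))

# labels=False returns integer bin codes 0..3 (0 = lowest quartile)
df['quartile'] = pd.qcut(df['fudge'], q=4, labels=False)

# Count how many rows landed in each quartile
print(df['quartile'].value_counts().sort_index())
```

Calling `pd.qcut` on the unjittered column would fail here because the bin edges are not unique, which is why the jitter is applied first.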
