How to map numeric data into categories / bins in Pandas dataframe


Question

I've just started coding in Python, and my general coding skills are fairly rusty :( so please be a bit patient.

I have a pandas dataframe:

It has around 3m rows. There are 3 kinds of age_units: Y, D, W for years, days & weeks. Any individual over 1 year old has an age unit of Y, and the first grouping I want is <2y old, so all I have to test for in Age Units is Y...

I want to create a new column AgeRange and populate it with the following ranges:

  • <2
  • 2 - 18
  • 18 - 35
  • 35 - 65
  • 65+

So I wrote a function:

def agerange(values):
    for i in values:
        if complete.Age_units == 'Y':
            if complete.Age > 1 AND < 18 return '2-18'
            elif complete.Age > 17 AND < 35 return '18-35'
            elif complete.Age > 34 AND < 65 return '35-65'
            elif complete.Age > 64 return '65+'
        else return '< 2'

I thought that if I passed in the dataframe as a whole I would get back what I needed, and could then create the column I wanted with something like this:

agedetails['age_range'] = ageRange(agedetails)

BUT when I try to run the first piece of code to create the function I get:

  File "<ipython-input-124-cf39c7ce66d9>", line 4
    if complete.Age > 1 AND complete.Age < 18 return '2-18'
                          ^
SyntaxError: invalid syntax

Clearly it is not accepting the AND - but I thought I heard in class I could use AND like this? I must be mistaken, but then what would be the right way to do this?

So after getting that error, I'm not even sure whether passing in a dataframe like this would throw an error as well. I am guessing probably yes. In which case - how would I make that work too?

I am looking to learn the best method, but part of the best method for me is keeping it simple, even if that means doing things in a couple of steps...

Recommended answer

With Pandas, you should avoid row-wise operations, as these usually involve an inefficient Python-level loop. Here are a couple of alternatives.
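For reference, the SyntaxError in the question has two causes: Python's boolean operator is lowercase `and`, and each side of it must be a complete comparison (or the comparison can be chained, as in `18 <= age < 35`). The sketch below is a corrected row-wise version, shown only for comparison with the vectorised options. The helper name `age_range`, the sample data, and the rule that any age recorded in days or weeks counts as '<2' are assumptions made for illustration.

```python
import pandas as pd

def age_range(age, unit):
    """Return the age-range label for one row (hypothetical helper)."""
    # Assumption: ages recorded in days ('D') or weeks ('W') are under 2 years.
    if unit != 'Y' or age < 2:
        return '<2'
    elif age >= 2 and age < 18:   # lowercase `and`, full comparison on each side
        return '2-18'
    elif 18 <= age < 35:          # equivalent chained comparison
        return '18-35'
    elif 35 <= age < 65:
        return '35-65'
    else:
        return '65+'

# Small made-up sample using the question's column names.
sample = pd.DataFrame({'Age': [3, 1, 17, 18, 40, 70],
                       'Age_units': ['W', 'Y', 'Y', 'Y', 'Y', 'Y']})

# Row-wise application: a Python-level loop, so likely slow on millions of rows.
sample['AgeRange'] = [age_range(a, u)
                      for a, u in zip(sample['Age'], sample['Age_units'])]
```

This works, but it calls a Python function once per row, which is the kind of loop worth avoiding on a dataframe with around 3m rows.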

Pandas: pd.cut

As @JonClements suggests, you can use pd.cut for this, the benefit here being that your new column becomes a Categorical.

You only need to define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.

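import numpy as np
import pandas as pd

# df is assumed to hold a numeric 'Age' column (in years)
bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# right=False closes each bin on the left, so an age of exactly 2 falls in '2-18'
df['AgeRange'] = pd.cut(df['Age'], bins, labels=names, right=False)

print(df.dtypes)

# Age             int64
# Age_units      object
# AgeRange     category
# dtype: object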
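The question also mentions ages recorded in days and weeks, which pd.cut on the raw Age column would bin as if they were years. One possible way to handle this, sketched below, is to replace those ages with 0 before binning so that they land in the first bin. The sample frame `people`, the intermediate `age_in_years`, and the decision to treat every D/W row as under two years old are assumptions for illustration. `right=False` is passed so that each bin includes its lower edge, meaning an age of exactly 2 is labelled '2-18' rather than '<2'.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the question.
people = pd.DataFrame({'Age': [5, 30, 1, 2, 18, 64, 65],
                       'Age_units': ['D', 'W', 'Y', 'Y', 'Y', 'Y', 'Y']})

bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# Keep the age where the unit is years; otherwise use 0
# (assumption: anyone measured in days or weeks is under 2 years old).
age_in_years = people['Age'].where(people['Age_units'] == 'Y', 0)

# right=False gives left-closed bins: [0, 2), [2, 18), [18, 35), [35, 65), [65, inf)
people['AgeRange'] = pd.cut(age_in_years, bins, labels=names, right=False)
```

Both steps are vectorised, so no Python-level loop over rows is involved.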

NumPy: np.digitize

np.digitize provides another clean solution. The idea is to define your boundaries and names, create a dictionary, then apply np.digitize to your Age column. Finally, use your dictionary to map your category names.

Note that for boundary cases the lower bound is used for mapping to a bin: a value equal to a boundary goes into the bin that starts at that boundary, so an age of exactly 18 maps to '18-35'.

import pandas as pd, numpy as np

df = pd.DataFrame({'Age': [99, 53, 71, 84, 84],
                   'Age_units': ['Y', 'Y', 'Y', 'Y', 'Y']})

bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']

d = dict(enumerate(names, 1))

df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))

Result

   Age Age_units AgeRange
0   99         Y      65+
1   53         Y    35-65
2   71         Y      65+
3   84         Y      65+
4   84         Y      65+
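Since every age in the sample above falls well inside a bin, a quick check on values that sit exactly on the boundaries may help confirm the lower-bound behaviour. The sketch below uses made-up ages and, as a variation, indexes into an array of names instead of going through a dictionary. It assumes all ages are non-negative, because a negative age would give index 0 and be mislabelled.

```python
import numpy as np

bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# Made-up ages chosen to sit on or next to each boundary.
edge_ages = np.array([0, 1, 2, 17, 18, 34, 35, 64, 65])

# np.digitize returns i such that bins[i-1] <= age < bins[i];
# ages at or above the last boundary get i == len(bins).
idx = np.digitize(edge_ages, bins)

# Shift to 0-based positions and look up the label for each age.
labels = np.array(names)[idx - 1]

for age, label in zip(edge_ages, labels):
    print(age, label)
```

Each boundary value (2, 18, 35, 65) is labelled with the range that begins at it.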
