Python: generate unique ranges of a specific length and categorize them

Problem description

I have a dataframe column that specifies how many times each user has performed an activity, e.g.:

>>> df['ActivityCount']
Users     ActivityCount
User0     220
User1     190
User2     105
User3     109
User4     271
User5     265
     ...
User95     64
User96     15
User97    168
User98    251
User99    278
Name: ActivityCount, Length: 100, dtype: int32


>>> activities = sorted(df['ActivityCount'].unique())
[9, 15, 16, 17, 20, 23, 25, 26, 28, 31, 33, 34, 36, 38, 39, 43, 49, 57, 59, 64, 65, 71, 76, 77, 78,
83, 88, 94, 95, 100, 105, 109, 110, 111, 115, 116, 117, 120, 132, 137, 138, 139, 140, 141, 144, 145, 148, 153, 155, 157, 162, 168, 177, 180, 182, 186, 190, 192, 194, 197, 203, 212, 213, 220, 223, 231, 232, 238, 240, 244, 247, 251, 255, 258, 260, 265, 268, 269, 271, 272, 276, 278, 282, 283, 285, 290]

Based on their ActivityCount, I have to divide users into 5 different categories, e.g. A, B, C, D and E. The activity count range varies from time to time. In the above example it is roughly 9-290 (the lowest and highest values of the series), but it could be 5-500 or 5-30. In the above example, I can take the maximum number of activities, divide it by 5, and categorize each user into ranges of width 58 (from 290/5), like Range A: 0-58, Range B: 59-116, Range C: 117-174, etc.

Is there any other way to achieve this using pandas or numpy, so that I can directly assign the column values to the given categories? Expected output:

>>> df
Users     ActivityCount  Category/Range 
User0     220             D
User1     190             D
User2     105             B 
User3     109             B
User4     271             E  
User5     265             E
     ...
User95     64             B
User96     15             A
User97    168             C
User98    251             E
User99    278             E

Solution

The natural way to do this is to split the range of the data into 5 equal-width intervals and then assign each value to the bin it falls into. Luckily, pandas lets you do that easily with pd.cut:

df["Category"] = pd.cut(df["Activity"], bins=5, labels=["a", "b", "c", "d", "e"])

The output is something like:

    Activity Category
34       115        b
15        43        a
57       192        d
78       271        e
26        88        b
6         25        a
55       186        d
63       220        d
1         15        a
76       268        e
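
With an integer number of bins, pd.cut spans the observed minimum to the observed maximum, so the edges will not line up with the 0-58, 59-116, ... ranges described in the question. If the ranges should start at 0, explicit bin edges can be passed instead. A minimal sketch, assuming a small made-up sample and the column name ActivityCount from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample built from values that appear in the question.
df = pd.DataFrame(
    {"ActivityCount": [220, 190, 105, 109, 271, 265, 64, 15, 168, 251, 278, 290]}
)

# Six edges give five equal-width bins from 0 up to the observed maximum.
edges = np.linspace(0, df["ActivityCount"].max(), 6)

# Bins are closed on the right by default: (0, 58], (58, 116], ...
# include_lowest=True also keeps a value of exactly 0 in the first bin.
df["Category"] = pd.cut(
    df["ActivityCount"],
    bins=edges,
    labels=list("ABCDE"),
    include_lowest=True,
)
print(df)
```

Because the edges are recomputed from the current maximum, the same lines should keep working when the activity range changes to, say, 5-500.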

An alternative view - clustering

In the above method, we split the data into 5 bins of equal width. An alternative, more sophisticated approach is to split the data into 5 clusters so that the data points in each cluster are as similar to each other as possible. In machine learning, this is known as a clustering problem.

One classic clustering algorithm is k-means. It is typically used for data with multiple dimensions (e.g. monthly activity, age, gender, etc.), so this one-dimensional example is a very simple case of clustering.

In this case, k-means clustering can be done in the following way:

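import numpy as np
import pandas as pd
from scipy.cluster.vq import vq, kmeans, whiten

# l is the list of activity counts (one entry per user)
df = pd.DataFrame({"Activity": l})

# k-means expects a 2-D array of shape (n_samples, n_features)
features = np.array([[x] for x in df.Activity], dtype=float)
# Rescale each feature to unit variance
whitened = whiten(features)
# Find 5 centroids, then assign each point to its nearest centroid
codebook, distortion = kmeans(whitened, 5)
code, dist = vq(whitened, codebook)

df["Category"] = code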

And the output looks like:

    Activity  Category
40       138         1
79       272         0
72       255         0
13        38         3
41       139         1
65       231         0
26        88         2
59       197         4
76       268         0
45       145         1

A couple of notes:

  • The labels of the categories are arbitrary and carry no order. In the output above, label '2' (88) refers to lower activity than label '1' (138-145), while label '0' covers the highest values.
  • I didn't migrate the labels from 0-4 to A-E. This can easily be done using pandas' map.
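
Both notes can be handled together: ranking the clusters by their mean activity and mapping the ranks to letters yields ordered A-E labels. A minimal sketch, assuming the ten rows from the k-means output above as sample data; the names cluster_means and label_to_letter are illustrative only:

```python
import pandas as pd

# Sample rows copied from the k-means output above (numeric labels are arbitrary).
df = pd.DataFrame({
    "Activity": [138, 272, 255, 38, 139, 231, 88, 197, 268, 145],
    "Category": [1, 0, 0, 3, 1, 0, 2, 4, 0, 1],
})

# Mean activity per cluster, sorted from least to most active.
cluster_means = df.groupby("Category")["Activity"].mean().sort_values()

# Pair each numeric label with a letter in that order.
label_to_letter = dict(zip(cluster_means.index, "ABCDE"))

# Replace the arbitrary numeric labels with the ordered letters.
df["Category"] = df["Category"].map(label_to_letter)
print(df)
```

Since k-means may number the clusters differently on every run, deriving the mapping from the cluster means is safer than hard-coding it.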
