Define and apply custom bins on a dataframe


Question

Using Python, I have created the following data frame, which contains similarity values:

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture    jaccard
1       0.770     0.489        0.388  0.57500000 0.5845137    0.3920000 0.00000000
2       0.067     0.496        0.912  0.13865546 0.6147309    0.6984127 0.00000000
3       0.514     0.426        0.692  0.36440678 0.4787535    0.5198413 0.05882353
4       0.102     0.430        0.739  0.11297071 0.5288008    0.5436508 0.00000000
5       0.560     0.735        0.554  0.48148148 0.8168083    0.4603175 0.00000000
6       0.029     0.302        0.558  0.08547009 0.3928234    0.4603175 0.00000000

I am trying to write an R script that generates another data frame reflecting the bins. Binning should only apply to values above 0.5; everything else goes into bin 0:

Pseudocode:

if (cosinFcolor > 0.5 & cosinFcolor <= 0.6)
   bin = 1
else if (cosinFcolor > 0.6 & cosinFcolor <= 0.7)
   bin = 2
else if (cosinFcolor > 0.7 & cosinFcolor <= 0.8)
   bin = 3
else if (cosinFcolor > 0.8 & cosinFcolor <= 0.9)
   bin = 4
else if (cosinFcolor > 0.9 & cosinFcolor <= 1.0)
   bin = 5
else
   bin = 0

Based on the logic above, I want to build a data frame like this (first row shown):

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1           3         0            0           1         1            0       0

How can I start writing this as a script, or should I do it in Python? I am trying to get familiar with R after finding out how powerful it is and how many machine learning packages it has. My goal is to build a classifier, but first I need to get familiar with R :)

Recommended answer

Another answer based on cut, this one taking the extremes (values below 0.5 or above 1) into account:

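# Read the table copied from the question via the clipboard
dat <- read.table("clipboard", header=TRUE)

# Bin every column: (-Inf,0.5] -> 0, (0.5,0.6] -> 1, ..., (0.9,1] -> 5, (1,Inf] -> 6
cuts <- apply(dat, 2, cut, c(-Inf,seq(0.5, 1, 0.1), Inf), labels=0:6)
# Values above 1 are outside the wanted bins, so fold label 6 into bin 0
cuts[cuts=="6"] <- "0"
cuts <- as.data.frame(cuts)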

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1           3         0            0           1         1            0       0
2           0         0            5           0         2            2       0
3           1         0            2           0         0            1       0
4           0         0            3           0         1            1       0
5           1         3            1           0         4            0       0
6           0         0            1           0         0            0       0

Explanation

The cut function splits a vector into bins according to the break points you specify. So let's take 1:10 and split it at 3, 5 and 7.

cut(1:10, c(3, 5, 7))
 [1] <NA>  <NA>  <NA>  (3,5] (3,5] (5,7] (5,7] <NA>  <NA>  <NA> 
Levels: (3,5] (5,7]

You can see that it has produced a factor whose levels are the intervals between the breaks. Also notice that 3 itself is not included, because each interval is open on the left (there is an include.lowest argument that will include it). But these are terrible names for groups, so let's call them group 1 and group 2.
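As a small illustration of the include.lowest argument mentioned above, the sketch below runs the same cut twice on 1:10. The variable names default_bins and with_lowest are made up for this example.

```r
# Default behaviour: intervals are open on the left, so 3 is not in (3,5]
default_bins <- cut(1:10, c(3, 5, 7))

# include.lowest = TRUE closes the first interval on the left, giving [3,5]
with_lowest <- cut(1:10, c(3, 5, 7), include.lowest = TRUE)

print(default_bins)
print(with_lowest)
```

Only the lowest break is affected; values above 7 are still outside every interval and stay NA.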

cut(1:10, c(3, 5, 7), labels=1:2)
 [1] <NA> <NA> <NA> 1    1    2    2    <NA> <NA> <NA>

Better, but what's with the NAs? Those values lie outside our boundaries, so they are not assigned to any bin. To capture them, in my solution I added -Inf and Inf as breaks so that every point is included. Notice that as we add more breaks, we need more labels (one fewer than the number of breaks):

x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
x
 [1] 1 1 1 2 2 3 3 4 4 4
Levels: 1 2 3 4

Ok, but we didn't want a group 4 (as per your problem). We wanted everything in group 4 to end up in group 1. So let's relabel the entries labelled '4'.

x[x=="4"] <- "1"
x
 [1] 1 1 1 2 2 3 3 1 1 1
Levels: 1 2 3 4

This is slightly different from what I did in the solution above, where I replaced the last label in one go at the end, after every column had been cut. I've done it this way here so you can better see how cut works.
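Note that after the relabelling the factor still lists "4" among its levels, even though no entry uses it. If that matters, one option (a sketch, not part of the original solution) is to drop unused levels with droplevels:

```r
x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels = 1:4)
x[x == "4"] <- "1"   # fold group 4 into group 1

levels(x)            # "4" is still listed although it is unused

x <- droplevels(x)   # remove levels that no longer occur
levels(x)
```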

Ok, the apply function. So far we've been using cut on a single vector, but you want it used on a collection of vectors: each column of your data frame. That is what the second argument of apply controls: 1 applies the function to every row, 2 applies it to every column. Here it applies cut to each column of your data frame. Everything after cut in the call to apply is simply passed on as arguments to cut, which we discussed above.
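A variation on the same idea, offered as a sketch rather than as the answer's method: looping over the columns with lapply keeps the result a data frame and returns integer bins instead of character labels. The helper name bin_column is hypothetical, and the data are the first three rows and columns of the table in the question.

```r
dat <- data.frame(
  cosinFcolor  = c(0.770, 0.067, 0.514),
  cosinEdge    = c(0.489, 0.496, 0.426),
  cosinTexture = c(0.388, 0.912, 0.692)
)

# Same breaks as in the answer: (-Inf,0.5], (0.5,0.6], ..., (0.9,1], (1,Inf]
breaks <- c(-Inf, seq(0.5, 1, 0.1), Inf)

# Hypothetical helper: bin one numeric column into the integers 0..5
bin_column <- function(col) {
  codes <- as.integer(cut(col, breaks)) - 1L  # interval index shifted to 0..6
  codes[codes == 6L] <- 0L                    # values above 1 go to bin 0
  codes
}

binned <- as.data.frame(lapply(dat, bin_column))
print(binned)
```

Integer columns may be more convenient than character ones if the bins are later fed to a classifier, though that depends on the model being used.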

Hope that helps.
