在数据框上定义和应用自定义容器 [英] Define and apply custom bins on a dataframe

查看:74
本文介绍了在数据框上定义和应用自定义容器的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

使用python,我创建了以下包含相似值的数据框:

Using python I have created following data frame which contains similarity values:

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture    jaccard
1       0.770     0.489        0.388  0.57500000 0.5845137    0.3920000 0.00000000
2       0.067     0.496        0.912  0.13865546 0.6147309    0.6984127 0.00000000
3       0.514     0.426        0.692  0.36440678 0.4787535    0.5198413 0.05882353
4       0.102     0.430        0.739  0.11297071 0.5288008    0.5436508 0.00000000
5       0.560     0.735        0.554  0.48148148 0.8168083    0.4603175 0.00000000
6       0.029     0.302        0.558  0.08547009 0.3928234    0.4603175 0.00000000

我是试图编写一个R脚本来生成另一个反映bin的数据帧,但是如果值大于0.5,则我的合并条件适用,例如

I am trying to write a R script to generate another data frame that reflects the bins, but my condition of binning applies if the value is above 0.5 such that

Pseudocode:

Pseudocode:

if (cosinFcolor > 0.5 & cosinFcolor <= 0.6)
   bin = 1
if (cosinFcolor > 0.6 & cosinFcolor <= 0.7)
   bin = 2
if (cosinFcolor > 0.7 & cosinFcolor =< 0.8)
   bin = 3
if (cosinFcolor > 0.8 & cosinFcolor <=0.9)
   bin = 4
if (cosinFcolor > 0.9 & cosinFcolor <= 1.0)
   bin = 5
else
   bin = 0

基于上述逻辑,我想构建一个数据框

Based on above logic, I want to build a data frame

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture    jaccard
1       3         0         0            1           1        0               0

如何以脚本形式启动它,还是应该在python中执行此操作?在找出R的功能/它拥有的机器学习包数量之后,我试图熟悉R。
我的目标是建立一个分类器,但首先我需要熟悉R:)

How can I start this as a script, or should I do this in python? I am trying to get familiar with R after finding out how powerful it is/number of machine learning packages it has. My goal is to build a classifier but first I need be familiar with R :)

推荐答案

另一个简单的答案是考虑到极值:

Another cut answer that takes into account extrema:

dat <- read.table("clipboard", header=TRUE)

cuts <- apply(dat, 2, cut, c(-Inf,seq(0.5, 1, 0.1), Inf), labels=0:6)
cuts[cuts=="6"] <- "0"
cuts <- as.data.frame(cuts)

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1           3         0            0           1         1            0       0
2           0         0            5           0         2            2       0
3           1         0            2           0         0            1       0
4           0         0            3           0         1            1       0
5           1         3            1           0         4            0       0
6           0         0            1           0         0            0       0



说明



剪切函数拆分为垃圾箱取决于您指定的削减。因此,让我们以1:10分别将其分割为3、5和7。

Explanation

The cut function splits into bins depending on the cuts you specify. So let's take 1:10 and split it at 3, 5 and 7.

cut(1:10, c(3, 5, 7))
 [1] <NA>  <NA>  <NA>  (3,5] (3,5] (5,7] (5,7] <NA>  <NA>  <NA> 
Levels: (3,5] (5,7]

您会看到它是如何影响休息时间之间的水平的。 't include 3(有一个 include.lowest 参数将包含它)。但是这些是组的糟糕名称,我们称它们为组1和2。

You can see how it has made a factor where the levels are those in between the breaks. Also notice it doesn't include 3 (there's an include.lowest argument which will include it). But these are terrible names for groups, let's call them group 1 and 2.

cut(1:10, c(3, 5, 7), labels=1:2)
 [1] <NA> <NA> <NA> 1    1    2    2    <NA> <NA> <NA>

更好,但是NA是什么呢?它们在我们的范围之内并且不计算在内。要计算它们,在我的解决方案中,我添加了-infinity和infinity,因此所有点都包括在内。休息,我们将需要更多标签:

Better, but what's with the NAs? They are outside our boundaries and not counted. To count them, in my solution, I added -infinity and infinity, so all points would be included. Notice that as we have more breaks, we'll need more labels:

x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
 [1] 1 1 1 2 2 3 3 4 4 4
Levels: 1 2 3 4

好,但我们不想4(根据您的问题)。我们希望所有4s都属于第1组。因此,让我们删除标记为 4的条目。

Ok, but we didn't want 4 (as per your problem). We wanted all the 4s to be in group 1. So let's get rid of the entries which are labelled '4'.

x[x=="4"] <- "1"
 [1] 1 1 1 2 2 3 3 1 1 1
Levels: 1 2 3 4

这与我之前所做的略有不同,请注意,我删除了末尾所有的最后一个标签,但是我已经做到了这样,您就可以更好地了解 cut 的工作原理。

This is slightly different to what I did before, notice I took away all the last labels at the end before, but I've done it this way here so you can better see how cut works.

确定,应用函数。到目前为止,我们一直在单个矢量上使用cut。但您希望将其用于向量集合:数据框的每一列。这就是 apply 的第二个参数。 1将功能应用于所有行,2适用于所有列。将 cut 函数应用于数据框的每一列。 apply函数中 cut 之后的所有内容只是 cut 的参数,我们在上面已经讨论过。

Ok, the apply function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what the second argument of apply does. 1 applies the function to all rows, 2 applies to all columns. Apply the cut function to each column of your data frame. Everything after cut in the apply function are just arguments to cut, which we discussed above.

希望有帮助。

这篇关于在数据框上定义和应用自定义容器的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆