Define and apply custom bins on a dataframe
Problem description
Using Python, I have created the following data frame, which contains similarity values:
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 0.770 0.489 0.388 0.57500000 0.5845137 0.3920000 0.00000000
2 0.067 0.496 0.912 0.13865546 0.6147309 0.6984127 0.00000000
3 0.514 0.426 0.692 0.36440678 0.4787535 0.5198413 0.05882353
4 0.102 0.430 0.739 0.11297071 0.5288008 0.5436508 0.00000000
5 0.560 0.735 0.554 0.48148148 0.8168083 0.4603175 0.00000000
6 0.029 0.302 0.558 0.08547009 0.3928234 0.4603175 0.00000000
I am trying to write an R script to generate another data frame that reflects the bins, but my binning condition applies only if the value is above 0.5, such that:
Pseudocode:
if (cosinFcolor > 0.5 & cosinFcolor <= 0.6)
  bin = 1
else if (cosinFcolor > 0.6 & cosinFcolor <= 0.7)
  bin = 2
else if (cosinFcolor > 0.7 & cosinFcolor <= 0.8)
  bin = 3
else if (cosinFcolor > 0.8 & cosinFcolor <= 0.9)
  bin = 4
else if (cosinFcolor > 0.9 & cosinFcolor <= 1.0)
  bin = 5
else
  bin = 0
Based on the above logic, I want to build a data frame:
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 3 0 0 1 1 0 0
How can I start this as a script, or should I do this in Python? I am trying to get familiar with R after finding out how powerful it is and how many machine learning packages it has. My goal is to build a classifier, but first I need to be familiar with R :)
Recommended answer
Another cut answer that takes into account extrema:
dat <- read.table("clipboard", header=TRUE)
cuts <- apply(dat, 2, cut, c(-Inf,seq(0.5, 1, 0.1), Inf), labels=0:6)
cuts[cuts=="6"] <- "0"
cuts <- as.data.frame(cuts)
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 3 0 0 1 1 0 0
2 0 0 5 0 2 2 0
3 1 0 2 0 0 1 0
4 0 0 3 0 1 1 0
5 1 3 1 0 4 0 0
6 0 0 1 0 0 0 0
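If reading from the clipboard is not convenient, the same result can be reproduced with the data typed inline. The following is only a sketch of a variant, not the code above: it assumes `lapply` instead of `apply` and `labels = FALSE` (which makes `cut` return integer bin codes), and the names `breaks` and `bins` are made up for illustration.

```r
# Sample data from the question, typed inline instead of read from the clipboard.
dat <- data.frame(
  cosinFcolor  = c(0.770, 0.067, 0.514, 0.102, 0.560, 0.029),
  cosinEdge    = c(0.489, 0.496, 0.426, 0.430, 0.735, 0.302),
  cosinTexture = c(0.388, 0.912, 0.692, 0.739, 0.554, 0.558),
  histoFcolor  = c(0.57500000, 0.13865546, 0.36440678, 0.11297071, 0.48148148, 0.08547009),
  histoEdge    = c(0.5845137, 0.6147309, 0.4787535, 0.5288008, 0.8168083, 0.3928234),
  histoTexture = c(0.3920000, 0.6984127, 0.5198413, 0.5436508, 0.4603175, 0.4603175),
  jaccard      = c(0, 0, 0.05882353, 0, 0, 0)
)

# Same breaks as above: everything <= 0.5, five bins of width 0.1, everything > 1.
breaks <- c(-Inf, seq(0.5, 1, 0.1), Inf)

# lapply visits each column; labels = FALSE yields integer codes 1..7.
bins <- as.data.frame(lapply(dat, function(col) {
  code <- cut(col, breaks, labels = FALSE) - 1L  # shift codes to 0..6
  code[code == 6L] <- 0L                         # values above 1 also go to bin 0
  code
}))

print(bins)
```

Working with integer codes means the result columns are plain integers rather than character strings or factors, which may be easier to feed into later modelling steps.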
Explanation
The cut function splits values into bins depending on the breaks you specify. So let's take 1:10 and split it at 3, 5 and 7.
cut(1:10, c(3, 5, 7))
[1] <NA> <NA> <NA> (3,5] (3,5] (5,7] (5,7] <NA> <NA> <NA>
Levels: (3,5] (5,7]
You can see how it has made a factor where the levels are the intervals between the breaks. Also notice that it doesn't include 3 (there's an include.lowest argument which will include it). But these are terrible names for groups, so let's call them groups 1 and 2.
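As a small illustration of the include.lowest argument mentioned above, here is a sketch comparing the default behaviour with `include.lowest = TRUE` (the variable names `default_cut` and `closed_cut` are made up for this example):

```r
# Default: intervals are open on the left, so 3 itself falls outside (3,5].
default_cut <- cut(1:10, c(3, 5, 7))
print(default_cut[3])

# include.lowest = TRUE closes the first interval on the left, so 3 is kept.
closed_cut <- cut(1:10, c(3, 5, 7), include.lowest = TRUE)
print(closed_cut[3])
print(levels(closed_cut))
```

Only the lowest break is affected; values below 3 or above 7 are still NA.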
cut(1:10, c(3, 5, 7), labels=1:2)
[1] <NA> <NA> <NA> 1 1 2 2 <NA> <NA> <NA>
Better, but what's with the NAs? They are outside our boundaries and not counted. To count them, in my solution, I added -Inf and Inf as breaks, so all points would be included. Notice that as we have more breaks, we'll need more labels:
x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
[1] 1 1 1 2 2 3 3 4 4 4
Levels: 1 2 3 4
Ok, but we didn't want 4 (as per your problem). We wanted all the 4s to be in group 1. So let's get rid of the entries which are labelled '4'.
x[x=="4"] <- "1"
[1] 1 1 1 2 2 3 3 1 1 1
Levels: 1 2 3 4
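Note that the output still lists 4 as a level even though no value uses it any more. If that matters for later steps, one option (an addition here, not part of the original solution) is base R's `droplevels`, sketched below:

```r
x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels = 1:4)
x[x == "4"] <- "1"

# The factor still remembers "4" as a possible level.
print(levels(x))

# droplevels removes levels that no longer occur in the data.
x <- droplevels(x)
print(levels(x))
```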
This is slightly different from what I did before. Notice that earlier I replaced the last label in one go at the very end, but I've done it this way here so you can better see how cut works.
Ok, the apply function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what the second argument of apply controls: 1 applies the function to each row, 2 applies it to each column. So here we apply the cut function to each column of your data frame. Everything after cut in the apply call is just an argument to cut, as discussed above.
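To make the role of the second argument concrete, here is a toy sketch on a small matrix. It uses `sum` rather than `cut` so the results are easy to check by hand; the matrix `m` and the result names are made up for illustration.

```r
# A 2 x 3 matrix with one missing value:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]   NA    4    6
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)

# Margin 1: the function is called once per row.
row_totals <- apply(m, 1, sum, na.rm = TRUE)

# Margin 2: the function is called once per column.
col_totals <- apply(m, 2, sum, na.rm = TRUE)

print(row_totals)
print(col_totals)
```

Here `na.rm = TRUE` plays the same role as the breaks and labels in the solution: apply does not interpret it, it just hands it on to the function being applied.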
Hope that helps.