Define and apply custom bins on a dataframe
Question
Using Python, I have created the following data frame, which contains similarity values:
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 0.770 0.489 0.388 0.57500000 0.5845137 0.3920000 0.00000000
2 0.067 0.496 0.912 0.13865546 0.6147309 0.6984127 0.00000000
3 0.514 0.426 0.692 0.36440678 0.4787535 0.5198413 0.05882353
4 0.102 0.430 0.739 0.11297071 0.5288008 0.5436508 0.00000000
5 0.560 0.735 0.554 0.48148148 0.8168083 0.4603175 0.00000000
6 0.029 0.302 0.558 0.08547009 0.3928234 0.4603175 0.00000000
I am trying to write an R script to generate another data frame that reflects the bins. Binning applies only when the value is above 0.5, such that:
Pseudocode:
if (cosinFcolor > 0.5 & cosinFcolor <= 0.6)
    bin = 1
else if (cosinFcolor > 0.6 & cosinFcolor <= 0.7)
    bin = 2
else if (cosinFcolor > 0.7 & cosinFcolor <= 0.8)
    bin = 3
else if (cosinFcolor > 0.8 & cosinFcolor <= 0.9)
    bin = 4
else if (cosinFcolor > 0.9 & cosinFcolor <= 1.0)
    bin = 5
else
    bin = 0
Based on the above logic, I want to build a data frame whose first row would be:
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 3 0 0 1 1 0 0
How can I start this as a script, or should I do this in Python? I am trying to get familiar with R after finding out how powerful it is and how many machine learning packages it has. My goal is to build a classifier, but first I need to become familiar with R :)
Recommended answer
Another cut answer that takes the extrema into account:
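# Read the table copied from the question (the "clipboard" connection is platform-dependent)
dat <- read.table("clipboard", header=TRUE)
# Bin every column: label 0 is <= 0.5, labels 1-5 are the 0.1-wide bins, label 6 is > 1
cuts <- apply(dat, 2, cut, c(-Inf,seq(0.5, 1, 0.1), Inf), labels=0:6)
# Fold the out-of-range bin above 1 back into bin 0
cuts[cuts=="6"] <- "0"
cuts <- as.data.frame(cuts)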
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 3 0 0 1 1 0 0
2 0 0 5 0 2 2 0
3 1 0 2 0 0 1 0
4 0 0 3 0 1 1 0
5 1 3 1 0 4 0 0
6 0 0 1 0 0 0 0
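One detail about the result above: apply coerces factor results to a character matrix, so the bin labels come back as text (or factors, depending on the R version) rather than numbers. If numeric bins are needed, for instance as classifier features, a minimal sketch of the conversion could look like the following. The small dat here is a stand-in built from the first three rows of two columns in the question, not the full data.

```r
# Stand-in data: first three rows of two columns from the question
dat <- data.frame(cosinFcolor  = c(0.770, 0.067, 0.514),
                  cosinTexture = c(0.388, 0.912, 0.692))

# Same binning as in the answer
cuts <- apply(dat, 2, cut, c(-Inf, seq(0.5, 1, 0.1), Inf), labels = 0:6)
cuts[cuts == "6"] <- "0"
cuts <- as.data.frame(cuts)

# Convert every column from text (or factor) labels to integers;
# going through as.character() is safe for both column types
cuts[] <- lapply(cuts, function(col) as.integer(as.character(col)))
str(cuts)
```

Going through as.character first matters if the columns are factors, because as.integer on a factor returns the internal level codes rather than the labels.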
Explanation
The cut function splits a vector into bins according to the breaks you specify. So let's take 1:10 and split it at 3, 5 and 7.
cut(1:10, c(3, 5, 7))
[1] <NA> <NA> <NA> (3,5] (3,5] (5,7] (5,7] <NA> <NA> <NA>
Levels: (3,5] (5,7]
You can see how it has made a factor whose levels are the intervals between the breaks. Also notice that it doesn't include 3 (there's an include.lowest argument which will include it). But these are terrible names for groups, so let's call them group 1 and 2.
cut(1:10, c(3, 5, 7), labels=1:2)
[1] <NA> <NA> <NA> 1 1 2 2 <NA> <NA> <NA>
Better, but what's with the NAs? Those values are outside our boundaries and are not counted. To count them, in my solution, I added -Inf and Inf as breaks, so all points are included. Notice that as we have more breaks, we'll need more labels:
x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
x
[1] 1 1 1 2 2 3 3 4 4 4
Levels: 1 2 3 4
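The rule behind "more breaks, more labels" is that cut needs exactly one label per interval, that is, one fewer label than there are breaks. A small sketch that derives the labels from the breaks instead of typing them by hand (the variable names breaks and labels are only illustrative):

```r
breaks <- c(-Inf, 3, 5, 7, Inf)

# One label per interval: length(breaks) - 1 of them
labels <- seq_len(length(breaks) - 1)

x <- cut(1:10, breaks, labels = labels)

# Count how many values landed in each bin
table(x)
```

This produces the same x as above, and it keeps working if breaks are added or removed later.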
Ok, but we didn't want a group 4 (as per your problem). We wanted all the 4s to be in group 1. So let's relabel the entries which are labelled '4'.
x[x=="4"] <- "1"
x
[1] 1 1 1 2 2 3 3 1 1 1
Levels: 1 2 3 4
This is slightly different from what I did before: in my solution I replaced the last label only at the end, after cut had been applied to every column. I've done it this way here so you can better see how cut works.
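Note that after the relabelling, the factor still lists 4 among its levels even though no value uses it. If that matters (for example when tabulating), the unused level can be dropped with droplevels. A minimal sketch:

```r
x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels = 1:4)
x[x == "4"] <- "1"

# Level "4" is still registered although nothing uses it
levels(x)

# droplevels() removes levels that no longer occur in the data
x <- droplevels(x)
levels(x)
```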
Ok, the apply function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what the second argument of apply controls: 1 applies the function to every row, 2 applies it to every column. So here, apply runs the cut function on each column of your data frame. Everything after cut in the apply call is just an argument passed on to cut, as discussed above.
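As an alternative to apply, the same column-by-column idea can be sketched with lapply, which keeps the result as a data frame of integers and avoids the relabelling step. This is not the method used in the answer above, only a variation on it. The helper name bin_column is made up for illustration, and dat is again a stand-in built from the first three rows of two columns in the question.

```r
# Stand-in data: first three rows of two columns from the question
dat <- data.frame(cosinFcolor  = c(0.770, 0.067, 0.514),
                  cosinTexture = c(0.388, 0.912, 0.692))

# Hypothetical helper: bins one numeric vector into 0..5
bin_column <- function(x) {
  # labels = FALSE makes cut() return integer codes 1..5;
  # values outside (0.5, 1] come back as NA
  b <- cut(x, breaks = seq(0.5, 1, by = 0.1), labels = FALSE)
  # Everything outside the bins becomes bin 0
  # (caution: genuinely missing inputs would also become 0)
  b[is.na(b)] <- 0L
  b
}

# A data frame is a list of columns, so lapply() visits each column in turn
binned <- as.data.frame(lapply(dat, bin_column))
binned
```

Because no -Inf and Inf breaks are added here, out-of-range values show up as NA first and are then mapped to 0, which matches the "else bin = 0" branch of the pseudocode in the question.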
Hope that helps.