How do I write a function in R to do calculations on a record?

Question

In C# I am used to the concept of a data set and a current record. It would be easy for me to write a complicated calc-price function with conditions on the current record.

I am having trouble understanding how to do this in R.

I tried the following:

   train <- read.csv("Train.csv" )
   df <- as.data.frame.matrix(train)
   v = c(  df$Fuel.Type ,df$No.Gears)
   names(v ) <- c( "FuelType" ,"NoGears")
   df$FEType = FEType( v)

My function is defined as:

FEType <- function(v    ){
  ret="Low"
  if (v["FuelType"]=='G') {
    ret ="High"
  }
  return(ret)
}

This is not working as I expected. When I examine v, I see that it contains aggregate totals rather than the current row.

Where am I going wrong?

In the question here I see some hints in the last paragraph.

To reproduce the problem and show what I want to do, I have:

IsPretty <-function(PetalWidth){
  if (PetalWidth  >0.3) return("Y")
  return("N")
}

df <- iris
df$Pretty = IsPretty(df$Petal.Width)
    

This gives the error:

the condition has length > 1 and only the first element will be used

Which led me to look into vectors. But I am not confident that is the right direction.

[Update]

I am used to thinking of tables and current records. Thus I was thinking that

df$Pretty = IsPretty(df$Petal.Width)

would have the effect of adding a column to my data frame with the calculated isPretty property

Why can I not include if conditions in my calculation?

Recommended answer

Vectorization is one of the most fundamental (and unusual) things you'll need to get used to in R. Many (most?) R operations are vectorized. But a few things aren't - and if(){}else{} is one of the non-vectorized things. It's used for control flow (whether or not to run a code block) not for vector operations. ifelse() is a separate function that is used for vectors, where the first argument is a "test", and the 2nd and 3rd arguments are the "if yes" and "if no" results. The test is a vector, and the returned value is the appropriate yes/no result for each item in test. The result will be the same length as the test.

So we would write your IsPretty function like this:

IsPretty <- function(PetalWidth){
  return(ifelse(PetalWidth > 0.3, "Y", "N"))
}

df <- iris
df$Pretty = IsPretty(df$Petal.Width)
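
To make the "same length as the test" point concrete, here is a minimal sketch on a tiny vector. The values in widths are made up for illustration and are not taken from iris.

```r
# Made-up petal widths, purely for illustration
widths <- c(0.2, 0.4, 0.3, 1.8)

# The comparison is vectorized: one TRUE/FALSE per element
test <- widths > 0.3

# ifelse() picks "Y" or "N" for each element of the test
labels <- ifelse(test, "Y", "N")

print(test)
print(labels)

# The result has one entry per element of the test
length(labels) == length(test)
```

Because the result lines up element by element with the input, it can be assigned straight into a data frame column.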

Contrast this with an if(){...}else{...} block, where the test condition has length one and arbitrary code can be run in the ... parts. That code may return a bigger result than the test, a smaller result, or no result, and it might modify other objects. You can do anything inside if(){}else{}, but the test condition must have length 1.

You could use your IsPretty function one row at a time - it works fine for any single row. So we could put it in a loop as below, checking one row at a time, giving if() one test at a time, and assigning results one at a time. But R is optimized for vectorization, and this will be noticeably slower and is a bad habit.

IsPrettyIf <-function(PetalWidth){
  if (PetalWidth  >0.3) return("Y")
  return("N")
}

for(i in 1:nrow(df)) {
  df$PrettyLoop[i] = IsPrettyIf(df$Petal.Width[i])
}
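
If a function genuinely cannot be vectorized, one possible middle ground is to let vapply() from base R call it once per element instead of writing the loop by hand. This is only a sketch of an alternative, not part of the approach above: the column name PrettyApply is made up for illustration, and the function is still called once per row, so the vectorized ifelse() version remains the better choice where it applies.

```r
# Scalar version from above: expects a single value
IsPrettyIf <- function(PetalWidth) {
  if (PetalWidth > 0.3) return("Y")
  return("N")
}

df <- iris

# vapply() calls IsPrettyIf once per element and checks that
# each call returns exactly one character value
df$PrettyApply <- vapply(df$Petal.Width, IsPrettyIf, character(1))

# Same answer as the vectorized ifelse() version
identical(df$PrettyApply, ifelse(df$Petal.Width > 0.3, "Y", "N"))
```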

The benchmark below shows that the vectorized version is roughly 50x faster on average (about 80x by median). This is such a simple case and such small data that it doesn't much matter, but on larger data, or with more complex operations, the difference between vectorized and non-vectorized code can be minutes vs days.

microbenchmark::microbenchmark(
  loop = {
    for(i in 1:nrow(df)) {
      df$PrettyLoop[i] = IsPrettyIf(df$Petal.Width[i])
    }
  },
  vectorized = {
    df$Pretty = IsPretty(df$Petal.Width)    
  }
)
Unit: microseconds
       expr    min     lq     mean median      uq     max neval
       loop 3898.9 4365.6 5880.623 5442.3 7041.10 11344.6   100
 vectorized   47.7   59.6  112.288   67.4   83.85  1819.4   100

This is a common bump for R learners - you can find many questions on Stack Overflow where people are using if(){}else{} when they need ifelse() or vice versa. Why can't ifelse return vectors? is a FAQ coming from the opposite side of the problem.
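
The same idea can be applied to the FEType function from the question. The sketch below uses a small made-up data frame as a stand-in for Train.csv, which is not available here; only the column names Fuel.Type and No.Gears come from the question. It also shows that c() joins two columns end to end into one long vector, which may be why v did not look like a single row.

```r
# Hypothetical stand-in for Train.csv; the values are invented
df <- data.frame(
  Fuel.Type = c("G", "D", "G", "E"),
  No.Gears  = c(5, 6, 4, 5),
  stringsAsFactors = FALSE
)

# c() concatenates whole columns, so v holds every value of both
# columns (coerced to character), not one record
v <- c(df$Fuel.Type, df$No.Gears)
length(v)   # twice the number of rows

# Vectorized version: takes a whole column, returns a whole column
FEType <- function(FuelType) {
  ifelse(FuelType == "G", "High", "Low")
}

df$FEType <- FEType(df$Fuel.Type)
print(df)
```

A rule that depends on several columns can take each column as its own argument and combine the tests with & or | inside ifelse().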

df <- iris

## The condition has length equal to the number of rows in the data frame
df$Petal.Width > 0.3
#>   [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [13] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## ... truncated


## R warns us that only the first value (which happens to be FALSE) is used
result = if(df$Petal.Width > 0.3) {"Y"} else {"N"}
#> Warning in if (df$Petal.Width > 0.3) {: the condition has length > 1 and only
#> the first element will be used

## So the result is a single "N"
result  
#> [1] "N"

length(result)
#> [1] 1


## R "recycles" inputs that are of insufficient length
## so we get a full column of "N"
df$Pretty = result
head(df)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Pretty
#> 1          5.1         3.5          1.4         0.2  setosa      N
#> 2          4.9         3.0          1.4         0.2  setosa      N
#> 3          4.7         3.2          1.3         0.2  setosa      N
#> 4          4.6         3.1          1.5         0.2  setosa      N
#> 5          5.0         3.6          1.4         0.2  setosa      N
#> 6          5.4         3.9          1.7         0.4  setosa      N

Created on 2020-11-08 by the reprex package (v0.3.0)
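
Note that the output above comes from an older R release. Starting with R 4.2.0, a condition of length greater than one in if() is an error rather than a warning, so the same line stops instead of silently using the first element. The sketch below is written to run on either version and simply reports which of the two it saw; the variable name outcome is made up for illustration.

```r
# Try the length-150 condition and record how R reacts
outcome <- tryCatch(
  {
    if (iris$Petal.Width > 0.3) "Y" else "N"
  },
  warning = function(w) "warning",  # older R: warns, uses first element
  error   = function(e) "error"     # R >= 4.2.0: refuses to run
)

print(outcome)
```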
