How to apply a function to multiple columns to create multiple new columns in R?


Question

I have this list of sequences, aqi_range, and a data frame, df:

aqi_range = list(0:50,51:100,101:250)

df

   PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max
 1      85.6        3      264       75.7         3       240
 2     105.         6      243       76.4         3       191
 3      95.8       19      287       48.4         8       134
 4      85.5       50      166       64.8        32       103
 5      55.9       24      117       46.7        19        77
 6      37.5        6      116       31.3         3        87
 7      26          5       69       15.5         3        49
 8      82.3       34      169       49.6        25       120
 9      170        68      272       133         67       201
10      254       189      323       226        173       269

I have written these two fairly simple functions, which I want to apply to this data frame to calculate the AQI (Air Quality Index) for each pollutant.

# a = column from a data frame, e.g. PM10_mean or PM2.5_mean
# b = list of sequences defined above
min_max_diff <- function(a, b) {
  for (i in b) {
    if (a %in% i) {
      min_val = min(i)
      max_val = max(i)
      return(max_val - min_val)
    }
  }
}

# a = column from a data frame, e.g. PM10_mean or PM2.5_mean
# b = list of sequences defined above
c_low <- function(a, b) {
  for (i in b) {
    if (a %in% i) {
      min_val = min(i)
      return(min_val)
    }
  }
}

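Basically, the first function, min_max_diff, takes a value from the column df$PM10_mean or df$PM2.5_mean, looks it up in the list aqi_range, and returns the difference between the maximum and minimum of the sequence in which the value is found. Similarly, the second function, c_low, returns just the minimum of that sequence.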

I want to apply this kind of manipulation (the formula below) to the PM10_mean column to create a new column, PM10_AQI:

df$PM10_AQI  = min_max_diff(df$PM10_mean,aqi_range) / (df$PM10_max - df$PM10_min) / * (df$PM10_mean -  df$PM10_min) + c_low(df$PM10_mean,aqi_range)

I hope this is explained clearly.

Recommended answer

If your problem is just how to apply the given transformation to several columns of a data frame, you can write a for loop, build the name of each variable involved in the transformation with a string function (sub() is useful here), and refer to the columns with the [ notation rather than the $ notation, because [ accepts strings to specify columns.
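To make the point about [ versus $ concrete, here is a minimal sketch. The data frame `demo`, the variable `col_name` and the new column `PM10_range` are made up for illustration and are not part of the original question.

```r
# Hypothetical two-column data frame, used only for illustration
demo <- data.frame(PM10_mean = c(85.6, 105.0), PM10_min = c(3, 6))

col_name <- "PM10_mean"          # column name held in a character variable

by_bracket <- demo[, col_name]   # `[` evaluates col_name and returns that column
by_double  <- demo[[col_name]]   # `[[` does the same for a single column
by_dollar  <- demo$col_name      # `$` looks for a column literally called "col_name": NULL

# A new column can be created from a constructed name in the same way
new_name <- sub("_mean$", "_range", col_name)
demo[, new_name] <- demo[, col_name] - demo[, "PM10_min"]
```

Because `$` does not evaluate its argument, it cannot be used with names that are built inside a loop, which is why the loop in this answer relies on `[`.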

Below is a complete example of this approach, using a small sample of 3 observations:

(Note that I changed the definition of the AQI ranges: I now define only the breaks where the range changes, assuming they are all integers. I also collapsed your functions min_max_diff() and c_low() into a single function that returns the minimum and maximum of the AQI range in which each value is found, again assuming that the AQI range bounds are integers.)

# Definition of the AQI ranges (which are assumed to be based on integer values)
# Note that if the number of AQI ranges is k, the number of breaks is k+1
# Each break value defines the minimum of the range
# The maximum of each range is computed as the "minimum of the NEXT range" - 1
# (again this assumes integer values in AQI ranges)
# The values (e.g. PM10_mean) whose AQI range is searched for are assumed
# to NOT be larger than or equal to the largest break value.
aqi_range_breaks = c(0, 51, 101, 251)

# Example data (top 3 rows of the data frame you provided)
df = data.frame(PM10_mean=c(85.6, 105.0, 95.8),
                PM10_min=c(3, 6, 19),
                PM10_max=c(264, 243, 287),
                PM2.5_mean=c(75.7, 76.4, 48.4),
                PM2.5_min=c(3, 3, 8),
                PM2.5_max=c(240, 191, 134))

# Function that returns the minimum and maximum AQI values
# of the AQI range where the given values are found
# `values`: array of values that are searched for in the AQI ranges
# defined by the second parameter.
# `aqi_range_breaks`: breaks defining the minimum values of each AQI range
# plus one last value defining a value never attained by `values`.
# (all values in this parameter defining the AQI ranges are assumed integer values)
find_aqi_range_min_max <- function(values, aqi_range_breaks){
  aqi_range_groups = findInterval(values, aqi_range_breaks)
  return( list(min=aqi_range_breaks[aqi_range_groups],
               max=aqi_range_breaks[aqi_range_groups + 1] - 1))
}

# Run the variable transformation on the selected `_mean` columns
vars_mean = c("PM10_mean", "PM2.5_mean")
for (vmean in vars_mean) {
  vmin = sub("_mean$", "_min", vmean)
  vmax = sub("_mean$", "_max", vmean)
  vaqi = sub("_mean$", "_AQI", vmean)
  aqi_range_min_max = find_aqi_range_min_max(df[,vmean], aqi_range_breaks)
  df[,vaqi] = (aqi_range_min_max$max - aqi_range_min_max$min) / 
              (df[,vmax] - df[,vmin]) / (df[,vmean] -  df[,vmin]) +
              aqi_range_min_max$min
}

Note how findInterval() is used to find the range into which each value of an array falls. That is the key to making your transformation work on a whole data frame column.
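As a small illustration of what findInterval() returns, the sketch below runs it on the three PM10_mean values from the sample data plus one made-up value, 50.5, that lies between two of the original integer sequences. It also shows why a membership test against an integer sequence, as in the question's functions, does not find non-integer values.

```r
aqi_range_breaks <- c(0, 51, 101, 251)
values <- c(85.6, 105.0, 95.8, 50.5)   # 50.5 is a made-up value for illustration

# Index of the interval [break_i, break_{i+1}) that each value falls into
groups <- findInterval(values, aqi_range_breaks)
groups                                  # 2 3 2 1

# Lower and upper bounds of the matching AQI range, looked up by index
lower <- aqi_range_breaks[groups]       # 51 101 51 0
upper <- aqi_range_breaks[groups + 1] - 1   # 100 250 100 50

# By contrast, %in% tests exact membership, so a non-integer value is not found
in_sequence <- 85.6 %in% 51:100         # FALSE
```

findInterval() is vectorized, so the whole column is classified in one call and no loop over the ranges or `if` on each value is needed.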

The expected output of this process is:

  PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max  PM10_AQI    PM2.5_AQI
1      85.6        3      264       75.7         3       240  51.00227 51.002843893
2     105.0        6      243       76.4         3       191 101.00635 51.003550930
3      95.8       19      287       48.4         8       134  51.00238  0.009822411

Please check the formula that computes the AQI, because it contains a syntax error (look for / *, which I replaced with / in the formula in my code).
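One plausible reading, and it is only an assumption about what the asker intended, is that / * was meant to be *, which would make the expression a linear interpolation of the mean between the bounds of its AQI range. The sketch below shows that variant; the helper name `aqi_interpolate` and its parameter names are hypothetical. With `*` the results differ from the output table above, which was produced with `/`.

```r
# Hypothetical helper: same inputs as the loop body in the answer,
# but with `*` in place of the second `/` (an assumed reading of the formula)
aqi_interpolate <- function(mean_val, min_val, max_val, breaks) {
  grp     <- findInterval(mean_val, breaks)
  aqi_min <- breaks[grp]            # lower bound of the matching AQI range
  aqi_max <- breaks[grp + 1] - 1    # upper bound, assuming integer breaks
  (aqi_max - aqi_min) / (max_val - min_val) * (mean_val - min_val) + aqi_min
}

aqi_range_breaks <- c(0, 51, 101, 251)

# PM10 columns of the 3-row sample data
pm10_aqi <- aqi_interpolate(mean_val = c(85.6, 105.0, 95.8),
                            min_val  = c(3, 6, 19),
                            max_val  = c(264, 243, 287),
                            breaks   = aqi_range_breaks)
```

To use this inside the loop, only the operator in the assignment to df[,vaqi] would need to change.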

Note that the $ in the regular expression passed to sub() anchors the match at the end of the string, so "_mean" is replaced only when it occurs at the end of the variable name.
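The effect of the anchor can be checked with a short sketch. The name "PM10_mean_daily" is made up here to show a case where "_mean" is not at the end of the string.

```r
# "_mean" at the end of the name: replaced
anchored_end <- sub("_mean$", "_min", "PM10_mean")        # "PM10_min"

# "_mean" in the middle of a (hypothetical) name: left unchanged by the anchored pattern
anchored_mid <- sub("_mean$", "_min", "PM10_mean_daily")  # "PM10_mean_daily"

# Without the anchor, the first "_mean" is replaced wherever it occurs
unanchored   <- sub("_mean", "_min", "PM10_mean_daily")   # "PM10_min_daily"
```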
