Quickly Categorizing Character Vector in R

Views: 342 · Published: 2017/3/26 2:32:21 · Tags: r, for-loop, dataframe


Problem description


I have a dataset with a column of messy character data. I'd like to convert it to categorical (factor) data for analysis.

carData <- data.frame(car=c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu"), 
                 year=c("2001", "1994", "2004", "1980", "2000"))

            car year
1       Mustang 2001
2 Toyota Tercel 1994
3            M3 2004
4   Datsun 240Z 1980
5  Chevy Malibu 2000

I've created a couple of vectors to aid with this: one holding the search strings, and another holding the associated categories.

cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")

My intent is to loop over the search strings and assign the matching category to a new variable.

categorize <- function(df, searchString, category) {
  df$make <- "OTHER"
  for(i in seq(1, length(searchString), 1)) {
    list <- grep(searchString[i], df[,1], ignore.case=TRUE)
    if (length(list) > 0) {
      for(j in seq(1, length(list), 1)) {
        df$make[list[j]] <- category[i]
      }
    }
  }
  df
}

cleanCarData <- categorize(carData, cars, make)

Output is:

            car year      make
1       Mustang 2001      Ford
2 Toyota Tercel 1994    Toyota
3            M3 2004       BMW
4   Datsun 240Z 1980     OTHER
5  Chevy Malibu 2000 Chevrolet

My code works. My issue is that my data has ~1M rows and the function takes ~3 hours to complete. By contrast, if I write a separate one-line statement for each search string, all of them together finish in less than 3 minutes.

df$make <- "OTHER"
df$make[grep("Mustang", df$car, ignore.case=TRUE)] <- "Ford"
df$make[grep...]

I have 50 search strings so far and could easily have 100 more as I work my way through the data. Is there a good compromise between maintainable code and performance?

Solution

You can make things better by eliminating the inner loop and assigning the category to all matching rows in one vectorized step:

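categorize <- function(df, searchString, category) {
  df$make <- "OTHER"  # default for rows that match no search string
  for (i in seq_along(searchString)) {
    # Row indices whose first column matches the i-th search string
    matches <- grep(searchString[i], df[, 1], ignore.case = TRUE)
    # Assign the category to every matching row at once
    df$make[matches] <- category[i]
  }
  df
}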
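
As a rough sketch of how the same single-loop idea might stay maintainable as the number of search strings grows, the patterns and categories could be kept together in one lookup table. The names `lookup`, `categorize_lookup`, `column`, and `default` are assumptions made for this illustration, not part of the original question or answer. As in the code above, a later pattern overwrites an earlier one when both match the same row.

```r
# Hypothetical lookup table: one row per search string and its category.
lookup <- data.frame(
  pattern = c("Mustang", "Toyota", "M3", "Chevy"),
  make    = c("Ford", "Toyota", "BMW", "Chevrolet"),
  stringsAsFactors = FALSE
)

# Sample data from the question.
carData <- data.frame(
  car  = c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu"),
  year = c("2001", "1994", "2004", "1980", "2000"),
  stringsAsFactors = FALSE
)

# One pass per pattern; each pass updates all matching rows at once.
categorize_lookup <- function(df, lookup, column = "car", default = "OTHER") {
  result <- rep(default, nrow(df))
  for (i in seq_len(nrow(lookup))) {
    # Logical mask of rows whose text contains the i-th pattern
    hit <- grepl(lookup$pattern[i], df[[column]], ignore.case = TRUE)
    result[hit] <- lookup$make[i]
  }
  df$make <- factor(result)  # store the categories as a factor
  df
}

cleanCarData <- categorize_lookup(carData, lookup)
print(cleanCarData)
```

Working on a plain character vector and attaching it to the data frame once at the end avoids repeated data-frame assignment inside the loop; adding a new search string then only means adding a row to `lookup`.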

It is hard to test at scale whether that's where most of your time is spent, because your sample data isn't very large.
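
One way to probe the question at a larger size is to time both versions on synthetic data. The following is only a sketch: the row count `n`, the random sampling of the five sample models, and the names `bigData`, `categorize_slow`, and `categorize_fast` are assumptions for illustration, and timings will vary by machine and R version.

```r
set.seed(42)  # reproducible synthetic data

cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")

# Synthetic stand-in for the real data; n is kept modest so the slow
# version finishes quickly. Increase it to approach the real row count.
n <- 20000
models <- c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu")
bigData <- data.frame(car = sample(models, n, replace = TRUE),
                      stringsAsFactors = FALSE)

# Question's approach: one data-frame assignment per matching row.
categorize_slow <- function(df, searchString, category) {
  df$make <- "OTHER"
  for (i in seq_along(searchString)) {
    idx <- grep(searchString[i], df[, 1], ignore.case = TRUE)
    for (j in seq_along(idx)) {
      df$make[idx[j]] <- category[i]
    }
  }
  df
}

# Answer's approach: one vectorized assignment per search string.
categorize_fast <- function(df, searchString, category) {
  df$make <- "OTHER"
  for (i in seq_along(searchString)) {
    idx <- grep(searchString[i], df[, 1], ignore.case = TRUE)
    df$make[idx] <- category[i]
  }
  df
}

slow_time <- system.time(slow <- categorize_slow(bigData, cars, make))
fast_time <- system.time(fast <- categorize_fast(bigData, cars, make))

cat("inner loop :", slow_time[["elapsed"]], "seconds\n")
cat("vectorized :", fast_time[["elapsed"]], "seconds\n")

# Both versions should produce the same categories.
stopifnot(identical(slow$make, fast$make))
```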
