Quickly Categorizing Character Vector in R

Problem description

I have a dataset with a column of messy character data. I'd like to convert it to categorical (factor) data for analysis.

carData <- data.frame(car=c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu"), 
                 year=c("2001", "1994", "2004", "1980", "2000"))

            car year
1       Mustang 2001
2 Toyota Tercel 1994
3            M3 2004
4   Datsun 240Z 1980
5  Chevy Malibu 2000

I've created a couple of lists to aid with this, one with a list of search strings, and another with the associated categories.

cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")

My intent is to loop over the list and assign the category in a new variable.

categorize <- function(df, searchString, category) {
  df$make <- "OTHER"
  for(i in seq(1, length(searchString), 1)) {
    list <- grep(searchString[i], df[,1], ignore.case=TRUE)
    if (length(list) > 0) {
      for(j in seq(1, length(list), 1)) {
        df$make[list[j]] <- category[i]
      }
    }
  }
  df
}

cleanCarData <- categorize(carData, cars, make)

Output is:

            car year      make
1       Mustang 2001      Ford
2 Toyota Tercel 1994    Toyota
3            M3 2004       BMW
4   Datsun 240Z 1980     OTHER
5  Chevy Malibu 2000 Chevrolet

My code works, but my data has ~1M rows and it takes ~3 hours to complete. By contrast, if I write a separate one-line statement for each search string, all of them complete in less than 3 minutes.

df$make <- "OTHER"
df$make[grep("Mustang", df$car, ignore.case=TRUE)] <- "Ford"
df$make[grep...]

I have 50 search strings so far and could easily have 100 more as I work my way through the data. Is there a good compromise between maintainable code and performance?

Solution

You can make things better by eliminating the inner loop:

categorize <- function(df, searchString, category) {
  df$make <- "OTHER"
  for(i in seq_along(searchString)) {
    list <- grep(searchString[i], df[,1], ignore.case=TRUE)
    if (length(list) > 0) {
      df$make[list] <- category[i]
    }
  }
  df
}
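
If many of the rows repeat the same strings, a further variation is to run the patterns against the distinct values only and then map the labels back to the rows. This is a sketch under that assumption, not something measured on the real data; `categorize_unique` is a hypothetical name, and the sample vectors are copied from the question.

```r
# Hypothetical helper: categorize_unique() is an illustration only.
categorize_unique <- function(df, searchString, category) {
  # Work on the distinct strings only; this pays off when rows repeat.
  values <- unique(as.character(df[, 1]))
  result <- rep("OTHER", length(values))
  for (i in seq_along(searchString)) {
    hits <- grepl(searchString[i], values, ignore.case = TRUE)
    # Later patterns overwrite earlier ones, as in the loop version.
    result[hits] <- category[i]
  }
  # Map each row back to the label computed for its distinct value.
  df$make <- result[match(as.character(df[, 1]), values)]
  df
}

carData <- data.frame(
  car = c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu"),
  year = c("2001", "1994", "2004", "1980", "2000")
)
cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")

cleanCarData <- categorize_unique(carData, cars, make)
print(cleanCarData)
```

The search strings and categories still live in two plain vectors, so adding another pattern means adding one entry to each.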

This is hard to test at scale to see if that's where most of your time is spent, because your sample data isn't very large.
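
One way to check is to build a larger synthetic data set and time both versions with `system.time`. The sketch below assumes the real data resembles the sample: `bigData`, `categorize_slow`, `categorize_fast`, and the row count `n` are made up for illustration. The timings will depend on the machine and on how many rows each pattern matches.

```r
# Synthetic data: an assumption for illustration, not the real data set.
set.seed(1)
n <- 20000  # raise towards 1e6 to approximate the real workload
models <- c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu")
bigData <- data.frame(car = sample(models, n, replace = TRUE),
                      stringsAsFactors = FALSE)
cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")

# Version with the inner loop: assigns one row at a time.
categorize_slow <- function(df, searchString, category) {
  df$make <- "OTHER"
  for (i in seq_along(searchString)) {
    idx <- grep(searchString[i], df[, 1], ignore.case = TRUE)
    for (j in seq_along(idx)) {
      df$make[idx[j]] <- category[i]
    }
  }
  df
}

# Version without the inner loop: one vectorized assignment per pattern.
categorize_fast <- function(df, searchString, category) {
  df$make <- "OTHER"
  for (i in seq_along(searchString)) {
    idx <- grep(searchString[i], df[, 1], ignore.case = TRUE)
    df$make[idx] <- category[i]
  }
  df
}

slow_time <- system.time(slow <- categorize_slow(bigData, cars, make))
fast_time <- system.time(fast <- categorize_fast(bigData, cars, make))

# Elapsed seconds for each version; both should produce the same labels.
print(c(slow = slow_time[["elapsed"]], fast = fast_time[["elapsed"]]))
print(identical(slow$make, fast$make))
```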
