Quickly Categorizing Character Vector in R
Problem description
I have a dataset with a column of messy character data that I'd like to convert to categorical (factor) data for analysis.
carData <- data.frame(car=c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu"),
                      year=c("2001", "1994", "2004", "1980", "2000"))
            car year
1       Mustang 2001
2 Toyota Tercel 1994
3            M3 2004
4   Datsun 240Z 1980
5  Chevy Malibu 2000
I've created a couple of vectors to help with this: one holding the search strings, and another holding the associated categories.
cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")
My intent is to loop over the list and assign the category in a new variable.
categorize <- function(df, searchString, category) {
  df$make <- "OTHER"
  for(i in seq(1, length(searchString), 1)) {
    list <- grep(searchString[i], df[,1], ignore.case=TRUE)
    if (length(list) > 0) {
      for(j in seq(1, length(list), 1)) {
        df$make[list[j]] <- category[i]
      }
    }
  }
  df
}
cleanCarData <- categorize(carData, cars, make)
Output is:
            car year      make
1       Mustang 2001      Ford
2 Toyota Tercel 1994    Toyota
3            M3 2004       BMW
4   Datsun 240Z 1980     OTHER
5  Chevy Malibu 2000 Chevrolet
My code works. My issue is that my data has ~1M rows and the function takes ~3 hours to complete. By contrast, if I write a separate one-line statement for each search string, all of them together finish in less than 3 minutes.
df$make <- "OTHER"
df$make[grep("Mustang", df$car, ignore.case=TRUE)] <- "Ford"
df$make[grep...]
I have 50 search strings so far and could easily have 100 more as I work my way through the data. Is there a good compromise between maintainable code and performance?
You can make things better by eliminating the inner loop:
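categorize <- function(df, searchString, category) {
  df$make <- "OTHER"
  for(i in seq_along(searchString)) {
    # Row indices whose first column matches the i-th search string
    idx <- grep(searchString[i], df[,1], ignore.case=TRUE)
    if (length(idx) > 0) {
      # Assign the category to all matching rows in one vectorized step
      df$make[idx] <- category[i]
    }
  }
  df
}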
This is hard to test at scale to see whether that's where most of your time is spent, because your sample data isn't very large.
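One way to check this at a larger scale is to generate synthetic data and time both versions. The following is only a rough sketch: it assumes the real data resembles the five sample strings drawn at random, and `bigCarData`, `categorize_elementwise`, `categorize_vectorized`, and the row count `n` are hypothetical names and values chosen for illustration, not taken from the question.

```r
# Benchmark sketch on synthetic data (all names below are illustrative assumptions).
set.seed(42)

cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")

# Synthetic data: draw at random from the five example strings.
n <- 20000  # raise toward 1e6 to approximate the real workload
samples <- c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu")
bigCarData <- data.frame(
  car = sample(samples, n, replace = TRUE),
  year = as.character(sample(1980:2004, n, replace = TRUE)),
  stringsAsFactors = FALSE
)

# Approach from the question: one assignment per matching row.
categorize_elementwise <- function(df, searchString, category) {
  df$make <- "OTHER"
  for (i in seq_along(searchString)) {
    idx <- grep(searchString[i], df[, 1], ignore.case = TRUE)
    for (j in seq_along(idx)) {
      df$make[idx[j]] <- category[i]
    }
  }
  df
}

# Approach from the answer: one vectorized assignment per search string.
categorize_vectorized <- function(df, searchString, category) {
  df$make <- "OTHER"
  for (i in seq_along(searchString)) {
    idx <- grep(searchString[i], df[, 1], ignore.case = TRUE)
    df$make[idx] <- category[i]
  }
  df
}

# Time each version; the results are kept for comparison.
time_elementwise <- system.time(slow <- categorize_elementwise(bigCarData, cars, make))
time_vectorized <- system.time(fast <- categorize_vectorized(bigCarData, cars, make))

print(time_elementwise)
print(time_vectorized)

# Both versions should produce exactly the same data frame.
stopifnot(identical(slow, fast))
```

Timings will depend on the machine, the R version, and how closely the synthetic strings match the real data, so treat the numbers as a relative comparison only.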