Convert data frame columns to factor with indexing

Question

I put some results in a data frame that has a few factor columns and many numeric columns. I can easily convert the numeric columns to numeric with indexing, as per the answer to this question.

#create example data
df = data.frame(replicate(1000,sample(1:10,1000,rep=TRUE)))
df$X1 = LETTERS[df$X1]
df$X2 = LETTERS[df$X2]
df$X3 = LETTERS[df$X3]
df[-1] <- sapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

#find columns that are factors
factornames = c("X1", "X2", "X3")
factorfilt = names(df) %in% factornames

#convert non-factor columns to numeric
df[, !factorfilt] = as.numeric(as.character(unlist(df[, !factorfilt])))

But when I want to do the same for my factor columns, I can't get the same indexing to work:

#convert factor columns to factor
df[, factorfilt] = as.factor(as.character(unlist(df[, factorfilt])))
class(df$X1)

[1] "character"

df[, factorfilt] = as.factor(as.character(df[, factorfilt]))
class(df$X1)

[1] "character"

df[, factorfilt] = as.factor(unlist(df[, factorfilt]))
class(df$X1)

[1] "character"

df[, factorfilt] = as.factor(df[, factorfilt]) 

Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

The first three attempts leave class(df$X1) as "character" and the fourth throws an error, while if I run df$X1 = as.factor(df$X1) it returns "factor".

Why does indexing this way not work when I call as.factor, but does work when I call as.numeric?

Recommended answer

You should look at how the individual steps of what you are doing behave. Defining your data as you did:

df = data.frame(replicate(1000,sample(1:10,1000,rep=TRUE)))
df$X1 = LETTERS[df$X1]
df$X2 = LETTERS[df$X2]
df$X3 = LETTERS[df$X3]
df[-1] <- sapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

factornames = c("X1", "X2", "X3")
factorfilt = names(df) %in% factornames
df[, !factorfilt] = as.numeric(as.character(unlist(df[, !factorfilt])))

Now let's take a look at the result of making X1, X2, and X3 factors as you did, but without assigning it back yet.

test <- as.factor(as.character(df[, factorfilt]))
class(test) # "factor"
length(test) # 3

The important thing to notice here is that test is not a data frame. It is a single factor vector of length 3 that you are attempting to save over three columns of a data frame. I think we should question the wisdom of converting a data frame to a vector in order to store it in a data frame.
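A minimal sketch of what happens in that call, using a hypothetical two-column toy data frame (toy is not part of the original example): as.character() treats the data frame as a list and turns each whole column into one string, so the factor built from it has one element per column rather than one per cell.

```r
# Hypothetical toy data standing in for the character columns of df
toy <- data.frame(X1 = c("A", "B"), X2 = c("C", "D"), stringsAsFactors = FALSE)

# as.character() on a data frame works column by column and produces
# one string per column (the deparsed column), not one string per cell
chr <- as.character(toy)
print(chr)
length(chr)        # equals ncol(toy)

# Wrapping that in as.factor() gives a factor of the same short length
f <- as.factor(chr)
class(f)
is.data.frame(f)   # the data frame structure is gone
```

Printing chr makes it clear that the individual cell values are no longer separate elements, which is why this form cannot be a useful replacement for the columns.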

Then consider your second assignment:

test2 <- as.factor(as.character(unlist(df[, factorfilt])))
class(test2) # factor
length(test2) # 3000

Again, it's a factor, but its length is completely different from that of test. R is being kind by letting you assign this back into df at all, and does so only because it can reconcile the dimensions. But when you try to push the factor into X1, X2, and X3, there is a big question about what to do with the factor levels. Should all three variables share the same levels? Should each variable have only the levels present within itself? Instead of attempting to declare what the "appropriate" choice is, R just ignores the question and converts the values back to character for you to deal with on your own.
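The behavior described above can be reproduced on a small scale. This is a sketch on a hypothetical toy data frame (toy, cols, and flat are illustrative names, not from the original). The last two lines show that matrix() turns a factor into a character matrix; that this is the step responsible inside the data frame assignment is an assumption about the internals, not something the original answer states.

```r
# Hypothetical toy data: two character columns and one integer column
toy <- data.frame(X1 = c("A", "B"), X2 = c("C", "D"), X3 = 1:2,
                  stringsAsFactors = FALSE)
cols <- c("X1", "X2")

# One long factor built from both columns: its levels are pooled
flat <- as.factor(unlist(toy[, cols]))
length(flat)    # rows times columns
levels(flat)

# Assigning that single factor across two columns is accepted,
# but the columns come out as character, not factor
toy[, cols] <- flat
sapply(toy[cols], class)

# Assumed mechanism: reshaping a factor into a matrix drops the factor class
m <- matrix(flat, nrow = 2, ncol = 2)
typeof(m)
```

Assigning to a single column, as in df$X1 = as.factor(df$X1), involves no such reshaping, which is consistent with that form keeping the factor class.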

The fact that manipulating columns this way can change their classes unexpectedly is a good reason not to do it. This is already evident where you assign the NAs. Let's revisit:

df = data.frame(replicate(1000,sample(1:10,1000,rep=TRUE)))
df$X1 = LETTERS[df$X1]
df$X2 = LETTERS[df$X2]
df$X3 = LETTERS[df$X3]

At this point, X4 through X1000 are all integer columns. When you run

df[-1] <- sapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

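they all become character columns: sapply simplifies its result to a single matrix, a matrix can hold only one type, and because X2 and X3 are character, everything in it is coerced to character. You then proceed to convert them to numeric. They aren't even their original class anymore.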
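A small sketch of that coercion, on a hypothetical toy data frame and with a fixed mask in place of runif() so the result is reproducible (toy, mask, and res are illustrative names):

```r
# Hypothetical toy data: X2 is character, X3 and X4 are integer
toy <- data.frame(X1 = c("A", "B", "C"), X2 = c("D", "E", "F"),
                  X3 = 1:3, X4 = 4:6, stringsAsFactors = FALSE)
sapply(toy, class)   # classes before the assignment

# Fixed mask (an assumption for reproducibility) instead of runif() < 0.1
mask <- c(FALSE, TRUE, FALSE)

# sapply() simplifies the per-column results into one matrix;
# a matrix has a single type, so the integers become character
res <- sapply(toy[-1], function(x) ifelse(mask, NA, x))
is.matrix(res)
typeof(res)

# Assigning the matrix back makes every affected column character
toy[-1] <- res
sapply(toy, class)
```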

If we use lapply instead,

df[-1] <- lapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

the original classes are preserved and there is no need to convert them back to a numeric class. Similarly, we can readily convert X1 through X3 to factors with

df[, factorfilt] <- lapply(df[, factorfilt], as.factor)
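Both lapply() calls can be checked on a toy data frame. The sketch below uses hypothetical data and a fixed mask instead of runif(). It also shows one possible way to answer the levels question explicitly by passing a shared levels vector to factor(); that part is an illustration and not something the original answer prescribes.

```r
# Hypothetical toy data and the same kind of filter as in the question
toy <- data.frame(X1 = c("A", "B", "A"), X2 = c("B", "C", "C"), X3 = 1:3,
                  stringsAsFactors = FALSE)
factornames <- c("X1", "X2")
factorfilt <- names(toy) %in% factornames

# Column-wise NA injection with a fixed mask (assumption, for reproducibility);
# lapply() returns a list, so each column keeps its own type
mask <- c(FALSE, TRUE, FALSE)
toy[-1] <- lapply(toy[-1], function(x) ifelse(mask, NA, x))
sapply(toy, class)

# Option 1: each column gets only the levels present in that column
own <- toy
own[, factorfilt] <- lapply(own[, factorfilt], as.factor)
levels(own$X1)
levels(own$X2)

# Option 2: every column shares one pooled set of levels
all_levels <- sort(unique(unlist(toy[, factorfilt])))
shared <- toy
shared[, factorfilt] <- lapply(shared[, factorfilt], factor, levels = all_levels)
levels(shared$X1)
levels(shared$X2)
```

Because each column is converted on its own, the choice between per-column and shared levels is made explicitly in the call rather than left for R to resolve.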

As a general rule, it is better to manipulate the columns of a data frame as distinct columns. Once you begin assigning a single vector over multiple columns, you enter a dark world of mischief.
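As one way to follow that rule without hard-coding column names, the columns can be selected by type and converted one at a time. This is a sketch on hypothetical data (toy, id, grp, site, and is_chr are illustrative names), not part of the original answer.

```r
# Hypothetical toy data with one integer and two character columns
toy <- data.frame(id = 1:3, grp = c("a", "b", "a"), site = c("x", "x", "y"),
                  stringsAsFactors = FALSE)

# vapply() returns one logical per column, so the selection is type-driven
is_chr <- vapply(toy, is.character, logical(1))

# Convert each selected column separately; other columns are untouched
toy[is_chr] <- lapply(toy[is_chr], factor)
str(toy)
```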
