Apply function to each column in a data frame observing each column's existing data type


Question

I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:

apply(t,2,max,na.rm=1)

It treats everything as a character vector, because the first few columns are character types. So the max of some of the numeric columns comes out as " -99.5".

Then I tried this:

sapply(t,max,na.rm=1)

but it complains that max is not meaningful for factors. (lapply behaves the same.) What confuses me is that apply thought max was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.

BTW, I took a look at "Using sapply on vector of POSIXct", and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.

Recommended answer

If it were an "ordered factor", things would be different. That is not to say I like ordered factors (I don't), only that some relations are defined for ordered factors that are not defined for plain factors. Factors are treated as ordinary categorical variables. What you are seeing is the natural sort order of the factor labels, which is alphabetical (lexical) order for your locale. If you want automatic coercion to "numeric" for every column, dates and factors and all, then try:

sapply(df, function(x) max(as.numeric(x)) )   # not generally a useful result
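The reason that result is rarely useful is that as.numeric() on a factor returns the internal integer level codes rather than the labels. A minimal sketch, using a made-up factor `f` that is not part of the question's data:

```r
# Hypothetical factor whose labels happen to look like numbers
f <- factor(c("10", "200", "3"))

# Levels are sorted as strings: "10", "200", "3"
levels(f)

# as.numeric() returns the level codes (1, 2, 3), not the labels
max(as.numeric(f))                  # 3, the code of the last level

# Going through as.character() recovers the labelled values first
max(as.numeric(as.character(f)))    # 200
```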

Or, if you want to test for factors first and return what you expect:

sapply(df, function(x) {
  if (is.factor(x)) max(as.numeric(as.character(x)))
  else max(x)
})
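Two caveats are worth illustrating. as.numeric(as.character(x)) only helps when the factor labels are numbers (labels such as "ZEBRA" become NA with a warning), and sapply() simplifies its results into one vector, so mixed numeric and character maxima would be turned back into character. The sketch below is one possible way around both, not the answerer's method: the data frame `df` and the helper `col_max` are hypothetical names made up for illustration.

```r
# Hypothetical example data; stringsAsFactors = TRUE makes `name` a factor
df <- data.frame(
  name  = c("APPLE", "ZEBRA", "MANGO"),
  value = c(-99.5, 1000.5, NA),
  stringsAsFactors = TRUE
)

# Hypothetical helper: compare factors by label, everything else as-is
col_max <- function(x) {
  if (is.factor(x)) {
    max(as.character(x), na.rm = TRUE)
  } else {
    max(x, na.rm = TRUE)
  }
}

# lapply() returns a list, so each result keeps its own type
res <- lapply(df, col_max)
str(res)
```

Using lapply() instead of sapply() is the design choice that matters here: the list keeps the numeric maximum numeric and the label maximum character.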

@Darren's comment does work better:

 sapply(df, function(x) max(as.character(x)) )  

max does succeed with character vectors.
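This also accounts for the apply() behaviour described in the question. apply() first converts the data frame with as.matrix(), and a matrix can hold only one type, so a single character or factor column turns every column into character. A small sketch with made-up data (the data frame `d` is hypothetical):

```r
# Hypothetical data frame mixing a factor and a numeric column
d <- data.frame(
  name  = factor(c("APPLE", "ZEBRA")),
  value = c(-99.5, 1000.5)
)

# apply() works on as.matrix(d); one non-numeric column makes it all character
m <- as.matrix(d)
typeof(m)      # "character"
m[, "value"]   # numbers formatted as padded strings, e.g. " -99.5"

# max() now compares strings, so the factor column "works" under apply()
apply(d, 2, max)
```

sapply() and lapply(), by contrast, pass each column to the function unchanged, which is why max() sees a real factor there and refuses.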
