sum(.) on a factor column returns incorrect result


Problem description


I am in a strange fix here. I am using data.table for a very routine task, but there is something that I am not able to explain. I have figured out a way around the problem, but I think it is still important for me to understand what is going wrong here.

This code will bring the data into workspace:

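library(XML)
library(data.table)
theurl <- "http://goo.gl/hOKW3a"
# Parse every HTML table on the page into a list of data frames
tables <- readHTMLTable(theurl)
# Keep columns 4 and 5 of the second table and drop its first two rows
new.Res <- data.table(tables[[2]][4:5][-(1:2),])
suppressWarnings(names(new.Res) <- c("Party","Cases"))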

There are two columns here, Party and Cases. Both have the default class of factor, although Cases should be numeric. Ultimately, I just want the sum of Cases for each Party. So something like this should work:

new.Res[,sum(Cases), by=Party]

But this doesn't give the right answer. I thought it would work if I changed the class of Cases from factor to numeric. So I tried the following:

new.Res[,Cases := as.numeric(Cases)]
new.Res[,sum(Cases), by=Party]

But I got the same incorrect answer. I realized that the problem lies in how the class of Cases is changed from factor to numeric. So I tried a different method, and it worked:

Step 1: Reinitialize the data:

theurl <- "http://goo.gl/hOKW3a"
tables <- readHTMLTable(theurl)
new.Res <- data.table(tables[[2]][4:5][-(1:2),])
suppressWarnings(names(new.Res) <- c("Party","Cases"))

Step 2: Use a different method to change the class from factor to numeric:

new.Res[,Cases := strtoi(Cases)]
new.Res[,sum(Cases), by=Party]

This works fine! However, I am not sure what's wrong with the first two methods. What am I missing?

Solution

The correct way to convert from factor to numeric or integer is to go through character. This is because internally, a factor is an integer index (that refers to a levels vector). When you tell R to convert it to numeric it will simply convert the underlying index, not try to convert the level label.
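
To make the index-versus-label distinction concrete, here is a minimal base R sketch. The `cases` vector is made-up data standing in for the scraped Cases column, not the real values from the page.

```r
# Hypothetical values standing in for the scraped Cases column
cases <- factor(c("120", "7", "45", "7"))

levels(cases)      # level labels, sorted as strings: "120" "45" "7"
as.integer(cases)  # underlying integer codes: 1 3 2 3
as.numeric(cases)  # the same codes as doubles, not the labels

# Summing the codes gives a plausible-looking but wrong total
code_sum <- sum(as.numeric(cases))                 # 9

# Going through character converts the labels themselves
label_sum <- sum(as.numeric(as.character(cases)))  # 179
```

Because the codes are small positive integers, the wrong total does not raise an error or a warning, which is why the mistake is easy to miss.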

Short answer: do Cases:=as.numeric(as.character(Cases)).

Edit: Alternatively the ?factor help page suggests as.numeric(levels(Cases))[Cases] as more efficient. h/t @Gsee in the comments.
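
As a sketch of both conversions inside data.table, the example below uses a small made-up table (the names `dt_a`, `dt_b` and the values are illustrative assumptions, not the scraped data). It also checks `strtoi()` on a factor: as far as I know, `strtoi()` coerces its argument with `as.character()` first, which would explain why the workaround in the question succeeded.

```r
library(data.table)

# Hypothetical stand-in for new.Res: both columns start out as factors
make_dt <- function() {
  data.table(
    Party = factor(c("A", "B", "A", "B")),
    Cases = factor(c("120", "7", "45", "7"))
  )
}

# Option 1: go through character
dt_a <- make_dt()
dt_a[, Cases := as.numeric(as.character(Cases))]

# Option 2: convert the levels once, then index them by the factor codes
dt_b <- make_dt()
dt_b[, Cases := as.numeric(levels(Cases))[Cases]]

# Both options yield the label values, so the grouped sums agree
totals <- dt_a[, sum(Cases), by = Party]   # A: 165, B: 14

# strtoi() on a factor also returns the labels as integers
via_strtoi <- strtoi(make_dt()$Cases)
```

Option 2 converts each distinct level only once, which is why it can be faster on long columns with few distinct values.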
