Converting Character to Numeric without NA Coercion in R


Question


I'm working in R and have a dataframe, dd_2006, with numeric vectors. When I first imported the data, I needed to remove $'s, decimal points, and some blank spaces from 3 of my variables: SumOfCost, SumOfCases, and SumOfUnits. To do that, I used str_replace_all. However, once I used str_replace_all, the vectors were converted to characters. So I used as.numeric(var) to convert the vectors to numeric, but NAs were introduced, even though when I ran the code below BEFORE I ran the as.numeric code, there were no NAs in the vectors.

sum(is.na(dd_2006$SumOfCost))
[1] 0
sum(is.na(dd_2006$SumOfCases))
[1] 0
sum(is.na(dd_2006$SumOfUnits))
[1] 0


Here is my code from after the import, beginning with removing the $ from the vector. In the str(dd_2006) output, I deleted some of the variables for the sake of space, so the column #s in the str_replace_all code below don't match the output I've posted here (but they do in the original code):

library("stringr")
dd_2006$SumOfCost <- str_sub(dd_2006$SumOfCost, 2, ) #2=the first # after the $

#Removes decimal pt, zero's after, and commas
dd_2006[ ,9] <- str_replace_all(dd_2006[ ,9], ".00", "")
dd_2006[,9] <- str_replace_all(dd_2006[,9], ",", "")

dd_2006[ ,10] <- str_replace_all(dd_2006[ ,10], ".00", "")
dd_2006[ ,10] <- str_replace_all(dd_2006[,10], ",", "")

dd_2006[ ,11] <- str_replace_all(dd_2006[ ,11], ".00", "")
dd_2006[,11] <- str_replace_all(dd_2006[,11], ",", "")

str(dd_2006)
'data.frame':   12604 obs. of  14 variables:
 $ CMHSP                     : Factor w/ 46 levels "Allegan","AuSable Valley",..: 1 1 1
 $ FY                        : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1 ...
 $ Population                : Factor w/ 1 level "DD": 1 1 1 1 1 1 1 1 1 1 ...
 $ SumOfCases                : chr  "0" "1" "0" "0" ...
 $ SumOfUnits                : chr  "0" "365" "0" "0" ...
 $ SumOfCost                 : chr  "0" "96416" "0" "0" ...

I found a response to a similar question to mine here (http://stackoverflow.com/questions/2288485/how-to-convert-a-data-frame-column-to-numeric-type), using the following code:

# create dummy data.frame
d <- data.frame(char = letters[1:5], 
                fake_char = as.character(1:5), 
                fac = factor(1:5), 
                char_fac = factor(letters[1:5]), 
                num = 1:5, stringsAsFactors = FALSE)


Let us have a glance at data.frame

> d
  char fake_char fac char_fac num
1    a         1   1        a   1
2    b         2   2        b   2
3    c         3   3        c   3
4    d         4   4        d   4
5    e         5   5        e   5

and let us run:

> sapply(d, mode)
       char   fake_char         fac    char_fac         num 
"character" "character"   "numeric"   "numeric"   "numeric" 
> sapply(d, class)
       char   fake_char         fac    char_fac         num 
"character" "character"    "factor"    "factor"   "integer" 


Now you probably ask yourself "Where's an anomaly?" Well, I've bumped into quite peculiar things in R, and this is not the most confounding thing, but it can confuse you, especially if you read this before rolling into bed.


Here goes: first two columns are character. I've deliberately called 2nd one fake_char. Spot the similarity of this character variable with one that Dirk created in his reply. It's actually a numerical vector converted to character. 3rd and 4th column are factor, and the last one is "purely" numeric.


If you utilize transform function, you can convert the fake_char into numeric, but not the char variable itself.

> transform(d, char = as.numeric(char))
  char fake_char fac char_fac num
1   NA         1   1        a   1
2   NA         2   2        b   2
3   NA         3   3        c   3
4   NA         4   4        d   4
5   NA         5   5        e   5
Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion

but if you do the same thing on fake_char and char_fac, you'll be lucky, and get away with no NA's:

> transform(d, fake_char = as.numeric(fake_char), char_fac = as.numeric(char_fac))
  char fake_char fac char_fac num
1    a         1   1        1   1
2    b         2   2        2   2
3    c         3   3        3   3
4    d         4   4        4   4
5    e         5   5        5   5
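One caveat about the char_fac result above: calling as.numeric directly on a factor returns the internal integer codes, not the labels, so "getting away with no NAs" does not mean the values are the ones you expect. A minimal sketch, using a made-up factor f that is not part of the original post:

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "30"))

# Direct conversion returns the integer level codes
codes <- as.numeric(f)

# Converting through character recovers the labels as numbers
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

Here codes holds 1, 2, 3 while values holds 10, 20, 30, so going through as.character is usually the safer route for factor columns.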


So I tried the above code in my script, but still came up with NAs (without a warning message about coercion).

#changing sumofcases, cost, and units to numeric
dd_2006_1 <- transform(dd_2006, SumOfCases = as.numeric(SumOfCases), SumOfUnits = as.numeric(SumOfUnits), SumOfCost = as.numeric(SumOfCost))

> sum(is.na(dd_2006_1$SumOfCost))
[1] 12
> sum(is.na(dd_2006_1$SumOfCases))
[1] 7
> sum(is.na(dd_2006_1$SumOfUnits))
[1] 11




I've also used table(dd_2006$SumOfCases) etc. to look at the observations to see if there are any characters that I missed in the observations, but there weren't any. Any thoughts on why the NAs are popping up, and how to get rid of them?

Recommended answer


As Anando pointed out, the problem is somewhere in your data, and we can't really help you much without a reproducible example. That said, here's a code snippet to help you pin down the records in your data that are causing you problems:

test = as.character(c(1,2,3,4,'M'))
v = as.numeric(test) # NAs introduced by coercion
ix.na = is.na(v)
which(ix.na) # row index of our problem = 5
test[ix.na]  # shows the problematic record, "M"


Instead of guessing as to why NAs are being introduced, pull out the records that are causing the problem and address them directly/individually until the NAs go away.
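The same idea can be applied to a data frame column so that the offending rows are shown with their original strings. This is only a sketch: the data frame df and its values are invented stand-ins, since the real dd_2006 data is not available.

```r
# Hypothetical stand-in for one cleaned-up column of dd_2006
df <- data.frame(id = 1:5,
                 SumOfCases = c("0", "12", "", "7", "3a"),
                 stringsAsFactors = FALSE)

# Convert, silencing the coercion warning because we inspect the NAs ourselves
num <- suppressWarnings(as.numeric(df$SumOfCases))

# Rows that were not NA as strings but became NA as numbers
bad_rows <- which(is.na(num) & !is.na(df$SumOfCases))

# Show the problem rows; quoted printing makes empty strings visible
print(df[bad_rows, ])
print(df$SumOfCases[bad_rows], quote = TRUE)
```

In this made-up data the flagged rows are 3 (an empty string) and 5 ("3a"). Empty strings are easy to overlook in table() output, which is why printing with quotes can help.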


UPDATE: Looks like the problem is in your call to str_replace_all. I don't know the stringr library, but I think you can accomplish the same thing with gsub like this:

v2 = c("1.00","2.00","3.00")
gsub("\\.00", "", v2)

[1] "1" "2" "3"


I'm not entirely sure what this accomplishes though:

sum(as.numeric(v2)!=as.numeric(gsub("\\.00", "", v2))) # Illustrate that vectors are equivalent.

[1] 0


Unless this achieves some specific purpose for you, I'd suggest dropping this step from your preprocessing entirely, as it doesn't appear necessary and seems to be giving you problems.
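A likely reason the pattern ".00" from the question causes trouble is that it is treated as a regular expression, where "." matches any character, so strings such as "100" match in full and are replaced by an empty string. As far as I know, as.numeric turns an empty string into NA without a coercion warning, which would be consistent with the silent NAs described in the question. The sketch below uses base R gsub, which interprets the pattern the same way here, and invented values in cost:

```r
# Hypothetical cost strings after the leading "$" has been stripped
cost <- c("100", "1,500.00", "2000", "96,416.00")

# Unescaped pattern: "." matches any character, so "100" and "500" match too
bad <- gsub(".00", "", cost)
bad_num <- as.numeric(gsub(",", "", bad, fixed = TRUE))

# Escaped and anchored pattern: only a literal ".00" at the end is removed
good <- gsub("\\.00$", "", cost)
good_num <- as.numeric(gsub(",", "", good, fixed = TRUE))

print(bad_num)
print(good_num)
```

With these values bad_num is NA, 1, 0, 96416, while good_num is 100, 1500, 2000, 96416. Besides the NA, the unescaped pattern silently corrupts "1,500.00" and "2000", so counting NAs alone would understate the damage.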
