Filter causes data missing in R


Problem description


I want to use filter in R to keep all rows with the selected country codes. The data, which covers consecutive years from 1950 to 2014, looks like this:

  countrycode       country currency_unit year   rgdpe   rgdpo      pop      emp      avh
1         USA United States     US Dollar 1950 2279787 2274197 155.5635 62.83500 1983.738
2         USA United States     US Dollar 1951 2440076 2443820 158.2269 65.08094 2024.002
3         USA United States     US Dollar 1952 2530524 2526412 160.9597 65.85582 2020.183
4         USA United States     US Dollar 1953 2655277 2642977 163.6476 66.78711 2014.500
5         USA United States     US Dollar 1954 2640868 2633803 166.5511 65.59514 1991.019
6         USA United States     US Dollar 1955 2844098 2834914 169.5189 67.53133 1997.761

My code is:

dat_10 <- filter(data_all_country,countrycode == c("USA","CHN","GBR","IND","JPN","BRA","ZAF","FRA","DEU","ARG"))

Surprisingly, dat_10 looks like this:

  countrycode   country  currency_unit year     rgdpe     rgdpo      pop       emp
1         ARG Argentina Argentine Peso 1954  51117.46  51031.80 18.58889  6.970472
2         ARG Argentina Argentine Peso 1964  69836.62  68879.08 21.95909  7.962999
3         ARG Argentina Argentine Peso 1974 113038.73 110358.46 25.64450  9.135211
4         ARG Argentina Argentine Peso 1984 148994.61 149928.59 29.92091 10.345933
5         ARG Argentina Argentine Peso 1994 379470.19 372903.00 34.55811 12.075872
6         ARG Argentina Argentine Peso 2004 517308.94 499958.94 38.72878 14.669195

Even though the time series is complete, only one row every 10 years is kept, and 10 is exactly the number of country codes I passed in the filter condition.

Why does this happen, and how can I fix it?

Solution

Why should we use %in% rather than ==?

Let's look at the difference between == and %in% in more detail.

Suppose we have a vector like this.

sample_vec <- c("USA", "CHN", "GBR", "IND", "JPN", "BRA", "USA", "CHN", "GBR")

We want to flag every USA, CHN, and GBR in the vector. The desired output is the following logical vector, which is useful for subsetting or filtering.

#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

If we use == with c("USA", "CHN", "GBR"), we get the following.

sample_vec == c("USA", "CHN", "GBR")
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Looks good? Wait, it is not doing what we think it is.

Let's test this code after appending one more country code to the original vector.

# Add one more country
sample_vec2 <- c(sample_vec, "IND")
sample_vec2 ==  c("USA", "CHN", "GBR")
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE

Warning message:
In sample_vec2 == c("USA", "CHN", "GBR") :
  longer object length is not a multiple of shorter object length

The result may look right, but pay attention to the warning message. It turns out that when == compares two vectors of different lengths, R recycles the shorter vector to the length of the longer one. The code above is effectively doing the following, with each pair of elements compared separately.

Position  1     2     3     4     5     6     7     8     9    10 
Vector1 "USA" "CHN" "GBR" "IND" "JPN" "BRA" "USA" "CHN" "GBR" "IND" 
Vector2 "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA"
Result   TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE

R compares the strings from Vector1 and Vector2 at Position 1 and returns TRUE if they are the same and FALSE otherwise, then moves on to Position 2, and so on. The length of sample_vec2 is 10, while the length of the target vector is only 3, so R has to recycle the elements of the target vector to perform the one-to-one comparison. The warning appears because 10 is not a multiple of 3; with the original sample_vec (length 9), the same recycling happened silently.

Once we realize that R recycles and compares element by element when we use ==, it is clear that == is not suitable for checking whether each element of a vector belongs to a set of values. Consider the following example.
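The recycling step can be made explicit with rep_len(), a base R function. The following is a minimal sketch using the vectors from this answer; the names target, recycled, explicit, and implicit are introduced here only for illustration.

```r
sample_vec <- c("USA", "CHN", "GBR", "IND", "JPN", "BRA", "USA", "CHN", "GBR")
sample_vec2 <- c(sample_vec, "IND")
target <- c("USA", "CHN", "GBR")  # hypothetical name for the shorter vector

# Stretch the shorter vector to the length of the longer one,
# which mirrors what == does implicitly
recycled <- rep_len(target, length(sample_vec2))
recycled
#[1] "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA"

# Element-by-element comparison against the recycled vector (no warning)
explicit <- sample_vec2 == recycled

# The implicit comparison gives the same values; the warning is suppressed here
implicit <- suppressWarnings(sample_vec2 == target)
identical(explicit, implicit)
#[1] TRUE
```

Writing the recycled vector out this way shows that each result depends on position, not on whether the country code belongs to the target set.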

sample_vec == c("CHN", "GBR", "USA")
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The code is almost the same as sample_vec == c("USA", "CHN", "GBR"), except that I changed the order of the country codes. But it returns all FALSE! This is because, after recycling, the one-to-one comparison finds no position where the two vectors agree. This is probably not the result we want.

However, consider the following code.

sample_vec %in% c("CHN", "GBR", "USA")
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

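It returns the expected result. This is because %in% is an interface to the match function in R: for each element of the left-hand vector, it returns TRUE if that element appears anywhere in the right-hand vector and FALSE otherwise, regardless of position or order. For the original question, replacing == with %in% in the filter call keeps every row whose countrycode is one of the selected codes.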
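To connect this back to the question, here is a minimal sketch on a small made-up data frame. The names toy, codes, bad, good, and via_match are hypothetical, and the values are not taken from the question's data. It uses base R subsetting so that it runs without extra packages; with dplyr loaded, filter(toy, countrycode %in% codes) should select the same rows as good.

```r
# Toy data: two countries, six consecutive years each (made up for illustration)
toy <- data.frame(
  countrycode = rep(c("ARG", "USA"), each = 6),
  year = rep(1950:1955, times = 2),
  stringsAsFactors = FALSE
)
codes <- c("USA", "CHN", "ARG")

# == recycles codes down the column, so a row survives only when its
# position happens to line up with its own code: one row in every three
bad <- toy[toy$countrycode == codes, ]
bad$year
#[1] 1952 1955 1950 1953

# %in% tests membership for each row, so every ARG and USA row is kept
good <- toy[toy$countrycode %in% codes, ]
nrow(good)
#[1] 12

# %in% gives the same answer as calling match() directly
via_match <- match(toy$countrycode, codes, nomatch = 0L) > 0L
identical(via_match, toy$countrycode %in% codes)
#[1] TRUE
```

With three codes, == keeps every third row, and no warning is raised because 12 is a multiple of 3. With the ten codes in the question, the same mechanism would keep one row in every ten, which matches the 10-year gaps seen in dat_10.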
