Filter causes data missing in R


Question

Hi, I want to use filter in R to keep all the rows with selected country codes. The data, which has consecutive years from 1950 to 2014, looks like this:

  countrycode       country currency_unit year   rgdpe   rgdpo      pop      emp      avh
1         USA United States     US Dollar 1950 2279787 2274197 155.5635 62.83500 1983.738
2         USA United States     US Dollar 1951 2440076 2443820 158.2269 65.08094 2024.002
3         USA United States     US Dollar 1952 2530524 2526412 160.9597 65.85582 2020.183
4         USA United States     US Dollar 1953 2655277 2642977 163.6476 66.78711 2014.500
5         USA United States     US Dollar 1954 2640868 2633803 166.5511 65.59514 1991.019
6         USA United States     US Dollar 1955 2844098 2834914 169.5189 67.53133 1997.761

My code is:

dat_10 <- filter(data_all_country,countrycode == c("USA","CHN","GBR","IND","JPN","BRA","ZAF","FRA","DEU","ARG"))

Surprisingly, dat_10 looks like this:

  countrycode   country  currency_unit year     rgdpe     rgdpo      pop       emp
1         ARG Argentina Argentine Peso 1954  51117.46  51031.80 18.58889  6.970472
2         ARG Argentina Argentine Peso 1964  69836.62  68879.08 21.95909  7.962999
3         ARG Argentina Argentine Peso 1974 113038.73 110358.46 25.64450  9.135211
4         ARG Argentina Argentine Peso 1984 148994.61 149928.59 29.92091 10.345933
5         ARG Argentina Argentine Peso 1994 379470.19 372903.00 34.55811 12.075872
6         ARG Argentina Argentine Peso 2004 517308.94 499958.94 38.72878 14.669195

Only one row in every 10 years survives, even though the time series is complete, and 10 is exactly the number of countries I selected in the logical condition.

How does this happen, and is there any way to fix it?

Recommended answer

Why use %in% and not ==?

Let's look at the difference between == and %in% in more detail.

Assume that we have a vector that looks like this.

sample_vec <- c("USA", "CHN", "GBR", "IND", "JPN", "BRA", "USA", "CHN", "GBR")

And we want to flag every USA, CHN, and GBR in the vector. The desired output is like this, which would be useful for subsetting or filtering.

#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

If we use == with c("USA", "CHN", "GBR"), we get the following.

sample_vec == c("USA", "CHN", "GBR")
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Looks good? Wait, it is not doing what we think.

Let's test this code after adding one more country code to the original vector.

# Add one more country
sample_vec2 <- c(sample_vec, "IND")
sample_vec2 ==  c("USA", "CHN", "GBR")
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE




Warning message:
In sample_vec2 == c("USA", "CHN", "GBR") :
  longer object length is not a multiple of shorter object length

The result may look fine, but pay attention to the warning message. When == compares two vectors of different lengths, R recycles the shorter vector to the length of the longer one. The code above is effectively doing the following, with each pair of strings compared separately.

Position  1     2     3     4     5     6     7     8     9    10 
Vector1 "USA" "CHN" "GBR" "IND" "JPN" "BRA" "USA" "CHN" "GBR" "IND" 
Vector2 "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA"
Result   TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE

At Position 1, R checks whether the strings from Vector1 and Vector2 are the same. If they are, it returns TRUE, otherwise FALSE, and then it moves on to Position 2, and so on. The length of sample_vec2 is 10, while the length of the target vector is only 3, so R has to recycle the elements of the target vector to perform the one-to-one comparison. The warning appears because 10 is not a multiple of 3, so the last recycling pass is cut short.
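The recycling described above can be written out by hand with base R's rep_len(), which repeats a vector until it reaches a given length. This is only a small sketch that reuses the vectors from this answer; the names target and recycled are illustrative and do not appear in the original post.

```r
sample_vec2 <- c("USA", "CHN", "GBR", "IND", "JPN", "BRA", "USA", "CHN", "GBR", "IND")
target <- c("USA", "CHN", "GBR")

# Repeat the shorter vector until it is as long as the longer one.
# This is the "Vector2" row of the table above.
recycled <- rep_len(target, length(sample_vec2))
recycled
# "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA"

# Both vectors now have length 10, so == compares position by position
# without any warning and gives the "Result" row of the table.
sample_vec2 == recycled
# TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
```

Seen this way, == never asks "is this element one of the targets?"; it only asks "does this element equal whatever the recycled target holds at the same position?".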

Once we realize that R recycles and compares one-to-one when we use ==, it is clear that == is not suitable for filtering the elements of a vector against a set of values. Let's see the following example.

sample_vec == c("CHN", "GBR", "USA")
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The code is almost the same as sample_vec == c("USA", "CHN", "GBR"), except that I changed the order of the country codes. But it returns all FALSE! This is because, after recycling, the one-to-one comparison finds no position where the two strings are the same. This is probably not the result we want.

However, if we use the following code:

sample_vec %in% c("CHN", "GBR", "USA")
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

It returns the expected result. This is because %in% is an interface to the match function in R. For each element of the left-hand vector, it returns TRUE if that element appears anywhere in the right-hand vector and FALSE otherwise.
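To connect this back to the question, here is a minimal base R sketch on a made-up data frame. The object toy and its three countries are hypothetical stand-ins for data_all_country, and keep is an illustrative name; the real data is not reproduced here. It shows that %in% agrees with a direct call to match(), and that == silently drops rows once the comparison vector is recycled down the rows.

```r
# Hypothetical stand-in for data_all_country: 3 countries, 4 years each
toy <- data.frame(
  countrycode = rep(c("ARG", "USA", "ZWE"), each = 4),
  year        = rep(1950:1953, times = 3),
  stringsAsFactors = FALSE
)
keep <- c("USA", "ARG")

# %in% tests membership row by row, regardless of the order of `keep`
in_rows <- toy$countrycode %in% keep

# The same test via match(): a non-zero position means the code was found
match_rows <- match(toy$countrycode, keep, nomatch = 0L) > 0L
identical(in_rows, match_rows)  # TRUE

# == recycles `keep` down the 12 rows (USA, ARG, USA, ARG, ...), so a row
# is kept only if its code happens to line up with the recycled value
eq_rows <- toy$countrycode == keep

sum(in_rows)  # 8: every ARG and USA row
sum(eq_rows)  # 4: only every second ARG and USA row

toy[in_rows, ]
```

Note that 12 is a multiple of 2, so the == version here does not even raise a warning. Under the same reasoning, the call in the question would presumably become filter(data_all_country, countrycode %in% c("USA","CHN","GBR","IND","JPN","BRA","ZAF","FRA","DEU","ARG")).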
