Identifying rows in data.frame with only NA values in R


Question

I have a data.frame with 15,000 observations of 34 ordinal variables, many of which contain NA values. I am performing clustering for a market segmentation study and need the rows that contain only NAs removed. After taking out the userID column, I got an error message telling me to omit 2099 rows with only NAs before clustering.

I found a discussion on removing rows with all NA values, but I need to identify which rows are the 2099 that have all NA values. The discussion on removing such rows is: Remove Rows with NAs in data.frame

Here's a sample of the first five observations from six variables:

> head(Store2df, n=5)
  RowNo      Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1     1     <NA>   Male            <NA>          <NA>               <NA>
2     2    45-54 Female            <NA>          <NA>               <NA>
3     3     <NA>   <NA>            <NA>          <NA>               <NA>
4     4     <NA>   <NA>            <NA>          <NA>               <NA>
5     5    45-54 Female        75k-100k       Married                Yes
#Making a vector
> Vector1 <- Store2df$RowNo 
#Taking out RowNo column
> Store2df$RowNo <- NULL

EDIT: I put the results in an object, but found that the code made an extra column. When I click the object in RStudio's Environment pane, an extra column called row.names appears, labeling each row with its original row name. A couple thousand rows were deleted, and the new column labels the remaining rows with their old row numbers. However, when looking at the head of the new object, I did not see the row label. Why does the row.names label show in the environment, but not when I view the head?

#Remove all rows with only NA values
> Store2df <- Store2[!!rowSums(!is.na(Store2)),]
#View head of store2df
> head(Store2df)
    Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1  <NA>   Male            <NA>          <NA>               <NA>
2 45-54 Female            <NA>          <NA>               <NA>
5 45-54 Female        75k-100k       Married                Yes
6 25-34   Male        75k-100k       Married                 No
7 35-44 Female       125k-150k       Married                Yes
8 55-64   Male        75k-100k       Married                 No

EDIT 2: I put in the row number/userID column to keep track of the users. To perform the operation for removing all-NA rows, I took out the first column. Now I need to keep track of the users I removed. I have a list of over 2000 rows that had all NA values, and I don't want to create an index by manually entering each row.

Question: How do I remove the emails that the missing data corresponded to?

> #First six rows of the column RowNo
> head(Store2df$RowNo)
[1] 1 2 3 4 5 6

I want the 2099 rows deleted from the Store2df data.frame with the RowNo column included. Here's the script identifying which rows are all empty in the Store2df data.frame without RowNo.

> which(rowSums(is.na(Store2df))==ncol(Store2df))

Showing the first 6 rows: row numbers 3 and 4 have been deleted.

> head(Store2df$RowNo)
[1] 1 2 5 6 7 8

I want to accomplish 4 steps:

1) Take out the RowNo column in the Store2df data.frame and save it as a separate vector

2) Delete rows with all NA values in the Store2df data.frame

3) Delete the same rows in the Store2new1 vector as in the Store2df data.frame

4) Combine the vector with the data.frame so that the rows match

Recommended answer

 which(rowSums(is.na(Store2))==ncol(Store2))
 #3 4 
 #3 4 

 which(Reduce(`&`,as.data.frame(is.na(Store2))))
 #[1] 3 4

 which(!rowSums(!is.na(Store2)))  
 #3 4 
 #3 4 

Data

 Store2 <- structure(list(Age = c(NA, "45-54", NA, NA, "45-54"), Gender = c("Male", 
 "Female", NA, NA, "Female"), HouseholdIncome = c(NA, NA, NA, 
  NA, "75k-100k"), MaritalStatus = c(NA, NA, NA, NA, "Married"), 
PresenceofChildren = c(NA, NA, NA, NA, "Yes"), HomeOwnerStatus = c(NA, 
NA, NA, NA, "Own"), HomeMarketValue = c(NA, NA, NA, NA, "150k-200k"
)), .Names = c("Age", "Gender", "HouseholdIncome", "MaritalStatus", 
"PresenceofChildren", "HomeOwnerStatus", "HomeMarketValue"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))
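
As a rough sketch, the first identification expression above can be wrapped in a small helper. The name `all_na_rows` is hypothetical (it is not part of the original answer), and the data frame here is a trimmed, three-column version of the sample data, used only to keep the example self-contained.

```r
# Trimmed version of the sample data (three columns only, for brevity)
Store2 <- data.frame(
  Age             = c(NA, "45-54", NA, NA, "45-54"),
  Gender          = c("Male", "Female", NA, NA, "Female"),
  HouseholdIncome = c(NA, NA, NA, NA, "75k-100k"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: positions of the rows in which every column is NA.
# unname() drops the row-name labels that which() would otherwise print.
all_na_rows <- function(df) {
  unname(which(rowSums(is.na(df)) == ncol(df)))
}

all_na_rows(Store2)
#[1] 3 4
```

Dropping the names only changes how the result prints; the positions are the same as in the expressions above.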

Update

To remove the rows that are all NAs

  Store2[!!rowSums(!is.na(Store2)),]
  #   Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus
  #1  <NA>   Male            <NA>          <NA>               <NA>            <NA>
  #2 45-54 Female            <NA>          <NA>               <NA>            <NA>
  #5 45-54 Female        75k-100k       Married                Yes             Own
   #HomeMarketValue
  #1            <NA>
  #2            <NA>
  #5       150k-200k

    • is.na(Store2) gives a logical index of the elements that are missing (NA)
    • ! negates the logical index, i.e. TRUE becomes FALSE and vice versa
    • rowSums of the above gives the number of elements that are not NA in each row

          rowSums(!is.na(Store2))
          #   1 2 3 4 5 
          #   1 2 0 0 7  # 3rd and 4th row have `0 non NA` values
      

    • Negating the above with ! gives

          !rowSums(!is.na(Store2))
          # 1     2     3     4     5 
          #FALSE FALSE  TRUE  TRUE FALSE 
      

    • We want to drop the rows that are all NAs, i.e. those with 0 non-NA values, so we apply ! again

          !!rowSums(!is.na(Store2))
          #1     2     3     4     5 
          #TRUE  TRUE FALSE FALSE  TRUE 
      

    • Subset using the above logical index
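
The subsetting step can be sketched as follows. This is only an illustration on a trimmed version of the sample data; the name `keep` and the `> 0` comparison are assumptions introduced here as a more explicit spelling of `!!rowSums(...)`, not the answer's own wording.

```r
# Trimmed version of the sample data (two columns only, for brevity)
Store2 <- data.frame(
  Age    = c(NA, "45-54", NA, NA, "45-54"),
  Gender = c("Male", "Female", NA, NA, "Female"),
  stringsAsFactors = FALSE
)

# TRUE for rows with at least one non-NA value;
# same result as !!rowSums(!is.na(Store2))
keep <- rowSums(!is.na(Store2)) > 0

# Keep only those rows; the original row names (1, 2, 5) are preserved
Store2df <- Store2[keep, ]
Store2df
```

Because subsetting keeps the original row names, the surviving rows are still labeled with their old positions.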

      If you have two RowNo vectors, i.e. the one you stored separately before deleting the NA rows and a second one from after you deleted them, you can match them with %in%:

         RowNo1 <- 1:6
         RowNo2 <- c(1,2,5,6)
         RowNo1 %in% RowNo2
         #[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE
         RowNo1[RowNo1 %in% RowNo2]
         #[1] 1 2 5 6
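
To list the row numbers that were removed rather than the ones that were kept, the same comparison can be inverted. This is a small sketch using the two example vectors above; the names `removed` and `removed_alt` are hypothetical.

```r
RowNo1 <- 1:6             # row numbers stored before deleting the NA rows
RowNo2 <- c(1, 2, 5, 6)   # row numbers remaining after the deletion

# Row numbers present before but missing afterwards
removed <- setdiff(RowNo1, RowNo2)
removed
#[1] 3 4

# The same result with a negated %in%
removed_alt <- RowNo1[!RowNo1 %in% RowNo2]
```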
      

      Update3

      With your new requests, let me try it again:

          Store2 <- structure(list(RowNo = 1:5, Age = c(NA, "45-54", NA, NA, "45-54"
          ), Gender = c("Male", "Female", NA, NA, "Female"), HouseholdIncome = c(NA, 
          NA, NA, NA, "75k-100k"), MaritalStatus = c(NA, NA, NA, NA, "Married"
         ), PresenceofChildren = c(NA, NA, NA, NA, "Yes")), .Names = c("RowNo", 
         "Age", "Gender", "HouseholdIncome", "MaritalStatus", "PresenceofChildren"
         ), class = "data.frame", row.names = c("1", "2", "3", "4", "5"
         ))
      

      First step

      Save RowNo as a separate vector (I am not sure why you need this)

        Store2new1 <- Store2$RowNo
      

      Second step

      Delete the rows with all NA values in the Store2 data.frame (ignoring the RowNo column) and store the result as Store2df

         Store2df <- Store2[!!rowSums(!is.na(Store2[,-1])),] #Here you already get the new dataset with `RowNo` column
      
         Store2df
         #RowNo   Age Gender HouseholdIncome MaritalStatus PresenceofChildren
         #1     1  <NA>   Male            <NA>          <NA>               <NA>
         #2     2 45-54 Female            <NA>          <NA>               <NA>
         #5     5 45-54 Female        75k-100k       Married                Yes
      

      Third step

      Delete the same rows in the Store2new1 vector as in the Store2df data.frame

         Store2new2 <- Store2new1[Store2new1 %in% Store2df$RowNo]
         Store2new1[Store2new1 %in% Store2df$RowNo]
         #[1] 1 2 5
      

      Fourth step

      I don't really think the third or fourth step is required unless you want to delete more rows, which is not clear from the post.
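
Putting the steps together, a minimal sketch might look like the following. It assumes the identifier column is the first column and is named RowNo, as in the sample; the names `all_na` and `removed_ids` are hypothetical. Resetting the row names at the end is optional and is shown only because the question asks about the old row labels that remain after subsetting.

```r
# Trimmed version of the sample data, including the RowNo column
Store2 <- data.frame(
  RowNo  = 1:5,
  Age    = c(NA, "45-54", NA, NA, "45-54"),
  Gender = c("Male", "Female", NA, NA, "Female"),
  stringsAsFactors = FALSE
)

# TRUE where every column except RowNo is NA
all_na <- rowSums(!is.na(Store2[, -1, drop = FALSE])) == 0

# Identifiers of the rows about to be dropped
removed_ids <- Store2$RowNo[all_na]

# Keep the remaining rows; RowNo stays attached to each row
Store2df <- Store2[!all_na, ]

# Optional: renumber the row names 1..n instead of keeping the old labels
rownames(Store2df) <- NULL
```

Filtering with RowNo still in the data frame keeps the identifiers and the data aligned, so no separate vector has to be matched back afterwards.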
