How to locate a structured region of data inside an unstructured data frame in R?


Problem description

I have a set of data frames that each contain a subset of interest. The problem is that this subset is not consistent across the data frames. At a more abstract level, though, it always follows the same general structure: a rectangular region inside the data frame.

    example1 <- data.frame(x = c("name", "129-2", NA, NA, "acc", 2, 3, 4, NA, NA), 
           y = c(NA, NA, NA, NA, "deb", 3, 2, 5, NA, NA),
           z = c(NA, NA, NA, NA, "asset", 1, 1, 2, NA, NA))
    
    print(example1)
    
          x    y     z
    1   name <NA>  <NA>
    2  129-2 <NA>  <NA>
    3   <NA> <NA>  <NA>
    4   <NA> <NA>  <NA>
    5    acc  deb asset
    6      2    3     1
    7      3    2     1
    8      4    5     2
    9   <NA> <NA>  <NA>
    10  <NA> <NA>  <NA>
    

example1 contains a clear rectangular region of structured information:

    5    acc  deb asset
    6      2    3     1
    7      3    2     1
    8      4    5     2
    

As mentioned before, the region is not always consistent:

1. the positions of the columns are not always the same
2. the names of the variables inside the subset of interest are not always the same

Here is another example, example2:

    example2 <- data.frame(x = c("name", "129-2", "wallabe #23", NA, NA, "acc", 2, 3, 4, NA ), 
           y = c(NA, NA, NA, NA, "balance", "deb", 3, 2, 5, NA),
           z = c(NA, NA, NA, NA, NA, "asset", 1, 1, 2, NA),
           u = c(NA, NA, NA, "currency:", NA, NA, NA, NA, NA, NA),
           i = c(NA, NA, NA, "USD", "result", "win", 2, 3, 1, NA),
           o = c(NA, NA, NA, NA, NA, "lose", 2, 2, 1, NA))
    
    print(example2)
                x       y     z         u      i    o
    1         name    <NA>  <NA>      <NA>   <NA> <NA>
    2        129-2    <NA>  <NA>      <NA>   <NA> <NA>
    3  wallabe #23    <NA>  <NA>      <NA>   <NA> <NA>
    4         <NA>    <NA>  <NA> currency:    USD <NA>
    5         <NA> balance  <NA>      <NA> result <NA>
    6          acc     deb asset      <NA>    win lose
    7            2       3     1      <NA>      2    2
    8            3       2     1      <NA>      3    2
    9            4       5     2      <NA>      1    1
    10        <NA>    <NA>  <NA>      <NA>   <NA> <NA>
    

example2 contains a less clear-cut rectangular region:

    6          acc     deb asset      <NA>    win lose
    7            2       3     1      <NA>      2    2
    8            3       2     1      <NA>      3    2
    9            4       5     2      <NA>      1    1
    

Is there a method to scan such a data frame and locate this kind of region inside it?

Any ideas are appreciated.

Solution

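You might want to look for the longest run of consecutive rows that have the same number of NAs. In both examples, the rows of the table share one NA count, while the rows around it form shorter runs: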

    findTable <- function(df){
      naSeq <- rowSums(is.na(df))           # Number of NAs in each row
      myRle <- rle(naSeq)$lengths           # Lengths of runs of equal NA counts
      df[rep(myRle == max(myRle), myRle), ] # Keep the rows of the longest run
    }
    
    findTable(example1)
        x   y     z
    5 acc deb asset
    6   2   3     1
    7   3   2     1
    8   4   5     2
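
To see why rows 5 to 8 are selected for example1, it can help to look at the intermediate values. The sketch below repeats the steps of findTable outside the function. The names na_per_row, runs and keep are only illustrative and do not appear in the answer.

```r
# example1 as defined in the question
example1 <- data.frame(
  x = c("name", "129-2", NA, NA, "acc", 2, 3, 4, NA, NA),
  y = c(NA, NA, NA, NA, "deb", 3, 2, 5, NA, NA),
  z = c(NA, NA, NA, NA, "asset", 1, 1, 2, NA, NA)
)

na_per_row <- unname(rowSums(is.na(example1)))  # NA count of every row
runs <- rle(na_per_row)                         # runs of equal NA counts

na_per_row    # 2 2 3 3 0 0 0 0 3 3
runs$lengths  # 2 2 4 2
runs$values   # 2 3 0 3

# Expand the "is this the longest run?" flag back to one value per row
keep <- rep(runs$lengths == max(runs$lengths), runs$lengths)
which(keep)   # 5 6 7 8
```

The run of four rows with zero NAs is the longest one, so those rows are returned.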
    
    findTable(example2)
        x   y     z    u   i    o
    6 acc deb asset <NA> win lose
    7   2   3     1 <NA>   2    2
    8   3   2     1 <NA>   3    2
    9   4   5     2 <NA>   1    1
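
Two cases are not covered by the function above: a block of entirely empty rows could be the longest run, and two runs could tie for the maximum length, in which case both are returned. The following is a minimal sketch of one possible variant, not part of the original answer. The helper name findTableStrict is hypothetical, and it assumes that the first row of the detected block holds the column names.

```r
# example2 as defined in the question
example2 <- data.frame(
  x = c("name", "129-2", "wallabe #23", NA, NA, "acc", 2, 3, 4, NA),
  y = c(NA, NA, NA, NA, "balance", "deb", 3, 2, 5, NA),
  z = c(NA, NA, NA, NA, NA, "asset", 1, 1, 2, NA),
  u = c(NA, NA, NA, "currency:", NA, NA, NA, NA, NA, NA),
  i = c(NA, NA, NA, "USD", "result", "win", 2, 3, 1, NA),
  o = c(NA, NA, NA, NA, NA, "lose", 2, 2, 1, NA)
)

# Hypothetical helper: same idea as findTable, with extra guards
findTableStrict <- function(df) {
  runs <- rle(unname(rowSums(is.na(df))))

  # Skip runs whose rows are entirely NA
  candidate <- runs$values < ncol(df)
  if (!any(candidate)) return(df[0, , drop = FALSE])

  # On a tie, take the first of the longest candidate runs
  longest <- max(runs$lengths[candidate])
  best <- which(candidate & runs$lengths == longest)[1]
  last <- cumsum(runs$lengths)[best]
  first <- last - runs$lengths[best] + 1
  block <- df[first:last, , drop = FALSE]

  # Drop columns that are NA throughout the block (such as u in example2)
  block <- block[, colSums(!is.na(block)) > 0, drop = FALSE]

  # Assumption: the first row of the block is the header
  header <- vapply(block[1, ], as.character, character(1))
  body <- block[-1, , drop = FALSE]
  names(body) <- unname(header)
  rownames(body) <- NULL
  body
}

res <- findTableStrict(example2)
names(res)  # "acc" "deb" "asset" "win" "lose"
nrow(res)   # 3
```

The values in the result are still character strings, because every column of the original data frame mixes text and numbers. Whether to convert them afterwards depends on the data.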
    
