How to locate a structured region of data inside an unstructured data frame in R?
Problem description
I have a set of data frames that each contain a subset of interest. The problem is that this subset is not consistent across the different data frames. Nonetheless, at a more abstract level, it follows a general structure: a rectangular region inside the data frame.
example1 <- data.frame(x = c("name", "129-2", NA, NA, "acc", 2, 3, 4, NA, NA),
                       y = c(NA, NA, NA, NA, "deb", 3, 2, 5, NA, NA),
                       z = c(NA, NA, NA, NA, "asset", 1, 1, 2, NA, NA))
print(example1)
       x    y     z
1   name <NA>  <NA>
2  129-2 <NA>  <NA>
3   <NA> <NA>  <NA>
4   <NA> <NA>  <NA>
5    acc  deb asset
6      2    3     1
7      3    2     1
8      4    5     2
9   <NA> <NA>  <NA>
10  <NA> <NA>  <NA>
example1 contains a clear rectangular region with structured information:
5    acc  deb asset
6      2    3     1
7      3    2     1
8      4    5     2
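As mentioned before, the region is not always consistent:
- the position of the columns is not always the same
- the names of the variables inside the subset of interest are not always the same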
Here is another example, example2:
example2 <- data.frame(x = c("name", "129-2", "wallabe #23", NA, NA, "acc", 2, 3, 4, NA),
                       y = c(NA, NA, NA, NA, "balance", "deb", 3, 2, 5, NA),
                       z = c(NA, NA, NA, NA, NA, "asset", 1, 1, 2, NA),
                       u = c(NA, NA, NA, "currency:", NA, NA, NA, NA, NA, NA),
                       i = c(NA, NA, NA, "USD", "result", "win", 2, 3, 1, NA),
                       o = c(NA, NA, NA, NA, NA, "lose", 2, 2, 1, NA))
print(example2)
> example2
             x       y     z         u      i    o
1         name    <NA>  <NA>      <NA>   <NA> <NA>
2        129-2    <NA>  <NA>      <NA>   <NA> <NA>
3  wallabe #23    <NA>  <NA>      <NA>   <NA> <NA>
4         <NA>    <NA>  <NA> currency:    USD <NA>
5         <NA> balance  <NA>      <NA> result <NA>
6          acc     deb asset      <NA>    win lose
7            2       3     1      <NA>      2    2
8            3       2     1      <NA>      3    2
9            4       5     2      <NA>      1    1
10        <NA>    <NA>  <NA>      <NA>   <NA> <NA>
example2 contains a less clear-cut rectangular region:
6          acc     deb asset      <NA>    win lose
7            2       3     1      <NA>      2    2
8            3       2     1      <NA>      3    2
9            4       5     2      <NA>      1    1
Is there a method to scan such a data frame and locate this kind of region inside it?
Any ideas are appreciated.
You might want to look for the longest run of consecutive rows that have the same number of NAs:
findTable <- function(df){
  naSeq <- rowSums(is.na(df))              # Number of NAs in each row
  myRle <- rle(naSeq)$lengths              # Lengths of the runs of rows with equal NA counts
  df[rep(myRle == max(myRle), myRle), ]    # Keep the rows that belong to the longest run
}
findTable(example1)
    x   y     z
5 acc deb asset
6   2   3     1
7   3   2     1
8   4   5     2
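To see why this picks rows 5 to 8, it can help to look at the intermediate values. The snippet below is only an illustrative walk-through of the same two steps on example1; the variable name runs is introduced here and is not part of the answer's function.

```r
# Rebuild example1 so that this snippet runs on its own
example1 <- data.frame(
  x = c("name", "129-2", NA, NA, "acc", 2, 3, 4, NA, NA),
  y = c(NA, NA, NA, NA, "deb", 3, 2, 5, NA, NA),
  z = c(NA, NA, NA, NA, "asset", 1, 1, 2, NA, NA)
)

naSeq <- unname(rowSums(is.na(example1)))  # NA count for each of the 10 rows
runs  <- rle(naSeq)                        # runs of consecutive equal NA counts

print(naSeq)         # one NA count per row
print(runs$values)   # the NA count shared by the rows of each run
print(runs$lengths)  # how many consecutive rows each run covers
```

The rows of the table all have zero NAs and form the longest run (four rows), which is what findTable selects.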
findTable(example2)
    x   y     z    u   i    o
6 acc deb asset <NA> win lose
7   2   3     1 <NA>   2    2
8   3   2     1 <NA>   3    2
9   4   5     2 <NA>   1    1
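If the extracted block should be tidied further, one possible extension is sketched below. It is not part of the answer above: findTableClean is a hypothetical helper name, and the sketch assumes that the first row of the block holds the variable names and that a column which is NA throughout the block can be dropped. When two runs tie for the longest, it keeps only the first one.

```r
# Rebuild example2 so that this snippet runs on its own
example2 <- data.frame(
  x = c("name", "129-2", "wallabe #23", NA, NA, "acc", 2, 3, 4, NA),
  y = c(NA, NA, NA, NA, "balance", "deb", 3, 2, 5, NA),
  z = c(NA, NA, NA, NA, NA, "asset", 1, 1, 2, NA),
  u = c(NA, NA, NA, "currency:", NA, NA, NA, NA, NA, NA),
  i = c(NA, NA, NA, "USD", "result", "win", 2, 3, 1, NA),
  o = c(NA, NA, NA, NA, NA, "lose", 2, 2, 1, NA)
)

# Hypothetical helper (an assumption for illustration, not the answer's code)
findTableClean <- function(df) {
  naSeq <- unname(rowSums(is.na(df)))      # NA count per row
  runs  <- rle(naSeq)                      # runs of equal NA counts
  best  <- which.max(runs$lengths)         # first longest run if several tie
  last  <- cumsum(runs$lengths)[best]      # last row of that run
  first <- last - runs$lengths[best] + 1   # first row of that run
  block <- df[first:last, , drop = FALSE]

  # Drop columns that are NA in every row of the block
  block <- block[, colSums(!is.na(block)) > 0, drop = FALSE]

  # Assumption: the first row of the block contains the variable names
  header <- vapply(block[1, , drop = FALSE], as.character, character(1))
  out <- block[-1, , drop = FALSE]
  names(out) <- unname(header)
  rownames(out) <- NULL
  out
}

res2 <- findTableClean(example2)
print(res2)
```

The values in the result are still stored as text, so they would need converting before any arithmetic. The heuristic can also be fooled when some other stretch of rows, for example a long header with a constant number of NAs, is longer than the table itself.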