Reading England and Wales Charity Commission bcp files in R

Problem description

I'm trying to read the .bcp files provided by https://register-of-charities.charitycommission.gov.uk/register/full-register-download in R. I have tried the approaches from previously answered questions here, but readChar does not seem to read everything in all of the files; in particular, it breaks on extract_charity.bcp.

So I turned to readBin and tried to read extract_charity.bcp like this:

library(stringr)

b <- readBin("extract_charity.bcp", "character", n = 300000, size = NA_integer_,
             endian = .Platform$endian)

c<- paste0(b, collapse = "" ) #put it back as one large character string

d<- str_locate_all(c, "\\*\\@\\@\\*\\d") #find row breaks followed by a digit

e <- d[[1]]

flags <- e[,1]

f <- c()

f[1] <- substr(c, 1, flags[1]-1)

for (i in 2:length(flags)) f[i]<- substr(c, flags[i-1]+4, flags[i]-1) #removes row breaks

# One row per record; rows that do not split into 18 fields are collected in exportF
export <- matrix(nrow = length(flags), ncol = 18)
exportF <- matrix(nrow = 0, ncol = 18)

for (j in 1:length(flags)) {
  new_row <- str_split(f[j], "\\@\\*\\*\\@")[[1]]  # split on column breaks
  if (length(new_row) == 18) {
    export[j, ] <- new_row  # correct number of columns
  } else {
    print(j)
    exportF <- rbind(exportF, new_row)
  }
}

However, 49 rows fail, all in the same way: a strange character string is inserted at various places across the table. Currently it is "P`j[Ÿ ", but when I run the script again it is "°Tj[Ÿ ". Because the string is different on every run, I cannot simply hard-code it for removal, and trying to remove it with str_replace_all fails:

str_replace_all(c, problem, "") 

Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement),  : 
  Missing closing bracket on a bracket expression. (U_REGEX_MISSING_CLOSE_BRACKET)
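
The error arises because str_replace_all treats its pattern as a regular expression by default, and the stray string contains an unmatched "[", which opens a bracket expression. A minimal sketch of literal matching follows, using base R's gsub with fixed = TRUE. The sample text and the names sample_text and stray are made up for illustration and are not taken from the real file.

```r
# Hypothetical sample: a record with a stray string containing "[" inside it
sample_text <- "1001@**@Some P`j[ Charity@**@London"
stray <- "P`j[ "

# As a regular expression the pattern is invalid, because "[" opens a
# bracket expression that is never closed
regex_result <- tryCatch(
  suppressWarnings(gsub(stray, "", sample_text)),
  error = function(e) "regex error"
)

# With fixed = TRUE the pattern is matched literally, character by character
cleaned <- gsub(stray, "", sample_text, fixed = TRUE)

print(regex_result)
print(cleaned)
```

With stringr, the equivalent literal match would be str_replace_all(c, fixed(problem), "").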

Recommended answer

Just to let the world know, this can be done. In the first pass, the file is parsed and the rows that fail are stored in exportF; the stray string is identified there and deleted from the data that was read in. In the second pass, the cleaned data is parsed correctly.

This is a mess, but it works, and pretty fast, too.

library(stringr)
library(stringi)


b <- readBin("extract_charity.bcp", "character", n = 300000, size = NA_integer_)

c<- paste0(b, collapse = "" )

tt<- str_locate_all(c, "\\*\\@\\@\\*\\d")

e <- tt[[1]]

flags <- e[,1]

f <- c()

f[1] <- substr(c, 1, flags[1]-1)

for (i in 2:length(flags)) {
  f[i]<- substr(c, flags[i-1]+4, flags[i]-1)
}



export <- matrix(nrow = length(flags), ncol = 18)
exportF <- matrix(nrow = 0, ncol = 18)


for (j in 1:length(flags)) {
  new_row <- str_split( f[j], "\\@\\*\\*\\@" )[[1]]
  if (length(new_row)==18) { export[j, ] <- new_row
  } else {print(flags[j])
      exportF <- rbind(exportF, new_row) }}


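# Go through the first failed row, see where the problem is, and extract the stray string
# (column 8, characters 5 to 10, is where it sat in this particular run)
problem <- str_sub(as.character(exportF[1, 8]), 5, 10)

# Check that the extracted string is correct; fixed() matches it literally,
# so regex metacharacters such as "[" in it cannot break the pattern
problem %in% str_sub(exportF[1, 8], 5, 10)

str_detect(exportF[1, 8], fixed(problem))

str_detect(c, fixed(problem))

str_detect(b[324], fixed(problem))

# Remove the stray string from every chunk, again matching it literally
r <- gsub(problem, "", b, fixed = TRUE)

any(str_detect(r, fixed(problem)))  # should be FALSE once the string is gone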

#now go again but with clean data

r<- paste0(r, collapse = "" )
tt<- str_locate_all(r, "\\*\\@\\@\\*\\d")

e <- tt[[1]]

flags <- e[,1]

f <- c()


f[1] <- substr(r, 1, flags[1]-1)

for (i in 2:length(flags)) {
  f[i]<- substr(r, flags[i-1]+4, flags[i]-1)
}

#g<- str_split(f[372432], "\\@\\*\\*\\@")[[1]]

export <- matrix(nrow = length(flags), ncol = 18)  # sized from the data, not hard-coded
exportF <- matrix(nrow = 0, ncol = 18)


for (j in 1:length(flags)) {
  new_row <- str_split( f[j], "\\@\\*\\*\\@" )[[1]]
  if (length(new_row)==18) { export[j, ] <- new_row
  } else {print(flags[j])
    exportF <- rbind(exportF, new_row) }}

write.csv(export, "extract_charity2021.csv", row.names = FALSE)
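
The loop-based parsing above can probably be condensed. The sketch below, run on made-up data, splits text into rows and fields with base R's strsplit and fixed = TRUE. It assumes that "*@@*" ends every row and "@**@" separates fields, as the patterns above suggest, and unlike the code above it does not require a digit after the row break. The helper parse_bcp_text and the sample string are hypothetical and not part of the original answer.

```r
# Hypothetical helper: split bcp-style text into a character matrix
parse_bcp_text <- function(txt, n_col, row_sep = "*@@*", col_sep = "@**@") {
  rows <- strsplit(txt, row_sep, fixed = TRUE)[[1]]
  rows <- rows[nzchar(rows)]
  # strsplit drops a trailing empty field, so append one separator to keep it
  fields <- strsplit(paste0(rows, col_sep), col_sep, fixed = TRUE)
  ok <- lengths(fields) == n_col
  list(
    good = do.call(rbind, fields[ok]),  # rows with the expected field count
    bad_rows = which(!ok)               # indices of rows needing inspection
  )
}

# Made-up sample: two well-formed rows (one with an empty last field)
# and one row that is missing a field
sample_txt <- paste0(
  "1@**@Alpha Trust@**@London*@@*",
  "2@**@Beta Fund@**@*@@*",
  "3@**@broken row*@@*"
)
parsed <- parse_bcp_text(sample_txt, n_col = 3)
print(parsed$good)
print(parsed$bad_rows)
```

Collecting the indices of malformed rows, rather than printing them one by one, keeps the first pass quiet and makes it easy to inspect only the rows that failed.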

Leaving it here for my future self or anyone else who needs to do this.
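
One possible explanation for the run-to-run variation, offered as an assumption rather than a diagnosis, is that readBin with what = "character" reads zero-terminated strings, so a file containing embedded NUL bytes may not come back intact. A sketch of reading the file as raw bytes and dropping NULs before conversion follows. The temporary file stands in for extract_charity.bcp, and read_bcp_raw is a hypothetical helper.

```r
# Hypothetical helper: read a whole file as bytes, drop NUL bytes,
# and return the remainder as a single string
read_bcp_raw <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  bytes <- bytes[bytes != as.raw(0)]  # rawToChar() fails on embedded NULs
  rawToChar(bytes)
}

# Demo on a temporary file standing in for extract_charity.bcp
tmp <- tempfile(fileext = ".bcp")
writeBin(c(charToRaw("1@**@Alpha"), as.raw(0), charToRaw(" Trust*@@*")), tmp)
txt <- read_bcp_raw(tmp)
unlink(tmp)
print(txt)
```

If this applies to the real file, the resulting string could be parsed in a single pass; non-ASCII text may additionally need its encoding declared.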
