Removing "NUL" characters (within R)

Problem description

I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files. With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and finding and replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> one space) to maintain the intended line width of the file (which is crucial for reading these as fixed-width further down the road).

However, for robustness' sake, I'd prefer a more automatable approach, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files. This question looked promising, but the accepted answer is insufficient: readLines throws an error whenever I try to use it on these files (unless I activate skipNul).

Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?

Solution

You want to read the file as binary; then you can substitute the NULs, e.g. to replace them with spaces:

r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
#  chr [1:155432] "000540952Anderson            Shelley J       FW1949     2000R000000000000119460007620            3  0007000704002097907KGKG1616"| __truncated__ ...
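
To make the byte-level replacement concrete, here is a minimal, self-contained sketch that runs on a throwaway file. The sample bytes and the temporary file names are made up for illustration and are not taken from the original data.

```r
# Build a small throwaway file containing NUL bytes (hypothetical sample data).
tmp_in  <- tempfile(fileext = ".dat")
tmp_out <- tempfile(fileext = ".txt")
sample_bytes <- c(charToRaw("AB"), as.raw(c(0, 0)), charToRaw("CD\n"),
                  charToRaw("EF"), as.raw(c(0, 0)), charToRaw("GH\n"))
writeBin(sample_bytes, tmp_in)

# Read every byte of the file into a raw vector.
r <- readBin(tmp_in, what = raw(), n = file.size(tmp_in))

# Replace each NUL byte (0x00) with a space (0x20); the byte count is unchanged.
r[r == as.raw(0)] <- as.raw(0x20)
writeBin(r, tmp_out)

# The cleaned file no longer contains NULs, so readLines works without skipNul.
cleaned <- readLines(tmp_out)
print(cleaned)
print(nchar(cleaned))
```

Because every NUL becomes exactly one space, a pair of NULs turns into two spaces and the file keeps its original size. If each pair should collapse into a single space, the replacement has to happen on the string level instead.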

You could also substitute the NULs with a really rare character (such as "\01", assuming that byte does not otherwise occur in the file) and work on the string in place. For example, say you want to replace each pair of NULs with one space:

r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson            Shelley J       FW1949     2000R000000000000119460007620            3  0007000704002097907KGKG1616"| __truncated__
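
Since the question mentions about 10 such files, the placeholder approach could be wrapped in a small function and applied to each file. The following is only a sketch: the name clean_nul_pairs, its placeholder argument, and the sample bytes are hypothetical, and it assumes that NULs always come in pairs and that lines end in "\n".

```r
# Hypothetical helper: read a file as raw bytes, collapse each NUL pair into
# one space, and return the lines as a character vector.
clean_nul_pairs <- function(path, placeholder = as.raw(0x01)) {
  r <- readBin(path, what = raw(), n = file.size(path))

  # Assumption: the placeholder byte must not already occur in the file,
  # otherwise genuine data would be rewritten along with the NULs.
  if (any(r == placeholder)) {
    stop("placeholder byte already present in ", path)
  }

  # rawToChar() rejects embedded NULs, so swap them for the placeholder first.
  r[r == as.raw(0)] <- placeholder

  # Two consecutive placeholders stand for one NUL pair, i.e. one space.
  pair <- rawToChar(c(placeholder, placeholder))
  txt  <- gsub(pair, " ", rawToChar(r), fixed = TRUE)

  strsplit(txt, "\n", fixed = TRUE)[[1]]
}

# Demo on a throwaway file (made-up bytes, not the original data).
tmp <- tempfile(fileext = ".dat")
writeBin(c(charToRaw("AB"), as.raw(c(0, 0)), charToRaw("CD\n"),
           charToRaw("EF"), as.raw(c(0, 0, 0, 0)), charToRaw("GH\n")), tmp)
lines <- clean_nul_pairs(tmp)
print(lines)
```

With a vector of file names in hand, something like lapply(files, clean_nul_pairs) would then clean all of them at the start of a script. An unpaired NUL would be left behind as the placeholder byte, so it may be worth checking the result for it before parsing the lines as fixed-width.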
