Removing "NUL" characters (within R)
Problem description

I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files.

With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and find-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> [space]) to maintain the intended line width of the file (which is crucial for reading these as fixed-width further down the road).

However, for robustness' sake, I'd prefer a more automatable approach, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files.

This question looked promising, but the accepted answer is insufficient: readLines throws an error whenever I try to use it on these files (unless I activate skipNul).

Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?

Answer

You want to read the file as binary; then you can substitute the NULs, e.g. to replace them with spaces:

    # Read the whole file as a raw (byte) vector
    r <- readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
    r[r == as.raw(0)] <- as.raw(0x20)  ## replace with 0x20 = <space>
    writeBin(r, "00staff.txt")
    str(readLines("00staff.txt"))
    # chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__ ...

You could also substitute the NULs with a really rare character (such as "\01") and work on the string in place. For example, say you want to replace each pair of NULs ("\00\00") with one space:

    r <- readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
    r[r == as.raw(0)] <- as.raw(1)  # NUL -> placeholder byte 0x01
    # Collapse each pair of placeholders into a single space
    a <- gsub("\01\01", " ", rawToChar(r), fixed = TRUE)
    # Split the single long string into lines
    s <- strsplit(a, "\n", fixed = TRUE)[[1]]
    str(s)
    # chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__
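Since the question mentions about 10 such files and asks for something that can sit at the top of a script, the second approach could be wrapped in a small function. The following is only a sketch under a few assumptions: `read_without_nul` is a hypothetical name, the data is plain ASCII, and the placeholder byte 0x01 does not already occur in the file (the function stops if it does). The demo file is built in a temporary location so the example runs on its own.

```r
# Hypothetical helper: read a file as raw bytes and return its lines,
# with each pair of NUL bytes collapsed into a single space.
read_without_nul <- function(path) {
  bytes <- readBin(path, what = raw(), n = file.info(path)$size)

  # Assumption: 0x01 is free to use as a placeholder; fail loudly otherwise.
  stopifnot(!any(bytes == as.raw(1)))

  bytes[bytes == as.raw(0)] <- as.raw(1)             # NUL -> placeholder
  text <- rawToChar(bytes)                           # safe now: no NULs left
  text <- gsub("\001\001", " ", text, fixed = TRUE)  # pair -> one space
  strsplit(text, "\n", fixed = TRUE)[[1]]
}

# Build a tiny demo file containing two NUL pairs (illustrative data only).
demo_path <- tempfile(fileext = ".dat")
writeBin(
  c(charToRaw("AB"), as.raw(c(0, 0)), charToRaw("CD\nEF"),
    as.raw(c(0, 0)), charToRaw("GH\n")),
  demo_path
)

lines <- read_without_nul(demo_path)
print(lines)
# [1] "AB CD" "EF GH"

# The same helper can be mapped over a vector of file paths.
all_lines <- lapply(c(demo_path), read_without_nul)
```

An unpaired NUL would remain in the result as a single 0x01 character, which makes it easy to detect afterwards with `grepl("\001", lines, fixed = TRUE)`.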