'Embedded nul in string' when importing large CSV (8 GB) with fread()
Problem description
I have a large CSV file (8.1 GB) that I'm trying to wrangle into R. I created the CSV with in2csv from Python's csvkit, converting from a .txt file, but somehow the conversion left null characters in the file. I now get this error when importing:
Error in fread("file.csv", nrows = 100) :
  embedded nul in string: 'ÿþr\0e\0c\0d\0_\0z\0i\0p\0c\0'
I am able to import small chunks just fine with read.csv, but only because it allows for UTF-16 encoding via the fileEncoding argument.
test <- read.csv("file.csv", nrows=100, fileEncoding="UTF-16LE")
I don't dare try to import an 8 GB file with read.csv, though.
So I then tried the solution offered here, in which you use sed s/\\0//g file.csv > file2.csv to pull the nulls out. The command ran without complaint and produced a new 8 GB CSV file, but I received a nearly identical error:
Error in fread("file2.csv", nrows = 100) :
  embedded nul in string: 'ÿþr\0e\0c\0d\0_\0z\0i\0p\0c\0,\0p\0o\0s\0t\0_\0z\0i
So, that didn't work, and I'm stumped at this point. Considering the size of the file, I can't use read.csv on the whole thing, and I'm not sure how to get rid of the nulls in the original CSV. I'm not even sure how the file got encoded as UTF-16. Any suggestions or advice would be greatly appreciated.
Edit: I'm on a Windows machine.
Recommended answer
If you're on Linux/Mac, try this:
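library(data.table)
file <- "file.csv"
tt <- tempfile()  # or tempfile(tmpdir = "/dev/shm") to keep the temp file in RAM
# Delete every NUL byte with tr and write the result to the temp file
system(paste0("tr -d '\\000' < ", shQuote(file), " > ", shQuote(tt)))
fread(tt)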
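Note that deleting NUL bytes with tr leaves the two-byte byte-order mark (the ÿþ at the start of the error message) in place, and any non-ASCII characters in the data would not come out as valid UTF-8. As an alternative, here is a minimal sketch that transcodes the file instead, assuming iconv is installed. The helper name utf16_to_utf8 and the sample file names are made up for illustration and are not part of the original answer.

```shell
# Hypothetical helper: transcode a UTF-16 file (with BOM) to UTF-8.
# Usage: utf16_to_utf8 input.csv output.csv
utf16_to_utf8() {
  # -f UTF-16 lets iconv use the byte-order mark to pick the endianness
  iconv -f UTF-16 -t UTF-8 "$1" > "$2"
}

# Small self-check: write a two-line sample as UTF-16LE with a BOM,
# convert it, and print the result
printf '\377\376a\000,\000b\000\n\0001\000,\0002\000\n\000' > sample_utf16.csv
utf16_to_utf8 sample_utf16.csv sample_utf8.csv
cat sample_utf8.csv
```

For the real file this would be something like utf16_to_utf8 file.csv file_utf8.csv, after which fread("file_utf8.csv") should no longer encounter embedded NULs. iconv streams its input, so the 8 GB file should not need to fit in memory, though this has not been tested at that size here.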