'Embedded nul in string' when importing large CSV (8 GB) with fread()

Question

I have a large CSV file (8.1 GB) that I'm trying to wrangle into R. I created the CSV using Python's csvkit in2csv, converted from a .txt file, but somehow the conversion led to null characters showing up in the file. I'm now getting this error when importing:

Error in fread("file.csv", nrows = 100) :
  embedded nul in string: 'ÿþr\0e\0c\0d\0_\0z\0i\0p\0c\0'

I am able to import small chunks just fine with read.csv though, but that's because it allows for UTF-16 encoding via the fileEncoding argument.

test <- read.csv("file.csv", nrows=100, fileEncoding="UTF-16LE")

I don't dare try to import an 8 GB file with read.csv, though.

So I then tried the solution offered here, in which you use sed s/\\0//g file.csv > file2.csv to pull the nulls out. The command performed just fine and populated a new 8GB CSV file, but I received a nearly-identical error:

Error in fread(... csv, nrows = 100) :
  embedded nul in string: 'ÿþr\0e\0c\0d\0_\0z\0i\00\0c\0,\0p \0o\0s\0t\0_\0z\0i

So, that didn't work. I'm stumped at this point. Considering the size of the file, I can't use read.csv on the whole thing, and I'm not sure how to get rid of the nulls in the original CSV. I'm not even sure how the file got encoded as UTF-16. Any suggestions or advice would be greatly appreciated at this point.

Edit: I'm on a Windows machine.
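
A note on the error text: the leading 'ÿþ' is how the bytes 0xFF 0xFE look when displayed as Latin-1, and 0xFF 0xFE is the UTF-16LE byte-order mark. That matches read.csv succeeding with fileEncoding="UTF-16LE", and it suggests the "nulls" are the high bytes of UTF-16 characters rather than stray corruption. The sketch below shows one way to check the first two bytes of a file. It assumes a Unix-like shell with od available, and sample.csv is a hypothetical stand-in built on the spot, not the real 8 GB file.

```shell
# Hypothetical sample: the text "recd" encoded as UTF-16LE with a byte-order mark
workdir=$(mktemp -d)
printf '\377\376r\000e\000c\000d\000' > "$workdir/sample.csv"

# Dump the first two bytes as hex and strip the padding od adds
bom=$(od -An -tx1 -N2 "$workdir/sample.csv" | tr -d ' \n')

# "fffe" means a UTF-16LE byte-order mark
echo "$bom"   # prints fffe
```

If the same check on the real file prints fffe, re-encoding the file is likely a better fix than deleting bytes.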

Recommended answer

If you're on Linux/Mac, try this:

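library(data.table)

file <- "file.csv"
tt <- tempfile()  # or tempfile(tmpdir = "/dev/shm") to keep the copy in RAM on Linux
# Delete every NUL byte (octal 000) with tr and write the result to the temp file
system(paste0("tr < ", shQuote(file), " -d '\\000' > ", shQuote(tt)))
fread(tt)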
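
One caveat with deleting NUL bytes: on a UTF-16LE file it leaves the two byte-order-mark bytes at the start of the output, and it only yields correct text when every character is plain ASCII. An alternative, offered here as a sketch and not as part of the original answer, is to re-encode the file with iconv. It assumes a Unix-like shell where iconv is installed. The file names are hypothetical, and the sample is a tiny stand-in for the real CSV.

```shell
# Hypothetical sample standing in for the real file: "a,b\n1,2\n" as UTF-16LE with a BOM
workdir=$(mktemp -d)
printf '\377\376a\000,\000b\000\n\0001\000,\0002\000\n\000' > "$workdir/file.csv"

# Re-encode the whole file; with -f UTF-16, iconv uses the BOM to pick the byte order
iconv -f UTF-16 -t UTF-8 "$workdir/file.csv" > "$workdir/file_utf8.csv"

# Show the converted text
cat "$workdir/file_utf8.csv"
```

The converted file could then be passed to fread() in place of the original. iconv streams its input, so it should not need to hold the whole 8 GB file in memory.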
