read.csv and fread produce different results for the same data frame
Problem description
The fread function from the data.table package reads large CSV files faster than the read.csv function. But as you can see from the output below, the two routines return different values for the "device_id" column (see the last 3 digits). Why? Is there a parameter in these functions that reads the values correctly? Or is this normal behavior for fread? (It does read this data file 10x faster, though.)
# Read file
library(data.table)
p<-fread("C:\\Users\\Documents\\Data\\device.csv",sep=",", integer64="character" )
> str(p)
Classes ‘data.table’ and 'data.frame': 187245 obs. of 3 variables:
$ device_id : Factor w/ 186716 levels "-1000025442746372936",..: 89025 96789 140102 123523 45208 118633 32423 22215 54410 81947 ...
$ phone_brand : Factor w/ 131 levels "E<U+4EBA>E<U+672C>""| __truncated__,"E<U+6D3E>""| __truncated__,..: 52 52 16 10 16 32 52 32 52 14 ...
$ device_model: Factor w/ 1598 levels "1100","1105",..: 1517 750 561 1503 537 775 753 433 759 983 ...
- attr(*, ".internal.selfref")=<externalptr>
> head(p)
device_id brand device_model
1: -8890648629457979026 <U+5C0F><U+7C73> <U+7EA2><U+7C73>
2: 1277779817574759137 <U+5C0F><U+7C73> MI 2
3: 5137427614288105724 <U+4E09><U+661F> Galaxy S4
4: 3669464369358936369 SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
5: -5019277647504317457 <U+4E09><U+661F> Galaxy Note 2
6: 3238009352149731868 <U+534E><U+4E3A> Mate
# Read file
p<-read.csv("C:\\Users\\Documents\\Data\\device.csv",sep=",")
# Convert device_id to character
> p$device_id<-as.character(p$device_id)
> str(p)
'data.frame': 187245 obs. of 3 variables:
$ device_id : chr "-8890648629457979392" "1277779817574759168" "5137427614288105472" "3669464369358936576" ...
$ phone_brand : chr "<U+5C0F><U+7C73>""| __truncated__ "<U+5C0F><U+7C73>""| __truncated__ "<U+4E09><U+661F>""| __truncated__ "SUGAR" ...
$ device_model: chr "<U+7EA2><U+7C73>""| __truncated__ "MI 2" "Galaxy S4" "<U+65F6><U+5C1A><U+624B><U+673A>""| __truncated__ ...
> head(p)
device_id brand device_model
1 -8890648629457979392 <U+5C0F><U+7C73> <U+7EA2><U+7C73>
2 1277779817574759168 <U+5C0F><U+7C73> MI 2
3 5137427614288105472 <U+4E09><U+661F> Galaxy S4
4 3669464369358936576 SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
5 -5019277647504317440 <U+4E09><U+661F> Galaxy Note 2
6 3238009352149731840 <U+534E><U+4E3A> Mate
Recommended answer
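As teger elegantly discussed, the read.csv function has a limitation in reading 64-bit numbers: it parses them as doubles, which cannot hold every 19-digit integer exactly, so the trailing digits get rounded. So, as with fread and integer64="character", if the numerals argument is set to "no.loss", read.csv also works, because values that would lose precision are left as text instead of being converted. Thanks to all the contributors to this question.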
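The rounding itself can be reproduced in base R without any file. This is a minimal sketch; the two 16-digit strings are illustrative values chosen around 2^53, not IDs from the data file.

```r
# A double has a 53-bit significand, so above 2^53 not every integer is representable.
limit <- 2^53
print(limit + 1 == limit)        # TRUE: the added 1 is lost to rounding

# Two different ID strings (illustrative values) become the same number
# once they are parsed as doubles.
id_a <- "9007199254740993"       # 2^53 + 1
id_b <- "9007199254740992"       # 2^53
same_as_number <- as.numeric(id_a) == as.numeric(id_b)
same_as_text   <- identical(id_a, id_b)
print(same_as_number)            # TRUE
print(same_as_text)              # FALSE

# The first device_id from the question, pushed through a double and printed in full.
print(format(as.numeric("-8890648629457979026"), scientific = FALSE))
```

Keeping the column as text avoids this rounding, which is what integer64="character" does in the fread call from the question.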
# Read file, keeping device_id exact
p<-read.csv("C:\\Users\\Documents\\Data\\device.csv",sep=",",encoding="UTF-8", numerals="no.loss" )
> head(p)
device_id phone_brand device_model
1: -8890648629457979026 <U+5C0F><U+7C73> <U+7EA2><U+7C73>
2: 1277779817574759137 <U+5C0F><U+7C73> MI 2
3: 5137427614288105724 <U+4E09><U+661F> Galaxy S4
4: 3669464369358936369 SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
5: -5019277647504317457 <U+4E09><U+661F> Galaxy Note 2
6: 3238009352149731868 <U+534E><U+4E3A> Mate
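The same behavior can be checked without the original file. The sketch below makes some assumptions: the temporary file is a stand-in for device.csv, its two rows are copied from rows 2 and 3 of the output in the question, the colClasses variant is an alternative that the answer does not mention, and stringsAsFactors = FALSE is set explicitly so the columns come back as character regardless of R version.

```r
# Build a tiny stand-in for device.csv (rows 2 and 3 from the question's output).
tmp <- tempfile(fileext = ".csv")
writeLines(c("device_id,device_model",
             "1277779817574759137,MI 2",
             "5137427614288105724,Galaxy S4"), tmp)

# Default parsing: device_id is converted to double, which cannot hold these values exactly.
lossy <- read.csv(tmp, stringsAsFactors = FALSE)

# numerals = "no.loss": values that would lose precision are left unconverted (as text).
exact <- read.csv(tmp, numerals = "no.loss", stringsAsFactors = FALSE)

# Alternative (an assumption, not from the answer): declare the column type up front.
typed <- read.csv(tmp, colClasses = c(device_id = "character"),
                  stringsAsFactors = FALSE)

print(class(lossy$device_id))
print(exact$device_id)
print(typed$device_id)
unlink(tmp)
```

Declaring the type with colClasses does not depend on whether a particular value happens to be exactly representable, so it may be the more predictable choice when the column is known to hold identifiers rather than quantities.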