Reading badly formed CSV in R - mismatched quotes


Question

I have hundreds of large CSV files (each between 10k and 100k lines), and some of them have badly formed descriptions with quotes within quotes, so they might look something like:

ID,Description,x
3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486



I need to be able to cleanly parse all of these lines in R as CSV. dput()'ing it and reading ...

txt <- c("ID,Description,x",
    "3434,\"abc\"def\",988",
    "2344,\"fred\",3484", 
    "2345,\"fr\"\"ed\",3485",
    "2346,\"joe,fred\",3486")

read.csv(text=txt[1:4], colClasses='character')
    Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
      incomplete final line found by readTableHeader on 'text'

If we disable quoting and leave out the last line (the one with the embedded comma), it works well:

read.csv(text=txt[1:4], colClasses='character', quote='')

However, if we disable quoting and include the last line with the embedded comma, it fails:

read.csv(text=txt[1:5], colClasses='character', quote='')
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
      line 1 did not have 4 elements

EDIT x2: I should have said that, unfortunately, some of the descriptions have commas in them. The code above has been edited accordingly.
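
To find out which lines are actually affected before choosing a fix, one option is base R's count.fields(), which reports the number of fields per line. This is only a diagnostic sketch using the sample data from the question; the variable names n_fields and bad_lines are made up for illustration.

```r
txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

# Count comma-separated fields per line with quote handling switched off,
# so stray quotes cannot swallow the rest of the input
con <- textConnection(txt)
n_fields <- count.fields(con, sep = ",", quote = "")
close(con)

# Lines whose field count differs from the header's field count
bad_lines <- which(n_fields != n_fields[1])
```

With quoting disabled, a line whose description contains a comma shows up with one field too many, which is the same mismatch that makes read.csv(..., quote='') fail.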

Recommended answer

Change the quote setting (this works as long as no description contains a comma):

read.csv(text=txt, colClasses='character',quote = "")

    ID Description    x
1 3434   "abc"def"  988
2 2344      "fred" 3484
3 2345    "fr""ed" 3485
4 2346       "joe" 3486



Edit to deal with errant commas:

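txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

# Read the raw lines (for a real file, pass its path to readLines() instead)
txt2 <- readLines(textConnection(txt))

# Split every line on commas; a description containing a comma yields extra pieces
txt2 <- strsplit(txt2, ",")

# Keep the first and last piece, and glue everything in between back together
txt2 <- lapply(txt2, function(x) c(x[1], paste(x[2:(length(x)-1)], collapse = ","), x[length(x)]))
m <- do.call("rbind", txt2)
df <- as.data.frame(m, stringsAsFactors = FALSE)

# Promote the first row to column names, then drop it
names(df) <- unlist(df[1, ])
df <- df[-1, ]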

#     ID Description    x
# 2 3434   "abc"def"  988
# 3 2344      "fred" 3484
# 4 2345    "fr""ed" 3485
# 5 2346  "joe,fred" 3486
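
The Description column above still carries the raw quote characters. If the goal is the unquoted text, a possible follow-up step is to strip one outer quote from each end and collapse doubled quotes. This is a sketch that assumes every description is wrapped in exactly one pair of outer quotes, as in the sample data; the data frame is rebuilt by hand here only so the snippet runs on its own.

```r
# Same content as the result shown above, rebuilt so the snippet is standalone
df <- data.frame(
  ID = c("3434", "2344", "2345", "2346"),
  Description = c("\"abc\"def\"", "\"fred\"", "\"fr\"\"ed\"", "\"joe,fred\""),
  x = c("988", "3484", "3485", "3486"),
  stringsAsFactors = FALSE
)

# Drop one leading and one trailing double quote, if present
desc <- sub("^\"", "", df$Description)
desc <- sub("\"$", "", desc)

# Collapse CSV-style doubled quotes ("") into a single quote character
df$Description <- gsub("\"\"", "\"", desc, fixed = TRUE)
```

Whether an unescaped inner quote such as the one in abc"def should be kept as-is is a judgment call; this sketch simply leaves it in place.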

No idea whether that is sufficiently efficient for your use case.
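
If speed turns out to matter across hundreds of files, a variant worth timing is to split each line with a single regular expression instead of strsplit() plus paste(). The sketch below makes the same assumption as the code above: exactly three columns, and only the middle one may contain commas. The function name parse_three_cols is hypothetical, not part of any package.

```r
# Hypothetical helper: split each line at its first and last comma only
parse_three_cols <- function(lines) {
  # Group 1 and group 3 may not contain commas; group 2 takes everything between
  pattern <- "^([^,]*),(.*),([^,]*)$"
  parts <- regmatches(lines, regexec(pattern, lines))

  # A successful match yields the full match plus three capture groups
  ok <- lengths(parts) == 4
  if (!all(ok)) {
    stop("lines not matching the pattern: ", paste(which(!ok), collapse = ", "))
  }

  # Bind into a matrix and drop the full-match column
  m <- do.call(rbind, parts)[, -1, drop = FALSE]

  # First row holds the header; the rest is data
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, ]
  df
}

txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

df <- parse_three_cols(txt)
```

For real files, the input could come from readLines() on each path, with the helper applied per file via lapply(). I have not benchmarked this against the strsplit() version, so treat any speed difference as something to measure rather than assume.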
