Quotation issues reading data into R


Problem description

I have some data that I am trying to load into R. It is in .csv files, and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada.)

The data is coded in an unusual manner. A typical line is:

12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11

There is a " at the end of Central Nova but not at the beginning. To read in the data, I suppressed the quotes, which worked fine for the first few files, i.e.:

test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")

Now here is the problem: in another file (e.g. pollresults_resultatsbureau12002.csv), there is a line of data like this:

12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28

Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R want to split it into two variables. The data can't be read in, since R would have to add a column halfway through constructing the data frame.
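The mismatch can be reproduced without the files. This is a minimal sketch that uses only the two sample rows quoted above (no header) and counts the fields R sees when quoting is disabled; the variable names are just for illustration.

```r
# The two sample rows from the question, pasted verbatim (header omitted).
rows <- c(
  '12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11',
  '12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28'
)

# With quote = "" every comma is treated as a separator,
# including the one inside "Pictou, Subd. A".
con <- textConnection(rows)
n_fields <- count.fields(con, sep = ",", quote = "")
close(con)

print(n_fields)  # the second row has one more field than the first
```

Running count.fields() over each real file in the same way is one way to find which rows contain embedded commas.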

Excel and OpenOffice can both open these files without a problem. Somehow, they know that quotation marks only matter if they are at the beginning of a variable entry.

Do you know what option I need to enable in R to get this data in? I have >300 files to load (each with ~1000 rows), so a manual fix is not an option...

I have looked all over the place for a solution but can't find one.

Solution

Building on my comments, here is a solution that would read all the CSV files into a single list.

# Deal with French properly
options(encoding="latin1")

# Set your working directory to where you have
#   unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")

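# Get the file names (only the original poll result CSVs, so that
#   any other files in the directory are ignored)
temp <- list.files(pattern = "^pollresults_resultatsbureau.*\\.csv$")

# Extract the 5-digit code which we can use as names
#   (the dot is escaped so that it matches a literal ".")
Codes <- gsub("pollresults_resultatsbureau|\\.csv$", "", temp)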

# Read all the files into a single list named "pollResults"
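pollResults <- lapply(seq_along(temp), function(x) {
  T0 <- readLines(temp[x])
  # Insert the missing opening quote after the first 6 characters
  #   (5-digit code plus comma) on every line except the header
  T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
  read.csv(text = T0, header = TRUE)
})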
names(pollResults) <- Codes
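To see what the substitution does on its own, here is a small self-contained sketch using the two sample rows from the question. It uses a slightly simplified pattern (sub() with an explicit 5-digit prefix) rather than the exact gsub() call above, and reads with header = FALSE because the sample has no header row; both choices are assumptions made for the illustration.

```r
# The two sample rows from the question (no header line).
rows <- c(
  '12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11',
  '12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28'
)

# Add the missing opening quote right after the 5-digit code and comma.
fixed <- sub('^([0-9]{5},)', '\\1"', rows)

# With balanced quotes, the default quote handling keeps
# "Pictou, Subd. A" together as a single field.
sample_df <- read.csv(text = fixed, header = FALSE, stringsAsFactors = FALSE)

print(ncol(sample_df))
print(sample_df$V5)
```

Both rows now parse to the same number of columns, so the data frame can be built without errors.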

You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
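As a rough illustration of working with such a list, the sketch below builds a tiny stand-in for pollResults (the column names and values are made up, not taken from the real files), shows both ways of accessing an element, and stacks everything into one data frame with the district code as an extra column.

```r
# Hypothetical stand-in for pollResults: two small data frames
# named by district code. Columns and values are invented.
pollResults <- list(
  "12002" = data.frame(poll = c("1", "6-1"), votes = c(11L, 28L)),
  "24058" = data.frame(poll = c("2", "3"), votes = c(40L, 7L))
)

# Access by position or by district code; both return the same data frame.
by_index <- pollResults[[2]]
by_name  <- pollResults[["24058"]]
print(identical(by_index, by_name))

# Stack all districts into one data frame, tagging each row with its code.
allResults <- do.call(rbind, Map(
  function(code, df) cbind(district = code, df),
  names(pollResults), pollResults
))
rownames(allResults) <- NULL
print(allResults)
```

Stacking with rbind() assumes every file has the same columns, which would need to be checked against the real data.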

Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".

invisible(lapply(seq_along(pollResults), function(x) {
  NewFilename <- paste("Corrected", temp[x], sep = "_")
  write.csv(pollResults[[x]], file = NewFilename, 
            quote = TRUE, row.names = FALSE)
}))

Hope this helps!
