Importing from CSV from a specified range of values


Question

I am trying to read in a CSV file and I am running into the following error.

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1097 did not have 5 elements

After further inspection of the CSV file, I found that around line 1097 there is a break in the rows, after which a new header starts for the annualised data (I am only interested in the monthly data for now).

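# download the zip archive to a temporary file and extract the CSV
temp <- tempfile()
download.file("http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip", temp, mode = "wb")
unzip(temp, "F-F_Research_Data_Factors.CSV")
# skip the first 3 lines, then read the header and the first 100 data rows
French <- read.table("F-F_Research_Data_Factors.CSV", sep = ",", skip = 3, header = TRUE, nrows = 100)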

The above code downloads the zip file and imports the first 100 rows of the CSV file into R, which works perfectly. However, the first 100 rows (used here for illustrative purposes) are data points from the 1920s and 1930s, which is not what I am particularly interested in.

My question is: how can I import rows based on the value in the first comma-separated column of the CSV file, e.g. from 192607 (1926-07) until, say, 195007 (1950-07)? I am able to import everything up to the most recent values by changing to nrows = 1095, but this is not exactly what I am trying to achieve.

Data snapshot:

,Mkt-RF,SMB,HML,RF
192607,    2.96,   -2.30,   -2.87,    0.22
192608,    2.64,   -1.40,    4.19,    0.25
192609,    0.36,   -1.32,    0.01,    0.23

... line 1100

 Annual Factors: January-December 
,Mkt-RF,SMB,HML,RF
  1927,   29.47,   -2.46,   -3.75,    3.12
  1928,   35.39,    4.20,   -6.15,    3.56


Recommended answer

The first table in the file lies between the first two zero-length lines, so the following reads it in without the junk before and after it, and then subsets it on the indicated dates:

# read first table in file
Lines <- readLines("F-F_Research_Data_Factors.CSV")
ix <- which(Lines == "")
DF0 <- read.csv(text = Lines[ix[1]:ix[2]])  # all rows in first table

# subset it to indicated dates
DF <- subset(DF0, X >= 192607 & X <= 195007)
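
To try the approach above without downloading the file, here is a minimal sketch on a small in-memory sample. The sample lines are made up to mimic the layout shown in the question (some text, a blank line, the monthly table, a blank line, then the annual table); the names `sample_lines`, `monthly` and `subset_df` are illustrative and not part of the original answer, and the real file may differ in details.

```r
# Hypothetical sample mimicking the layout of F-F_Research_Data_Factors.CSV
sample_lines <- c(
  "Some descriptive text before the tables",
  "",
  ",Mkt-RF,SMB,HML,RF",
  "192607,    2.96,   -2.30,   -2.87,    0.22",
  "192608,    2.64,   -1.40,    4.19,    0.25",
  "192609,    0.36,   -1.32,    0.01,    0.23",
  "",
  " Annual Factors: January-December ",
  ",Mkt-RF,SMB,HML,RF",
  "  1927,   29.47,   -2.46,   -3.75,    3.12",
  "  1928,   35.39,    4.20,   -6.15,    3.56",
  ""
)

# The first two blank lines delimit the first (monthly) table;
# read.csv skips the blank lines at either end of the slice
ix <- which(sample_lines == "")
monthly <- read.csv(text = sample_lines[ix[1]:ix[2]])

# The unnamed first column is read in as "X"; keep a range of yyyymm values
subset_df <- subset(monthly, X >= 192607 & X <= 192608)

# Optional: turn the yyyymm key into a Date (first day of each month)
subset_df$date <- as.Date(paste0(subset_df$X, "01"), format = "%Y%m%d")
```

Because the yyyymm keys are read as plain integers, the range filter is an ordinary numeric comparison; the Date column is only needed for plotting or date arithmetic.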

Note: If we want all the tables, it appears that lines beginning with a comma start each table and blank lines end them (except that the first blank line comes before the tables). So, using Lines from above, this gives a list L whose ith component is the ith table in the file.

st <- grep("^,", Lines)  # starting line numbers
en <- which(Lines == "")[-1]  # ending line numbers
L <- Map(function(st, en) read.csv(text = Lines[st:en]), st, en)
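
The same splitting logic can be checked on a small made-up sample. This is only a sketch under the assumption that every table is followed by a blank line, so that the start and end vectors have the same length; `sample_lines` and `tables` are illustrative names, not part of the original answer.

```r
# Hypothetical sample: text, blank line, monthly table, blank line, annual table, blank line
sample_lines <- c(
  "Some descriptive text before the tables",
  "",
  ",Mkt-RF,SMB,HML,RF",
  "192607,    2.96,   -2.30,   -2.87,    0.22",
  "192608,    2.64,   -1.40,    4.19,    0.25",
  "192609,    0.36,   -1.32,    0.01,    0.23",
  "",
  " Annual Factors: January-December ",
  ",Mkt-RF,SMB,HML,RF",
  "  1927,   29.47,   -2.46,   -3.75,    3.12",
  "  1928,   35.39,    4.20,   -6.15,    3.56",
  ""
)

st <- grep("^,", sample_lines)        # header line of each table
en <- which(sample_lines == "")[-1]   # blank line closing each table

# The pairing below only makes sense if starts and ends line up one-to-one
stopifnot(length(st) == length(en))

tables <- Map(function(s, e) read.csv(text = sample_lines[s:e]), st, en)
names(tables) <- c("monthly", "annual")
```

If the real file does not end with a blank line, `en` would be one element short; in that case appending `length(Lines)` to `en` is one possible workaround.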
