如何在r中格式化具有重复日期的列 [英] how to format a column with duplicate dates in r

查看:291
本文介绍了如何在r中格式化具有重复日期的列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个相当大的数据集保存在 .txt 文件中,我使用 fread() from data.table 。我想组织数据,看起来像这样:

I have a fairly large dataset saved in a .txt file, that i read into my environment using fread() from data.table. I would like to organize the data so that it looks like this:

# Desired output: Column headers = [ Date, Company Names, RET, RETX,PRC, vwretd, ewrtd, sprtrn]
# rows contain stock return data... (I made up the values below to save time, but the format must remain).


   Date        AMERICAN CAR & FDRY CO   ALASKA JUNEAU GOLD MNG C      RET       RETX   vwretd ...
1925-12-31          0.0432                    0.0231                 0.0032     0.053   ...
1926-01-02          0.0210                    0.0133                 0.0124     0.003   ...
1926-01-04          0.0324                    0.0131                 0.0134     0.023   ...
...                 ...                      ...                      ...        ...    ...

唯一的问题是我遇到麻烦,因为我的 crsp $ date 列有重复的日期
(SEE REPRODUCIBLE DATA),因为数据似乎是行绑定

The only problem is that I am having trouble because my crsp$date column has duplicate Dates (SEE REPRODUCIBLE DATA), as the data seems to be row binded

这是我使用 fread()

crsp <- fread("/Volumes/share/CRSP/CRSP_RETS/1925-1930.txt", sep="\t", header=TRUE)
crsp <- as.data.frame(crsp)

导入后,我运行以下格式,以格式化 date

After import I run the following to format date column

crsp$DATE <- as.Date(as.character(crsp$date),format="%Y%m%d")

然后我尝试转换为XTS



I then try to convert to XTS

data <- xts(crsp[,3:11], crsp$DATE)

data 返回以下内容:[ dput(data) code>]

data returns the following: [ dput(data) ]

data <- structure(c(NA, NA, NA, NA, NA, NA, "AMERICAN CAR & FDRY CO", 
"AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO", "ALASKA JUNEAU GOLD MNG CO", 
"AMERICAN CAR & FDRY CO", "ALASKA JUNEAU GOLD MNG CO", "A", "A", 
"A", "A", "A", "A", "109.0000", "109.1250", "111.0000", "  1.8750", 
"110.5000", "  1.7500", "C", "0.001147", "0.017182", "C", "-0.004505", 
"-0.066667", "C", "0.001147", "0.017182", "C", "-0.004505", "-0.066667", 
NA, " 0.005893", " 0.001277", " 0.001277", "-0.003984", "-0.003984", 
NA, " 0.009516", " 0.005780", " 0.005780", "-0.001927", "-0.001927", 
NA, NA, NA, NA, NA, NA), .indexCLASS = "Date", tclass = "Date", .indexTZ = "UTC", tzone = "UTC",    
class = c("xts", 
"zoo"), index = structure(c(-1388620800, -1388448000, -1388275200, 
-1388275200, -1388188800, -1388188800), tzone = "UTC", tclass = "Date"), .Dim = c(6L, 
9L), .Dimnames = list(NULL, c("TICKER", "COMNAM", "TRDSTAT", 
"PRC", "RET", "RETX", "vwretd", "ewretd", "sprtrn")))



可重复数据



REPRODUCIBLE DATA

crsp <- structure(list(PERMNO = c(10006L, 10006L, 10006L, 10006L, 10006L, 
10006L, 10006L, 10006L, 10006L, 10014L, 10014L, 10014L, 10014L, 
10014L, 10014L, 10014L, 10014L), date = c(19251231L, 19260102L, 
19260104L, 19260105L, 19260106L, 19260107L, 19301229L, 19301230L, 
19301231L, 19260104L, 19260105L, 19260106L, 19260107L, 19260108L, 
19260109L, 19260111L, 19260112L), TICKER = c(NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), COMNAM = c("AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO", 
"AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO", 
"AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO", 
"AMERICAN CAR & FDRY CO", "ALASKA JUNEAU GOLD MNG CO", "ALASKA JUNEAU GOLD MNG CO", 
"ALASKA JUNEAU GOLD MNG CO", "ALASKA JUNEAU GOLD MNG CO", "ALASKA JUNEAU GOLD MNG CO", 
"ALASKA JUNEAU GOLD MNG CO", "ALASKA JUNEAU GOLD MNG CO", "ALASKA JUNEAU GOLD MNG CO"
), TRDSTAT = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A"), PRC = c(109, 109.125, 111, 
110.5, 110.5, 110.5, 24.25, 24.5, 27.5, 1.875, 1.75, 1.75, -1.8125, 
1.75, -1.6875, 1.625, NA), RET = c("C", "0.001147", "0.017182", 
"-0.004505", "0.000000", "0.000000", "-0.034826", "0.010309", 
"0.122449", "C", "-0.066667", "0.000000", "0.035714", "-0.034483", 
"-0.035714", "-0.037037", ""), RETX = c("C", "0.001147", "0.017182", 
"-0.004505", "0.000000", "0.000000", "-0.034826", "0.010309", 
"0.122449", "C", "-0.066667", "0.000000", "0.035714", "-0.034483", 
"-0.035714", "-0.037037", ""), vwretd = c(NA, 0.005893, 0.001277, 
-0.003984, -0.000172, 0.007211, -0.002367, 0.020064, 0.016505, 
0.001277, -0.003984, -0.000172, 0.007211, -0.000804, 0.003384, 
-0.00868, -0.001027), ewretd = c(NA, 0.009516, 0.00578, -0.001927, 
0.001182, 0.008453, -0.017949, 0.016016, 0.052093, 0.00578, -0.001927, 
0.001182, 0.008453, -0.001689, 0.003312, -0.009943, -0.003623
), sprtrn = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_)), .Names = c("PERMNO", 
"date", "TICKER", "COMNAM", "TRDSTAT", "PRC", "RET", "RETX", 
"vwretd", "ewretd", "sprtrn"), row.names = c(1L, 2L, 3L, 4L, 
5L, 6L, 1483L, 1484L, 1485L, 1486L, 1487L, 1488L, 1489L, 1490L, 
1491L, 1492L, 1493L), class = "data.frame")

格式 date 列(与上述相同的代码)

Format date column ( same code as above)

crsp$DATE <- as.Date(as.character(crsp$date),format="%Y%m%d")
data <- xts(crsp[,3:11], crsp$DATE)



更新



由@akrun生成的代码可以与示例数据有效地工作,但我遇到麻烦使用实际数据集:

UPDATE

Code produced by @akrun works efficiently with sample data but I am having trouble with actual dataset:

# this code works well on actual data
crsp1 <- crsp[,c(4, 6:12)];library(reshape2)

# this is the part where I am having difficulty with
crsp2 <- dcast(crsp1[,c(1:2, 8)], DATE~COMNAM, value.var="PRC")

它返回以下错误: 缺少聚合功能:默认为长度
这是 crsp2 返回的示例:

It returns the following error: Aggregation function missing: defaulting to length This is a sample of what crsp2 returns:

crsp2 <- structure(list(DATE = structure(c(-16072, -16070, -16068, -16067, 
-16066), class = "Date"), `A P W PAPER CO` = c(0, 0, 0, 0, 0), 
`ABITIBI POWER & PAPER LTD` = c(1, 1, 1, 1, 1), `ABRAHAM & STRAUSS INC` = c(0, 
0, 0, 0, 0), `ADAMS EXPRESS CO` = c(1, 1, 1, 1, 1)), .Names = c("DATE", 
"A P W PAPER CO", "ABITIBI POWER & PAPER LTD", "ABRAHAM & STRAUSS INC", 
"ADAMS EXPRESS CO"), row.names = c(NA, 5L), class = "data.frame")

格式,它返回所需的输出,但问题是它返回1和0而不是实际值...
我相信它返回1的数据被找到/启动,0是没有找到数据。我后来尝试指定聚合函数

As far as format , it returns the desired output, but the problem is that it returns 1's and 0's instead of actual values... I believe it returns 1's where data is found/starts, and 0's where no data is found. I later tried specifying the Aggregation function :

# Specification of aggregation function plus "fill" to try to fix length errors & "drop" to
# avoid values being dropped
crsp2 <- dcast(crsp1[,c(1:2, 8)], DATE ~ COMNAM, value.var="RET", fun.aggregate= length, fill=  
0);library(plyr);

这一次,我没有收到任何错误,但仍然得到1和0。也许最好通过 COMNAM 分隔数据,然后通过 DATE 列进行索引,以便我可以 merge() cbind()以获得所需的输出。但是我不知道是否有更好的方法?我也尝试了以下功能,但是让他们跑了6个多小时,回来,还在跑步......似乎不能让他们工作...

This time, I do not get any errors but still get the 1's and 0's. Perhaps it might be better to seperate the data by COMNAM and then index by DATE column, so that I can just merge() or cbind() to get the desired output. But i do not know if there is a better way? I have also tried the following functions but left them running for more than 6 hours and came back and where still running... cannot seem to make them work...

# daply() from library(plyr)
crsp2 <- daply(crsp1[,c(1:2, 8)], .(DATE,COMNAM), function(x) x$PRC, .parallel=TRUE)


# reshape()
crsp2 <- reshape(crsp1[,c(1:2, 8)], idvar="COMNAM", timevar="DATE", direction="wide") 


推荐答案

这里有两个与 xts 相关的问题:

There are two issues related to xts here:


  1. xts 假设索引的弱单调排序。允许重复的索引值,但是您不应该依赖重复值的排序(如果重复索引条目,则可能会抛出许多警告)。所以对于时间戳数据,我们使用帮助者,如 make.time.unique(),它们添加了最小的epsilon来分隔它们。

  1. xts assumes weakly monotonic ordering of the index. Duplicate index values are allowed, but you should not rely on the ordering of the duplicate values (and lots of warnings may be thrown if you have repeated index entries). So for timestamp data we use helpers like make.time.unique() which add the smallest possible epsilon to separate them.

xts 使用矩阵作为数据容器,这意味着您不能混合类型(例如,您只能具有数字数据,或仅包含字符数据等)

xts uses a matrix as the data container, which means you cannot mix types (e.g. you can only have numeric data, or only character data, etc).

所以这里,一个 data.frame 添加日期列或 data.table 可能会更好。

So here, a data.frame with an added Date column, or a data.table, may be better.

这篇关于如何在r中格式化具有重复日期的列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆