如何在r中格式化具有重复日期的列 [英] how to format a column with duplicate dates in r
问题描述
我有一个相当大的数据集保存在 .txt
文件中,我使用 fread()
from data.table
。我想组织数据,看起来像这样:
I have a fairly large dataset saved in a .txt
file, that i read into my environment using fread()
from data.table
. I would like to organize the data so that it looks like this:
# Desired output: Column headers = [ Date, Company Names, RET, RETX,PRC, vwretd, ewrtd, sprtrn]
# rows contain stock return data... (I made up the values below to save time, but the format must remain).
Date AMERICAN CAR & FDRY CO ALASKA JUNEAU GOLD MNG C RET RETX vwretd ...
1925-12-31 0.0432 0.0231 0.0032 0.053 ...
1926-01-02 0.0210 0.0133 0.0124 0.003 ...
1926-01-04 0.0324 0.0131 0.0134 0.023 ...
... ... ... ... ... ...
唯一的问题是我遇到麻烦,因为我的 crsp $ date
列有重复的日期
(SEE REPRODUCIBLE DATA),因为数据似乎是行绑定
The only problem is that I am having trouble because my crsp$date
column has duplicate Dates
(SEE REPRODUCIBLE DATA), as the data seems to be row binded
这是我使用 fread()
crsp <- fread("/Volumes/share/CRSP/CRSP_RETS/1925-1930.txt", sep="\t", header=TRUE)
crsp <- as.data.frame(crsp)
导入后,我运行以下格式,以格式化 date
列
After import I run the following to format date
column
crsp$DATE <- as.Date(as.character(crsp$date),format="%Y%m%d")
然后我尝试转换为XTS
I then try to convert to XTS
data <- xts(crsp[,3:11], crsp$DATE)
data
返回以下内容:[ dput(data) code>]
data
returns the following: [ dput(data)
]
data <- structure(c(NA, NA, NA, NA, NA, NA, "AMERICAN CAR & FDRY CO",
"AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO", "ALASKA JUNEAU GOLD MNG CO",
"AMERICAN CAR & FDRY CO", "ALASKA JUNEAU GOLD MNG CO", "A", "A",
"A", "A", "A", "A", "109.0000", "109.1250", "111.0000", " 1.8750",
"110.5000", " 1.7500", "C", "0.001147", "0.017182", "C", "-0.004505",
"-0.066667", "C", "0.001147", "0.017182", "C", "-0.004505", "-0.066667",
NA, " 0.005893", " 0.001277", " 0.001277", "-0.003984", "-0.003984",
NA, " 0.009516", " 0.005780", " 0.005780", "-0.001927", "-0.001927",
NA, NA, NA, NA, NA, NA), .indexCLASS = "Date", tclass = "Date", .indexTZ = "UTC", tzone = "UTC",
class = c("xts",
"zoo"), index = structure(c(-1388620800, -1388448000, -1388275200,
-1388275200, -1388188800, -1388188800), tzone = "UTC", tclass = "Date"), .Dim = c(6L,
9L), .Dimnames = list(NULL, c("TICKER", "COMNAM", "TRDSTAT",
"PRC", "RET", "RETX", "vwretd", "ewretd", "sprtrn")))
可重复数据
REPRODUCIBLE DATA
crsp <- structure(list(PERMNO = c(10006L, 10006L, 10006L, 10006L, 10006L,
10006L, 10006L, 10006L, 10006L, 10014L, 10014L, 10014L, 10014L,
10014L, 10014L, 10014L, 10014L), date = c(19251231L, 19260102L,
19260104L, 19260105L, 19260106L, 19260107L, 19301229L, 19301230L,
19301231L, 19260104L, 19260105L, 19260106L, 19260107L, 19260108L,
19260109L, 19260111L, 19260112L), TICKER = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), COMNAM = c("AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO",
"AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO",
"AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO", "AMERICAN CAR & FDRY CO",
"AMERICAN CAR & FDRY CO", "ALASKA JUNEAU GOLD MNG CO", "ALASKA JUNEAU GOLD MNG CO",
"ALASKA JUNEAU GOLD MNG CO", "ALASKA JUNEAU GOLD MNG CO", "ALASKA JUNEAU GOLD MNG CO",
"ALASKA JUNEAU GOLD MNG CO", "ALASKA JUNEAU GOLD MNG CO", "ALASKA JUNEAU GOLD MNG CO"
), TRDSTAT = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A"), PRC = c(109, 109.125, 111,
110.5, 110.5, 110.5, 24.25, 24.5, 27.5, 1.875, 1.75, 1.75, -1.8125,
1.75, -1.6875, 1.625, NA), RET = c("C", "0.001147", "0.017182",
"-0.004505", "0.000000", "0.000000", "-0.034826", "0.010309",
"0.122449", "C", "-0.066667", "0.000000", "0.035714", "-0.034483",
"-0.035714", "-0.037037", ""), RETX = c("C", "0.001147", "0.017182",
"-0.004505", "0.000000", "0.000000", "-0.034826", "0.010309",
"0.122449", "C", "-0.066667", "0.000000", "0.035714", "-0.034483",
"-0.035714", "-0.037037", ""), vwretd = c(NA, 0.005893, 0.001277,
-0.003984, -0.000172, 0.007211, -0.002367, 0.020064, 0.016505,
0.001277, -0.003984, -0.000172, 0.007211, -0.000804, 0.003384,
-0.00868, -0.001027), ewretd = c(NA, 0.009516, 0.00578, -0.001927,
0.001182, 0.008453, -0.017949, 0.016016, 0.052093, 0.00578, -0.001927,
0.001182, 0.008453, -0.001689, 0.003312, -0.009943, -0.003623
), sprtrn = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_)), .Names = c("PERMNO",
"date", "TICKER", "COMNAM", "TRDSTAT", "PRC", "RET", "RETX",
"vwretd", "ewretd", "sprtrn"), row.names = c(1L, 2L, 3L, 4L,
5L, 6L, 1483L, 1484L, 1485L, 1486L, 1487L, 1488L, 1489L, 1490L,
1491L, 1492L, 1493L), class = "data.frame")
格式 date
列(与上述相同的代码)
Format date
column ( same code as above)
crsp$DATE <- as.Date(as.character(crsp$date),format="%Y%m%d")
data <- xts(crsp[,3:11], crsp$DATE)
更新
由@akrun生成的代码可以与示例数据有效地工作,但我遇到麻烦使用实际数据集:
UPDATE
Code produced by @akrun works efficiently with sample data but I am having trouble with actual dataset:
# this code works well on actual data
crsp1 <- crsp[,c(4, 6:12)];library(reshape2)
# this is the part where I am having difficulty with
crsp2 <- dcast(crsp1[,c(1:2, 8)], DATE~COMNAM, value.var="PRC")
它返回以下错误: 缺少聚合功能:默认为长度
这是 crsp2
返回的示例:
It returns the following error: Aggregation function missing: defaulting to length
This is a sample of what crsp2
returns:
crsp2 <- structure(list(DATE = structure(c(-16072, -16070, -16068, -16067,
-16066), class = "Date"), `A P W PAPER CO` = c(0, 0, 0, 0, 0),
`ABITIBI POWER & PAPER LTD` = c(1, 1, 1, 1, 1), `ABRAHAM & STRAUSS INC` = c(0,
0, 0, 0, 0), `ADAMS EXPRESS CO` = c(1, 1, 1, 1, 1)), .Names = c("DATE",
"A P W PAPER CO", "ABITIBI POWER & PAPER LTD", "ABRAHAM & STRAUSS INC",
"ADAMS EXPRESS CO"), row.names = c(NA, 5L), class = "data.frame")
至格式
,它返回所需的输出,但问题是它返回1和0而不是实际值...
我相信它返回1的数据被找到/启动,0是没有找到数据。我后来尝试指定聚合函数
:
As far as format
, it returns the desired output, but the problem is that it returns 1's and 0's instead of actual values...
I believe it returns 1's where data is found/starts, and 0's where no data is found. I later tried specifying the Aggregation function
:
# Specification of aggregation function plus "fill" to try to fix length errors & "drop" to
# avoid values being dropped
crsp2 <- dcast(crsp1[,c(1:2, 8)], DATE ~ COMNAM, value.var="RET", fun.aggregate= length, fill=
0);library(plyr);
这一次,我没有收到任何错误,但仍然得到1和0。也许最好通过 COMNAM
分隔数据,然后通过 DATE
列进行索引,以便我可以 merge()
或 cbind()
以获得所需的输出。但是我不知道是否有更好的方法?我也尝试了以下功能,但是让他们跑了6个多小时,回来,还在跑步......似乎不能让他们工作...
This time, I do not get any errors but still get the 1's and 0's. Perhaps it might be better to seperate the data by COMNAM
and then index by DATE
column, so that I can just merge()
or cbind()
to get the desired output. But i do not know if there is a better way? I have also tried the following functions but left them running for more than 6 hours and came back and where still running... cannot seem to make them work...
# daply() from library(plyr)
crsp2 <- daply(crsp1[,c(1:2, 8)], .(DATE,COMNAM), function(x) x$PRC, .parallel=TRUE)
# reshape()
crsp2 <- reshape(crsp1[,c(1:2, 8)], idvar="COMNAM", timevar="DATE", direction="wide")
推荐答案
这里有两个与 xts
相关的问题:
There are two issues related to xts
here:
-
xts
假设索引的弱单调排序。允许重复的索引值,但是您不应该依赖重复值的排序(如果重复索引条目,则可能会抛出许多警告)。所以对于时间戳数据,我们使用帮助者,如make.time.unique()
,它们添加了最小的epsilon来分隔它们。
xts
assumes weakly monotonic ordering of the index. Duplicate index values are allowed, but you should not rely on the ordering of the duplicate values (and lots of warnings may be thrown if you have repeated index entries). So for timestamp data we use helpers likemake.time.unique()
which add the smallest possible epsilon to separate them.
xts
使用矩阵作为数据容器,这意味着您不能混合类型(例如,您只能具有数字数据,或仅包含字符数据等)
xts
uses a matrix as the data container, which means you cannot mix types (e.g. you can only have numeric data, or only character data, etc).
所以这里,一个 data.frame
添加日期
列或 data.table
可能会更好。
So here, a data.frame
with an added Date
column, or a data.table
, may be better.
这篇关于如何在r中格式化具有重复日期的列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!