Breaking a continuous variable into categories using dplyr and/or cut

Question

I have a dataset that records price changes, among other variables. I would like to mutate the price column into a categorical variable. I understand that the relevant tools in R seem to be the dplyr package and/or the cut function.

> head(btc_data)
                 time  btc_price
1 2017-08-27 22:50:00 4,389.6113
2 2017-08-27 22:51:00 4,389.0850
3 2017-08-27 22:52:00 4,388.8625
4 2017-08-27 22:53:00 4,389.7888
5 2017-08-27 22:56:00 4,389.9138
6 2017-08-27 22:57:00 4,390.1663
   

>dput(btc_data)
        ("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763", 
        "4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325", 
        "4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025", 
        "4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075", 
        "4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738", 
        "4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788", 
        "4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038", 
        "4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288", 
        "5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788", 
        "5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175", 
        "5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350", 
        "5,013.9075"), class = "factor")), .Names = c("time", "btc_price"
    ), class = "data.frame", row.names = c(NA, -10023L))

The difficulty lies in the categories I want to create. The categories -1, 0, 1 should be based on the % change over the previous time lag.

For example, a 20% increase in price over the past 60 minutes would be labeled 1, otherwise 0. A 20% decrease in price over the past 60 minutes would be labeled -1, otherwise 0.

Is this possible in R?

There is a similar question here and also here, but these do not answer my question for two reasons:


a) I am trying to calculate the % change, not simply the difference between 2 rows.

b) This calculation should be based on the max/min values for the rolling past time frame (i.e. a 20% decrease in the past hour = -1, a 20% increase in the past hour = 1).


Recommended answer

Working with percentages is always tricky. You need to be aware that everything about the calculation is flexible: whether you choose a difference, a running mean, a max or something else as the reference, you have at least two variables on the reference side that you must choose carefully. The same holds for the value you want to relate to that reference. Together this gives you an almost unlimited number of ways to calculate your percentage, and that is the key to your question.
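To make the point about references concrete, here is a minimal sketch that relates the same current price to three different references. The short `price` vector and all variable names are hypothetical and only serve as an illustration; they are not taken from the question's data.

```r
# Hypothetical short price series, for illustration only
price <- c(100, 102, 101, 105, 110)

current  <- price[length(price)]    # the value to be put in relation
previous <- price[-length(price)]   # everything before it

# Percent change of the current price against three different references
pct_vs_last <- (current / previous[length(previous)] - 1) * 100
pct_vs_mean <- (current / mean(previous) - 1) * 100
pct_vs_min  <- (current / min(previous) - 1) * 100

round(c(last = pct_vs_last, mean = pct_vs_mean, min = pct_vs_min), 2)
#  last  mean   min
#  4.76  7.84 10.00
```

The same price move comes out as roughly 5%, 8% or 10% depending on the reference, so the reference has to be fixed before any threshold such as 20% means anything.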

# create the data

dat <- c("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763", 
         "4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325", 
         "4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025", 
         "4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075", 
         "4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738", 
         "4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788", 
         "4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038", 
         "4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288", 
         "5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788", 
         "5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175", 
         "5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350", 
         "5,013.9075")
dat <- as.numeric(gsub(",","",dat))

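# calculate the difference to the previous minute
dd <- diff(dat)

# running ratio: latest difference relative to the mean difference of the last minutes
# note: for z == interval the index range starts at 0, which R silently drops,
# so the first window holds `interval` values and all later ones `interval + 1`
interval <- 20
out <- NULL
for(z in interval:length(dd)){
  out <- c(out, (dd[z] / mean(dd[(z-interval):z])))
}

# running ratio: price relative to the mean price of the last minutes
out2 <- NULL
for(z in interval:length(dd)){
  out2 <- c(out2, (dat[z] / mean(dat[(z-interval):z])))
}

# build categories for the difference ratio:
# up to 0.8 -> -1, above 0.8 up to 1.2 -> 0, above 1.2 -> 1
catego <- as.vector(cut(out, breaks=c(-Inf,0.8,1.2,Inf), labels=c(-1,0,1)))
# pad with NA so that catego has the same length as dat (out[1] belongs to dat[interval + 1])
catego <- c(rep(NA,interval), as.numeric(catego))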


# plot
plot(dat, type="b", main="price original")
plot(dd, main="absolute difference to last minute", type="b")
plot(out, main=paste('difference to last minute, relative to "mean" of the last', interval, 'min'), type="b")
abline(h=c(0.8, 1.2), col="magenta")
plot(catego, main=paste("categories for", interval))
plot(out2, main=paste('price last minute, relative to "mean" of the last', interval, 'min'), type="b")

I think what you are looking for is the calculation behind the last plot (price in the last minute, relative to the mean of the last 20 minutes). The values in this example vary between 1.0010 and 1.0025, far from the 0.8 and 1.2 you expect. You can widen the spread by choosing a longer interval than 20 minutes; a week (10080 minutes) might work, but even with such a long interval it will be hard to reach a value above 1.2. The problem is the high price level: at around 5000, a change of 10 is very small.
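As a sketch of how the labeling from the question could look with a fixed lag, the following base R function computes the percent change against the price `lag` steps earlier and bins it with `cut()`. The function name `label_change`, its parameters and the short example series are assumptions made for illustration. Note that with `right = FALSE` a change of exactly `-threshold` falls into the 0 category.

```r
# Label each observation by its percent change over `lag` steps.
# `label_change`, `lag` and `threshold` are illustrative names.
label_change <- function(price, lag = 60, threshold = 0.20) {
  # price `lag` steps earlier; NA where no earlier value exists
  earlier <- c(rep(NA_real_, lag), price)[seq_along(price)]
  change  <- price / earlier - 1

  # right = FALSE gives the intervals [-Inf, -t), [-t, t), [t, Inf)
  f <- cut(change,
           breaks = c(-Inf, -threshold, threshold, Inf),
           labels = c(-1, 0, 1),
           right  = FALSE)
  as.numeric(as.character(f))
}

# Hypothetical series with a lag of 2 steps instead of 60 minutes
label_change(c(100, 105, 125, 110, 95, 100), lag = 2)
# NA NA  1  0 -1  0
```

On data like the sample in the question, where the ratio stays close to 1, every label would be 0 at a 20% threshold, so `threshold` or `lag` would need to be adapted to the actual volatility.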

You also have to take into account that the price sample you gave rises continuously, so it is impossible to get a value below 1.

In this calculation I use mean() as the running reference over the last minutes. I am not sure, but I suspect that on stock markets both min() and max() are used as references over different time intervals: you choose min() as the reference when the price is rising and max() when it is falling. All of this is possible in R.
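A minimal sketch of that min()/max() idea, under stated assumptions: the current price is compared with the minimum and the maximum of the preceding `window` observations, and a rise takes precedence if both conditions happen to hold. The function name `label_min_max`, its parameters, the precedence rule and the example series are all hypothetical choices, not something given in the question.

```r
# Label each observation against the rolling min/max of the preceding window.
# `label_min_max`, `window` and `threshold` are illustrative names.
label_min_max <- function(price, window = 60, threshold = 0.20) {
  n <- length(price)
  label <- rep(NA_integer_, n)          # NA until a full window is available
  if (n <= window) return(label)

  for (i in (window + 1):n) {
    past <- price[(i - window):(i - 1)] # the `window` observations before i
    rise <- price[i] / min(past) - 1    # change relative to the recent low
    fall <- price[i] / max(past) - 1    # change relative to the recent high
    # assumption: a rise takes precedence when both thresholds are crossed
    label[i] <- if (rise >= threshold) 1L else if (fall <= -threshold) -1L else 0L
  }
  label
}

# Hypothetical series with a window of 2 observations instead of 60 minutes
label_min_max(c(100, 101, 130, 131, 132, 100, 101), window = 2)
# NA NA  1  1  0 -1 -1
```

A label stays at 1 or -1 for as long as the old low or high remains inside the window, which is a consequence of using the rolling extreme as the reference.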
