Converting OHLC stock data into a different timeframe with python and pandas
Question
Could someone please point me in the right direction with respect to OHLC data timeframe conversion with Pandas? What I'm trying to do is build a DataFrame with data for higher timeframes, given data for a lower timeframe.
For example, given I have the following one-minute (M1) data:
Open High Low Close Volume
Date
1999-01-04 10:22:00 1.1801 1.1819 1.1801 1.1817 4
1999-01-04 10:23:00 1.1817 1.1818 1.1804 1.1814 18
1999-01-04 10:24:00 1.1817 1.1817 1.1802 1.1806 12
1999-01-04 10:25:00 1.1807 1.1815 1.1795 1.1808 26
1999-01-04 10:26:00 1.1803 1.1806 1.1790 1.1806 4
1999-01-04 10:27:00 1.1801 1.1801 1.1779 1.1786 23
1999-01-04 10:28:00 1.1795 1.1801 1.1776 1.1788 28
1999-01-04 10:29:00 1.1793 1.1795 1.1782 1.1789 10
1999-01-04 10:31:00 1.1780 1.1792 1.1776 1.1792 12
1999-01-04 10:32:00 1.1788 1.1792 1.1788 1.1791 4
which has Open, High, Low, Close (OHLC) and Volume values for every minute, I would like to build a set of 5-minute readings (M5) that would look like this:
Open High Low Close Volume
Date
1999-01-04 10:25:00 1.1807 1.1815 1.1776 1.1789 91
1999-01-04 10:30:00 1.1780 1.1792 1.1776 1.1791 16
The workflow is:
- Open is the Open of the first row in the time window
- High is the highest High in the time window
- Low is the lowest Low
- Close is the last Close
- Volume is simply the sum of Volumes
But there are a few problems:
- the data has gaps (note there is no 10:30:00 row)
- the 5-minute intervals have to start at a round time, e.g. M5 starts at 10:25:00, not 10:22:00
- the first, incomplete set can either be omitted, as in this example, or included (so we could have a 10:20:00 5-minute entry)
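For experimentation, the M1 sample above can be reconstructed as a DataFrame along these lines (the variable name m1 is just for illustration):

```python
import pandas as pd

# Reconstruction of the one-minute sample data from the question.
m1 = pd.DataFrame(
    {
        "Open":   [1.1801, 1.1817, 1.1817, 1.1807, 1.1803, 1.1801, 1.1795, 1.1793, 1.1780, 1.1788],
        "High":   [1.1819, 1.1818, 1.1817, 1.1815, 1.1806, 1.1801, 1.1801, 1.1795, 1.1792, 1.1792],
        "Low":    [1.1801, 1.1804, 1.1802, 1.1795, 1.1790, 1.1779, 1.1776, 1.1782, 1.1776, 1.1788],
        "Close":  [1.1817, 1.1814, 1.1806, 1.1808, 1.1806, 1.1786, 1.1788, 1.1789, 1.1792, 1.1791],
        "Volume": [4, 18, 12, 26, 4, 23, 28, 10, 12, 4],
    },
    # Note the gap: there is no 10:30:00 row.
    index=pd.to_datetime(
        ["1999-01-04 10:22", "1999-01-04 10:23", "1999-01-04 10:24",
         "1999-01-04 10:25", "1999-01-04 10:26", "1999-01-04 10:27",
         "1999-01-04 10:28", "1999-01-04 10:29", "1999-01-04 10:31",
         "1999-01-04 10:32"]
    ),
)
m1.index.name = "Date"
print(m1["Volume"].sum())  # 141
```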
The Pandas documentation on up-down sampling gives an example, but it uses the mean as the value of the up-sampled row, which won't work here. I have tried using groupby and agg, but to no avail. Getting the highest High and the lowest Low might not be so hard, but I have no idea how to get the first Open and the last Close.
What I tried is something along the lines of:
grouped = slice.groupby(dr5minute.asof).agg(
    {'Low': lambda x: x.min()['Low'], 'High': lambda x: x.max()['High']}
)
but it results in the following error, which I don't understand:
In [27]: grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } )
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/work/python/fxcruncher/<ipython-input-27-df50f9522a2f> in <module>()
----> 1 grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } )
/usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in agg(self, func, *args, **kwargs)
242 See docstring for aggregate
243 """
--> 244 return self.aggregate(func, *args, **kwargs)
245
246 def _iterate_slices(self):
/usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, arg, *args, **kwargs)
1153 colg = SeriesGroupBy(obj[col], column=col,
1154 grouper=self.grouper)
-> 1155 result[col] = colg.aggregate(func)
1156
1157 result = DataFrame(result)
/usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)
906 return self._python_agg_general(func_or_funcs, *args, **kwargs)
907 except Exception:
--> 908 result = self._aggregate_named(func_or_funcs, *args, **kwargs)
909
910 index = Index(sorted(result), name=self.grouper.names[0])
/usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in _aggregate_named(self, func, *args, **kwargs)
976 grp = self.get_group(name)
977 grp.name = name
--> 978 output = func(grp, *args, **kwargs)
979 if isinstance(output, np.ndarray):
980 raise Exception('Must produce aggregated value')
/work/python/fxcruncher/<ipython-input-27-df50f9522a2f> in <lambda>(x)
----> 1 grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } )
IndexError: invalid index to scalar variable.
So any help on doing this would be greatly appreciated. If the path I chose is not going to work, please suggest another relatively efficient approach (I have millions of rows). Some resources on using Pandas for financial processing would also be nice.
Answer
Your approach is sound, but fails because each function in the dict-of-functions passed to agg() receives a Series object reflecting the column matched by its key. It is therefore not necessary to filter on the column label again. With this, and assuming groupby preserves order, you can slice the Series to extract the first/last element of the Open/Close columns. (Note: the groupby documentation does not claim to preserve the order of the original data series, but it appears to in practice.)
In [50]: df.groupby(dr5minute.asof).agg({'Low': lambda s: s.min(),
                                         'High': lambda s: s.max(),
                                         'Open': lambda s: s[0],
                                         'Close': lambda s: s[-1],
                                         'Volume': lambda s: s.sum()})
Out[50]:
Close High Low Open Volume
key_0
1999-01-04 10:20:00 1.1806 1.1819 1.1801 1.1801 34
1999-01-04 10:25:00 1.1789 1.1815 1.1776 1.1807 91
1999-01-04 10:30:00 1.1791 1.1792 1.1776 1.1780 16
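As a side note, pandas also accepts the built-in aggregator names 'first' and 'last' in the dict, which sidesteps the order-preservation question entirely. A sketch under that assumption, where pd.Grouper stands in for the dr5minute.asof trick and m1 is an illustrative name for the minute-level frame:

```python
import pandas as pd

# Sketch using pandas' named aggregators instead of positional lambdas.
# pd.Grouper bins the DatetimeIndex into round 5-minute windows.
def to_m5(m1: pd.DataFrame) -> pd.DataFrame:
    return m1.groupby(pd.Grouper(freq="5min")).agg(
        {"Open": "first", "High": "max", "Low": "min",
         "Close": "last", "Volume": "sum"}
    )
```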
For reference, here is a table summarizing the expected input and output types of an aggregation function, based on the groupby object type and on how the aggregation function(s) are passed to agg():
GroupBy object     passed to agg()   agg func accepts   agg func returns   agg() result
SeriesGroupBy      function          Series             value              Series
SeriesGroupBy      dict-of-funcs     Series             value              DataFrame, columns match dict keys
SeriesGroupBy      list-of-funcs     Series             value              DataFrame, columns match func names
DataFrameGroupBy   function          DataFrame          Series/dict/ary    DataFrame, columns match original DataFrame
DataFrameGroupBy   dict-of-funcs     Series             value              DataFrame, columns match dict keys (keys must be columns of the original DataFrame)
DataFrameGroupBy   list-of-funcs     Series             value              DataFrame, MultiIndex columns (original cols x func names)
From the table above, if an aggregation requires access to more than one column, the only option is to pass a single function to a DataFrameGroupBy object. Therefore, an alternate way to accomplish the original task is to define a function like the following:
def ohlcsum(df):
    df = df.sort_index()  # ensure the rows are in chronological order
    return {
        'Open': df['Open'].iloc[0],      # first Open in the window
        'High': df['High'].max(),
        'Low': df['Low'].min(),
        'Close': df['Close'].iloc[-1],   # last Close in the window
        'Volume': df['Volume'].sum()
    }
and apply agg() with it:
In [30]: df.groupby(dr5minute.asof).agg(ohlcsum)
Out[30]:
Open High Low Close Volume
key_0
1999-01-04 10:20:00 1.1801 1.1819 1.1801 1.1806 34
1999-01-04 10:25:00 1.1807 1.1815 1.1776 1.1789 91
1999-01-04 10:30:00 1.1780 1.1792 1.1776 1.1791 16
Though pandas may offer some cleaner built-in magic in the future, hopefully this explains how to work with today's agg() capabilities.
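Later pandas releases did add exactly that built-in magic: resample() combined with named aggregators. A sketch assuming a recent pandas version, where resample_ohlcv and m1 are illustrative names:

```python
import pandas as pd

# Sketch of the resample-based equivalent available in later pandas versions.
# 'm1' stands for the minute-level OHLCV frame from the question.
def resample_ohlcv(m1: pd.DataFrame, freq: str = "5min") -> pd.DataFrame:
    m5 = m1.resample(freq).agg(
        {"Open": "first", "High": "max", "Low": "min",
         "Close": "last", "Volume": "sum"}
    )
    # resample() emits a row for every bucket in the range, including empty
    # ones created by gaps in the source data; drop those here.
    return m5.dropna(subset=["Open"])
```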