Complex data manipulation in Clojure


Problem description

I'm working on a personal market analysis project. I've got a data structure representing all the recent turning points in the market, that looks like this:

[{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}
 {:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
 {:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}
 {:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}
 {:high 1.12215, :time "2016-08-02T23:00:00.000000Z"}
 {:high 1.12273, :time "2016-08-02T21:15:00.000000Z"}
 {:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}
 {:low 1.119215, :time "2016-08-02T12:30:00.000000Z"}
 {:low 1.118755, :time "2016-08-02T12:00:00.000000Z"}
 {:low 1.117575, :time "2016-08-02T06:00:00.000000Z"}
 {:low 1.117135, :time "2016-08-02T04:30:00.000000Z"}
 {:low 1.11624, :time "2016-08-02T02:00:00.000000Z"}
 {:low 1.115895, :time "2016-08-01T21:30:00.000000Z"}
 {:low 1.11552, :time "2016-08-01T11:45:00.000000Z"}
 {:low 1.11049, :time "2016-07-29T12:15:00.000000Z"}
 {:low 1.108825, :time "2016-07-29T08:30:00.000000Z"}
 {:low 1.10839, :time "2016-07-29T08:00:00.000000Z"}
 {:low 1.10744, :time "2016-07-29T05:45:00.000000Z"}
 {:low 1.10716, :time "2016-07-28T19:30:00.000000Z"}
 {:low 1.10705, :time "2016-07-28T18:45:00.000000Z"}
 {:low 1.106875, :time "2016-07-28T18:00:00.000000Z"}
 {:low 1.10641, :time "2016-07-28T05:45:00.000000Z"}
 {:low 1.10591, :time "2016-07-28T01:45:00.000000Z"}
 {:low 1.10579, :time "2016-07-27T23:15:00.000000Z"}
 {:low 1.105275, :time "2016-07-27T22:00:00.000000Z"}
 {:low 1.096135, :time "2016-07-27T18:00:00.000000Z"}]

Conceptually, I want to match up :high/:low pairs, work out the price range (high-low) and midpoint (average of high & low), but I don't want every possible pair to be generated.

What I want to do is start from the 1st item in the collection {:high 1.121455, :time "2016-08-03T05:15:00.000000Z"} and walk "down" through the remainder of the collection, creating a pair with every :low item UNTIL I hit the next :high item. Once I hit that next :high item, I'm not interested in any further pairs. In this case, there's only a single pair created, which is the :high and the 1st :low - I stop there because the next (3rd) item is a :high. The 1 generated record should look like {:price-range 0.000365, :midpoint 1.121272, :extremes [{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}]}

Next, I'd move onto the 2nd item in the collection {:low 1.12109, :time "2016-08-03T05:15:00.000000Z"} and walk "down" through the remainder of the collection, creating a pair with every :high item UNTIL I hit the next :low item. In this case, I get 5 new records generated, being the :low and the next 5 :high items which are all consecutive; the first of these 5 records would look like

{:price-range 0.00064, :midpoint 1.12141, :extremes [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}{:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}]}

the second of these 5 records would look like

{:price-range 0.000835, :midpoint 1.1215075, :extremes [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}{:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}]}

and so on.

After that, I get a :low so I stop there.

Then I'd move onto the 3rd item {:high 1.12173, :time "2016-08-03T04:30:00.000000Z"} and walk "down" creating pairs with every :low UNTIL I hit the next :high. In this case, I get 0 pairs generated, because the :high is followed immediately by another :high. Same for the next 3 :high items, which are all followed immediately by another :high

Next I get to the 7th item {:high 1.12338, :time "2016-08-02T18:15:00.000000Z"} and that should generate a pair with each of the following 19 :low items.

My generated result would be a list of all the pairs created:

[{:price-range 0.000365, :midpoint 1.121272, :extremes [{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}]}
 {:price-range 0.00064, :midpoint 1.12141, :extremes [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}{:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}]}
 ...

If I was implementing this using something like Python, I'd probably use a couple of nested loops, use a break to exit the inner loop when I stopped seeing :highs to pair with my :low and vice-versa, and accumulate all the generated records into an array as I traversed the 2 loops. I just can't work out a good way to attack it using Clojure...

Any ideas?

Solution

First of all, you can rephrase the problem as follows:

  1. You have to find all the boundary points, where a :high is followed by a :low or vice versa.
  2. You need to take the item just before each boundary and combine it with every item after that boundary, up to the next boundary.

For simplicity, let's use the following data model:

(def data0 [{:a 1} {:b 2} {:b 3} {:b 4} {:a 5} {:a 6} {:a 7}])

The first part can be achieved with the partition-by function, which splits the input collection every time the value returned by the given function changes from one item to the next:

user> (def step1 (partition-by (comp boolean :a) data0))
#'user/step1
user> step1
(({:a 1}) ({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7}))
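The (comp boolean :a) wrapper matters here. As a small sketch on the same data0, this is what happens if the bare keyword :a is used as the partitioning function instead: the groups are then split on the value stored under :a rather than on whether :a is present.

```clojure
(def data0 [{:a 1} {:b 2} {:b 3} {:b 4} {:a 5} {:a 6} {:a 7}])

;; Keyed on the raw value of :a, every distinct value starts a new group,
;; so the trailing run of :a items is broken into single-item groups.
(partition-by :a data0)
;; => (({:a 1}) ({:b 2} {:b 3} {:b 4}) ({:a 5}) ({:a 6}) ({:a 7}))

;; Keyed on the presence of :a (true/false), consecutive items of the
;; same kind stay together.
(partition-by (comp boolean :a) data0)
;; => (({:a 1}) ({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7}))
```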

Now you need to take every two adjacent groups and manipulate them. The groups should look like this: [({:a 1}) ({:b 2} {:b 3} {:b 4})] [({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7})]

This is achieved with the partition function, using a window size of 2 and a step of 1:

user> (def step2 (partition 2 1 step1))
#'user/step2
user> step2
((({:a 1}) ({:b 2} {:b 3} {:b 4})) 
 (({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7})))
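To make the role of the two numeric arguments concrete, here is a small sketch with placeholder keywords (:g1, :g2 and so on stand in for the groups and are purely illustrative). It also shows the edge case where the input contains only one kind of item.

```clojure
;; Window size 2, step 1: the window moves one item at a time,
;; so neighbouring windows overlap.
(partition 2 1 [:g1 :g2 :g3 :g4])
;; => ((:g1 :g2) (:g2 :g3) (:g3 :g4))

;; Without the step argument the windows do not overlap.
(partition 2 [:g1 :g2 :g3 :g4])
;; => ((:g1 :g2) (:g3 :g4))

;; A single group has no neighbour, so no windows are produced.
(partition 2 1 [:g1])
;; => ()
```

The last case means that a collection consisting only of :high items (or only of :low items) yields no pairs, which matches the behaviour described in the question.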

you have to do something for every pair of groups. You could do it with map:

user> (def step3 (map (fn [[lbounds rbounds]]
                    (map #(vector (last lbounds) %)
                         rbounds))
                  step2))
#'user/step3
user> step3
(([{:a 1} {:b 2}] [{:a 1} {:b 3}] [{:a 1} {:b 4}]) 
 ([{:b 4} {:a 5}] [{:b 4} {:a 6}] [{:b 4} {:a 7}]))

But since you need the concatenated list rather than the grouped one, you would want to use mapcat instead of map:

user> (def step3 (mapcat (fn [[lbounds rbounds]]
                           (map #(vector (last lbounds) %)
                                rbounds))
                         step2))
#'user/step3
user> step3
([{:a 1} {:b 2}] 
 [{:a 1} {:b 3}] 
 [{:a 1} {:b 4}] 
 [{:b 4} {:a 5}] 
 [{:b 4} {:a 6}] 
 [{:b 4} {:a 7}])

That's the result we want (almost, since we generate vectors instead of maps).

Now you could tidy it up with the threading macro:

(->> data0
     (partition-by (comp boolean :a))
     (partition 2 1)
     (mapcat (fn [[lbounds rbounds]]
               (map #(vector (last lbounds) %)
                    rbounds))))

which gives you exactly the same result.
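If the pipeline is going to be reused, one option is to wrap it in a function that takes the "which kind is this item?" predicate as an argument. This is only a sketch; the name boundary-pairs is made up for illustration and is not part of the original answer.

```clojure
(defn boundary-pairs
  "Hypothetical helper: pairs the last item of each run with every item
  of the following run. Runs are maximal stretches of consecutive items
  for which pred is consistently truthy or consistently falsey."
  [pred coll]
  (->> coll
       (partition-by (comp boolean pred))
       (partition 2 1)
       (mapcat (fn [[lgroup rgroup]]
                 (map #(vector (last lgroup) %) rgroup)))))

(def data0 [{:a 1} {:b 2} {:b 3} {:b 4} {:a 5} {:a 6} {:a 7}])

(boundary-pairs :a data0)
;; => ([{:a 1} {:b 2}] [{:a 1} {:b 3}] [{:a 1} {:b 4}]
;;     [{:b 4} {:a 5}] [{:b 4} {:a 6}] [{:b 4} {:a 7}])
```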

Applied to your data, it would look almost the same (with a different result-generating fn):

user> (defn hi-or-lo [item]
        (item :high (item :low)))
#'user/hi-or-lo
user> 
(->> data
     (partition-by (comp boolean :high))
     (partition 2 1)
     (mapcat (fn [[lbounds rbounds]]
               (let [left-bound (last lbounds)
                     left-val (hi-or-lo left-bound)]
                 (map #(let [right-val (hi-or-lo %)
                             diff (Math/abs (- right-val left-val))]
                         {:extremes [left-bound %]
                          :price-range diff
                          :midpoint (+ (min right-val left-val)
                                       (/ diff 2))})
                      rbounds))))
     (clojure.pprint/pprint))

It prints the following:

({:extremes
  [{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}
   {:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}],
  :price-range 3.6500000000017074E-4,
  :midpoint 1.1212725}
 {:extremes
  [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
   {:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}],
  :price-range 6.399999999999739E-4,
  :midpoint 1.12141}
 {:extremes
  [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
   {:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}],
  :price-range 8.350000000001412E-4,
  :midpoint 1.1215074999999999}
 {:extremes
  [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
   {:high 1.12215, :time "2016-08-02T23:00:00.000000Z"}],
  :price-range 0.001060000000000061,
  :midpoint 1.12162}
 {:extremes
  [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
   {:high 1.12273, :time "2016-08-02T21:15:00.000000Z"}],
  :price-range 0.0016400000000000858,
  :midpoint 1.12191}
 {:extremes
  [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
   {:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}],
  :price-range 0.0022900000000001253,
  :midpoint 1.1222349999999999}
 {:extremes
  [{:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}
   {:low 1.119215, :time "2016-08-02T12:30:00.000000Z"}],
  :price-range 0.004164999999999974,
  :midpoint 1.1212975}
 {:extremes
  [{:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}
   {:low 1.118755, :time "2016-08-02T12:00:00.000000Z"}],
  :price-range 0.004625000000000101,
  :midpoint 1.1210675}
 ...
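For checking the numbers by hand, here is a self-contained sketch of the same pipeline on the first few turning points. The names sample, price, make-record and turning-point-pairs are illustrative assumptions, and the midpoint is computed as a plain average, which is arithmetically equivalent to the min-plus-half-range form above (up to floating-point rounding).

```clojure
;; A five-item excerpt of the data from the question (the last item is
;; taken from further down the list to close the run of highs).
(def sample
  [{:high 1.121455 :time "2016-08-03T05:15:00.000000Z"}
   {:low  1.12109  :time "2016-08-03T05:15:00.000000Z"}
   {:high 1.12173  :time "2016-08-03T04:30:00.000000Z"}
   {:high 1.121925 :time "2016-08-03T00:00:00.000000Z"}
   {:low  1.119215 :time "2016-08-02T12:30:00.000000Z"}])

(defn price
  "Returns the :high or :low value of a turning point."
  [item]
  (or (:high item) (:low item)))

(defn make-record
  "Builds one result map from a pair of turning points."
  [left right]
  (let [a (price left)
        b (price right)]
    {:extremes    [left right]
     :price-range (Math/abs (double (- a b)))
     :midpoint    (/ (+ a b) 2)}))

(defn turning-point-pairs [items]
  (->> items
       (partition-by (comp boolean :high))
       (partition 2 1)
       (mapcat (fn [[lgroup rgroup]]
                 (let [left (last lgroup)]
                   (map #(make-record left %) rgroup))))))

;; Four records are produced: high1/low2, low2/high3, low2/high4, high4/low5.
(count (turning-point-pairs sample))
;; => 4
```

Because the prices are doubles, the computed ranges carry small rounding errors (as in the printed output above), so any comparison against expected values should use a tolerance rather than exact equality.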

As for the question about "complex data manipulation", I would advise you to look through all the collection-manipulating functions in clojure.core, and then try to decompose any task into applications of those. There are not many cases where you need something beyond them.
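As one more illustration of that kind of decomposition, the nested-loop-with-break idea from the question can also be written almost literally with core functions: the outer loop becomes a walk over successive tails of the collection, and the break becomes take-while. This is an alternative sketch, not the approach used above; the names kind, pairs-from and all-pairs are hypothetical.

```clojure
(defn kind
  "Classifies a turning point as :high or :low."
  [item]
  (if (contains? item :high) :high :low))

(defn pairs-from
  "Pairs the first item with each following item of the opposite kind,
  stopping at the first following item of the same kind."
  [[x & more]]
  (->> more
       (take-while #(not= (kind x) (kind %)))
       (map #(vector x %))))

(defn all-pairs
  "Applies pairs-from to every tail of items and concatenates the results."
  [items]
  (mapcat pairs-from (take-while seq (iterate rest items))))

(all-pairs [{:high 3} {:low 1} {:high 4} {:high 5} {:low 2}])
;; => ([{:high 3} {:low 1}] [{:low 1} {:high 4}]
;;     [{:low 1} {:high 5}] [{:high 5} {:low 2}])
```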
