clojure - string-concat with group by in sequences of maps

Question

Given input data from a jdbc source such as this:

  (def input-data
    [{:doc_id 1 :doc_seq 1  :doc_content "this is a very long "}
    {:doc_id 1 :doc_seq 2  :doc_content "sentence from a mainframe "}
    {:doc_id 1 :doc_seq 3  :doc_content "system that was built before i was "}
    {:doc_id 1 :doc_seq 4  :doc_content "born."}
    {:doc_id 2 :doc_seq 1  :doc_content "this is a another very long "}
    {:doc_id 2 :doc_seq 2  :doc_content "sentence from the same mainframe "}
    {:doc_id 3 :doc_seq 1  :doc_content "Ok here we are again. "}
    {:doc_id 3 :doc_seq 2  :doc_content "The mainframe only had 40 char per field so"}
    {:doc_id 3 :doc_seq 3  :doc_content "they broke it into multiple rows "}
    {:doc_id 3 :doc_seq 4  :doc_content "which seems to be common"}
    {:doc_id 3 :doc_seq 5  :doc_content " for the time. "}
    {:doc_id 3 :doc_seq 6  :doc_content "thanks for your help."}])

I want to group by doc_id and string-concat the doc_content, so my output would look like this:

  [{:doc_id 1 :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
   {:doc_id 2 :doc_content "this is a another very long sentence ... clip..."}
   {:doc_id 3 :doc_content "... clip..."}]

I was thinking of using group-by; however, that outputs a map, and I need to output something lazy because the input data set could be very large. Maybe I could run group-by and some combination of reduce-kv to get what I'm looking for, or maybe something with frequencies if I can coerce it to be lazy.

I can guarantee that the input will be sorted: I will put an ORDER BY (through SQL) on doc_id and doc_seq, so the only thing this program is responsible for is the aggregate/string-concat part. The input for the whole sequence will likely be large, but any single doc_id in that sequence should span only a few dozen doc_seq rows.

Any hints would be appreciated.

Recommended answer

partition-by is lazy, so as long as the rows for each individual doc_id fit in memory, this should work:

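;; Merge all rows of one document into a single map.
;; String values (:doc_content) are concatenated in order;
;; for any other value (:doc_id, :doc_seq) the last row wins.
(defn collapse-docs [docs]
  (apply merge-with
         (fn [l r]
           (if (string? r)
             (str l r)
             r))
         docs))

(sequence ;; you may want to use eduction here, depending on use case
  (comp
    (partition-by :doc_id)  ; group consecutive rows with the same :doc_id
    (map collapse-docs))    ; collapse each group into one map
  input-data)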
=>
({:doc_id 1,
  :doc_seq 4,
  :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
  {:doc_id 2, :doc_seq 2, :doc_content "this is a another very long sentence from the same mainframe "}
  {:doc_id 3,
   :doc_seq 6,
   :doc_content "Ok here we are again. The mainframe only had 40 char per field sothey broke it into multiple rows which seems to be common for the time. thanks for your help."})
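
Note that the result above still carries :doc_seq (the last value seen for each document), whereas the question asked for maps with only :doc_id and :doc_content. The following is a minimal sketch of one way to get that exact shape, not part of the original answer. The names collapse-doc, collapse-xf and sample-rows are hypothetical, and it assumes, as stated in the question, that the rows arrive already ordered by doc_id and doc_seq.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: collapse one group of rows that share a :doc_id
;; into a map holding only :doc_id and the concatenated :doc_content.
(defn collapse-doc [rows]
  {:doc_id      (:doc_id (first rows))
   :doc_content (str/join (map :doc_content rows))})

;; Transducer: group consecutive rows by :doc_id, then collapse each group.
;; Assumption: the input is already sorted by :doc_id and :doc_seq.
(def collapse-xf
  (comp (partition-by :doc_id)
        (map collapse-doc)))

;; Small sample in the same shape as input-data.
(def sample-rows
  [{:doc_id 1 :doc_seq 1 :doc_content "this is a very long "}
   {:doc_id 1 :doc_seq 2 :doc_content "sentence."}
   {:doc_id 2 :doc_seq 1 :doc_content "another "}
   {:doc_id 2 :doc_seq 2 :doc_content "document."}])

;; Lazy sequence of collapsed documents.
(def collapsed (sequence collapse-xf sample-rows))

;; eduction variant: no intermediate sequence is kept; the transformation
;; runs each time the eduction is reduced. Here it just counts documents.
(def doc-count
  (reduce (fn [n _doc] (inc n)) 0 (eduction collapse-xf sample-rows)))
```

With sequence the collapsed documents can be consumed like any other lazy seq, while eduction may suit a single pass such as writing each document out as it is produced. In both cases only the rows of the group currently being built are buffered by partition-by, which matches the caveat that each document's rows must fit in memory.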
