Column data to nested json object in Spark structured streaming


Problem description

In our application we obtain the field values as columns using Spark SQL. I'm trying to figure out how to put the column values into a nested JSON object and push it to Elasticsearch. Also, is there a way to parameterise the values in selectExpr that are passed to the regex (a sketch follows the code below)?

We are currently using the Spark Java API.

// Split the pipe-delimited value and give each field its own named column
Dataset<Row> data = rowExtracted.selectExpr("split(value,\"[|]\")[0] as channelId",
                "split(value,\"[|]\")[1] as country",
                "split(value,\"[|]\")[2] as product",
                "split(value,\"[|]\")[3] as sourceId",
                "split(value,\"[|]\")[4] as systemId",
                "split(value,\"[|]\")[5] as destinationId",
                "split(value,\"[|]\")[6] as batchId",
                "split(value,\"[|]\")[7] as orgId",
                "split(value,\"[|]\")[8] as businessId",
                "split(value,\"[|]\")[9] as orgAccountId",
                "split(value,\"[|]\")[10] as orgBankCode",
                "split(value,\"[|]\")[11] as beneAccountId",
                "split(value,\"[|]\")[12] as beneBankId",
                "split(value,\"[|]\")[13] as currencyCode",
                "split(value,\"[|]\")[14] as amount",
                "split(value,\"[|]\")[15] as processingDate",
                "split(value,\"[|]\")[16] as status",
                "split(value,\"[|]\")[17] as rejectCode",
                "split(value,\"[|]\")[18] as stageId",
                "split(value,\"[|]\")[19] as stageStatus",
                "split(value,\"[|]\")[20] as stageUpdatedTime",
                "split(value,\"[|]\")[21] as receivedTime",
                "split(value,\"[|]\")[22] as sendTime"
        );
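
On the parameterisation part of the question: the delimiter regex and the field names can be kept as data and the expressions generated, instead of being written out 23 times by hand. A minimal sketch, where fieldNames and delimiterRegex are hypothetical names introduced here for illustration:

import java.util.stream.IntStream;

// Hypothetical field list; the order must match the positions in the pipe-delimited record
String[] fieldNames = {"channelId", "country", "product", "sourceId", "systemId",
        "destinationId", "batchId", "orgId", "businessId", "orgAccountId",
        "orgBankCode", "beneAccountId", "beneBankId", "currencyCode", "amount",
        "processingDate", "status", "rejectCode", "stageId", "stageStatus",
        "stageUpdatedTime", "receivedTime", "sendTime"};
String delimiterRegex = "[|]"; // the regex is now a parameter

// Build one "split(value,'<regex>')[i] as <name>" expression per field
String[] exprs = IntStream.range(0, fieldNames.length)
        .mapToObj(i -> String.format("split(value,'%s')[%d] as %s", delimiterRegex, i, fieldNames[i]))
        .toArray(String[]::new);

Dataset<Row> data = rowExtracted.selectExpr(exprs);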

StreamingQuery query = data.writeStream()
                .outputMode(OutputMode.Append())
                .format("es")
                .option("checkpointLocation", "C:\\checkpoint")
                .start("spark_index/doc");

Actual output:

{
  "_index": "spark_index",
  "_type": "doc",
  "_id": "test123",
  "_version": 1,
  "_score": 1,
  "_source": {
    "channelId": "test",
    "country": "SG",
    "product": "test",
    "sourceId": "",
    "systemId": "test123",
    "destinationId": "",
    "batchId": "",
    "orgId": "test",
    "businessId": "test",
    "orgAccountId": "test",
    "orgBankCode": "",
    "beneAccountId": "test",
    "beneBankId": "test",
    "currencyCode": "SGD",
    "amount": "53.0000",
    "processingDate": "",
    "status": "Pending",
    "rejectCode": "test",
    "stageId": "123",
    "stageStatus": "Comment",
    "stageUpdatedTime": "2019-08-05 18:11:05.999000",
    "receivedTime": "2019-08-05 18:10:12.701000",
    "sendTime": "2019-08-05 18:11:06.003000"
  }
}

We need the above columns under a node "txn_summary" such as the below json:

Expected output:

{
  "_index": "spark_index",
  "_type": "doc",
  "_id": "test123",
  "_version": 1,
  "_score": 1,
  "_source": {
    "txn_summary": {
      "channelId": "test",
      "country": "SG",
      "product": "test",
      "sourceId": "",
      "systemId": "test123",
      "destinationId": "",
      "batchId": "",
      "orgId": "test",
      "businessId": "test",
      "orgAccountId": "test",
      "orgBankCode": "",
      "beneAccountId": "test",
      "beneBankId": "test",
      "currencyCode": "SGD",
      "amount": "53.0000",
      "processingDate": "",
      "status": "Pending",
      "rejectCode": "test",
      "stageId": "123",
      "stageStatus": "Comment",
      "stageUpdatedTime": "2019-08-05 18:11:05.999000",
      "receivedTime": "2019-08-05 18:10:12.701000",
      "sendTime": "2019-08-05 18:11:06.003000"
    }
  }
}

Recommended answer

Adding all columns to a top level struct should give the expected output. In Scala:

import org.apache.spark.sql.functions.{col, struct}

// data.columns is Array[String]; map each name to a Column before splatting into struct
data.select(struct(data.columns.map(col): _*).as("txn_summary"))

In Java I would suspect it would be:

import static org.apache.spark.sql.functions.*;
import java.util.Arrays;
import org.apache.spark.sql.Column;

// data.columns() returns String[]; struct expects Columns, so convert the names first
Column[] cols = Arrays.stream(data.columns()).map(n -> col(n)).toArray(Column[]::new);
Dataset<Row> nested = data.select(struct(cols).as("txn_summary"));
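
Wiring that back into the streaming write from the question then just means starting from the nested dataset; a sketch reusing the same sink options as above:

StreamingQuery query = nested.writeStream()
        .outputMode(OutputMode.Append())
        .format("es")
        .option("checkpointLocation", "C:\\checkpoint")
        .start("spark_index/doc");
// Block the driver so the streaming query keeps running
query.awaitTermination();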
