Not able to read JSON files: Spark Structured Streaming using Java


Question

I have a Python script that fetches stock data (as below) from the NYSE every minute into a new file (a single line). It contains data for 4 stocks - MSFT, ADBE, GOOGL and FB - in the JSON format below:

[{"symbol": "MSFT", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "126.0800", "high": "126.1000", "low": "126.0500", "close": "126.0750", "volume": "57081"}}, {"symbol": "ADBE", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "279.2900", "high": "279.3400", "low": "279.2600", "close": "279.3050", "volume": "12711"}}, {"symbol": "GOOGL", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "1166.4100", "high": "1166.7400", "low": "1166.2900", "close": "1166.7400", "volume": "8803"}}, {"symbol": "FB", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "192.4200", "high": "192.5000", "low": "192.3600", "close": "192.4800", "volume": "33490"}}]

I'm trying to read this file stream into a Spark Streaming dataframe, but I'm not able to define the proper schema for it. After looking around the internet, this is what I have done so far:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;



public class Driver1 {

    public static void main(String args[]) throws InterruptedException, StreamingQueryException {


        SparkSession session = SparkSession.builder().appName("Spark_Streaming").master("local[2]").getOrCreate();
        Logger.getLogger("org").setLevel(Level.ERROR);


        StructType priceData = new StructType()
                .add("open", DataTypes.DoubleType)
                .add("high", DataTypes.DoubleType)
                .add("low", DataTypes.DoubleType)
                .add("close", DataTypes.DoubleType)
                .add("volume", DataTypes.LongType);

        StructType schema = new StructType()
                .add("symbol", DataTypes.StringType)
                .add("timestamp", DataTypes.StringType)
                .add("stock", priceData);


        Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
        rawData.printSchema();
        rawData.writeStream().format("console").start().awaitTermination();
        session.close();        

    }

}

The output I get is this -

root
 |-- symbol: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- stock: struct (nullable = true)
 |    |-- open: double (nullable = true)
 |    |-- high: double (nullable = true)
 |    |-- low: double (nullable = true)
 |    |-- close: double (nullable = true)
 |    |-- volume: long (nullable = true)

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------------------+-----+
|symbol|          timestamp|stock|
+------+-------------------+-----+
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
+------+-------------------+-----+

I have even tried first reading the JSON string as a text file and then applying the schema (the way it is done with Kafka streaming)...

Dataset<Row> rawData = session.readStream().format("text").load("/home/abhinavrawat/streamingData/data/*");
Dataset<Row> raw2 = rawData.select(org.apache.spark.sql.functions.from_json(rawData.col("value"), schema));
raw2.writeStream().format("console").start().awaitTermination();

In this case I get the output below; the rawData dataframe has the JSON data in string format, and the parsed jsontostructs(value) column is all null:

+--------------------+
|jsontostructs(value)|
+--------------------+
|                null|
|                null|
|                null|
|                null|
|                null|

Please help me figure this out.

Answer

Just figured it out. Keep the following two things in mind -

  1. While defining the schema, make sure the field names and their order are exactly the same as in your JSON file.

  2. Initially, use only StringType for all your fields; you can apply a transformation afterwards to change them back to specific data types.
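For instance, a minimal sketch of such a cast on a single field, assuming the stream has already been read with the all-string schema shown further below (the variable names here are illustrative):

import static org.apache.spark.sql.functions.col;

// Cast the string-typed "volume" subfield back to a long.
Dataset<Row> typed = rawData.withColumn(
        "volume", col("priceData.volume").cast(DataTypes.LongType));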

This worked for me -

    // All fields are StringType to start with; the numeric values in the
    // JSON are quoted strings, so they can be cast later.
    StructType priceData = new StructType()
            .add("open", DataTypes.StringType)
            .add("high", DataTypes.StringType)
            .add("low", DataTypes.StringType)
            .add("close", DataTypes.StringType)
            .add("volume", DataTypes.StringType);

    // "priceData" matches the field name in the JSON (the earlier schema used "stock").
    StructType schema = new StructType()
            .add("symbol", DataTypes.StringType)
            .add("timestamp", DataTypes.StringType)
            .add("priceData", priceData);


    Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
    rawData.writeStream().format("console").start().awaitTermination();
    session.close();

See the output -

+------+-------------------+--------------------+
|symbol|          timestamp|           priceData|
+------+-------------------+--------------------+
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
+------+-------------------+--------------------+

You can now flatten the priceData column using priceData.open, priceData.close, and so on, as sketched below.
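A hedged sketch of that flattening step, assuming the all-string schema above; the casts back to numeric types are illustrative, not part of the original answer:

import static org.apache.spark.sql.functions.col;

// Flatten the nested priceData struct into top-level columns and
// restore the numeric types that were read in as strings.
Dataset<Row> flattened = rawData.select(
        col("symbol"),
        col("timestamp"),
        col("priceData.open").cast(DataTypes.DoubleType).alias("open"),
        col("priceData.high").cast(DataTypes.DoubleType).alias("high"),
        col("priceData.low").cast(DataTypes.DoubleType).alias("low"),
        col("priceData.close").cast(DataTypes.DoubleType).alias("close"),
        col("priceData.volume").cast(DataTypes.LongType).alias("volume"));

flattened.writeStream().format("console").start().awaitTermination();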

