How to read a fixed character length format file in Spark


Problem description

The data looks like this:

[Row(_c0='ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD                       '),
 Row(_c0='ACW00011647  17.1333  -61.7833   19.2    ST JOHNS                                    '),
 Row(_c0='AE000041196  25.3330   55.5170   34.0    SHARJAH INTER. AIRP            GSN     41196')]

I have defined schema_stn with the correct column widths etc. as per the documentation. My code for reading it into a dataframe using pyspark is as follows:

df.select(
    df.value.substr(1, 11).alias('id'),
    df.value.substr(13, 20).alias('LATITUDE'),
    df.value.substr(22, 30).alias('LONGITUDE'),
    df.value.substr(32, 37).alias('LATITUDE'),
    df.value.substr(39, 40).alias('LONGITUDE'),
    df.value.substr(42, 71).alias('LATITUDE'),
    df.value.substr(73, 75).alias('LONGITUDE'),
    df.value.substr(77, 79).alias('LATITUDE'),
    df.value.substr(81, 85).alias('LONGITUDE'))

df = sqlContext.read.csv("hdfs:////data/stn") 
df = (sqlContext.read.format("csv")
        .schema(schema_stn)
        .option("delimiter", " ")
        .load("hdfs:////data/stn")
        )
df.cache()
df.show(3)

I get the following output:

In [62]: df.show(3)
+-----------+--------+---------+---------+--------+-------+--------+------------+------+
|         ID|LATITUDE|LONGITUDE|ELEVATION|   STATE|   NAME|GSN FLAG|HCN/CRN FLAG|WMO ID|
+-----------+--------+---------+---------+--------+-------+--------+------------+------+
|ACW00011604|    null|  17.1167|     null|-61.7833|   null|    null|        10.1|  null|
|ACW00011647|    null|  17.1333|     null|-61.7833|   null|    null|        19.2|  null|
|AE000041196|    null|   25.333|     null|    null|55.5170|    null|        null|  34.0|
+-----------+--------+---------+---------+--------+-------+--------+------------+------+

I am not able to remove these 'null' values (which represent the whitespace). What am I missing here, please?

Answer

You need to read the file as lines of text; otherwise the delimiter is wrong. With a single-space delimiter, every run of padding spaces between the fixed-width fields produces an extra empty column, and those empty columns are what show up as null.

df = spark.read.text("hdfs:////data/stn") 
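
read.text produces a DataFrame with a single string column named value, which is what the substr calls below slice up. A quick check confirms the structure:

df.printSchema()
# root
#  |-- value: string (nullable = true)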

Then parse each line by position. Note that Column.substr takes a start position and a length, not an end position, so the (start, end) ranges from the fixed-width layout are converted to lengths below, and each field gets a distinct alias taken from the expected header:

df = df.select(
    df.value.substr(1, 11).alias('ID'),            # columns 1-11
    df.value.substr(13, 8).alias('LATITUDE'),      # columns 13-20
    df.value.substr(22, 9).alias('LONGITUDE'),     # columns 22-30
    df.value.substr(32, 6).alias('ELEVATION'),     # columns 32-37
    df.value.substr(39, 2).alias('STATE'),         # columns 39-40
    df.value.substr(42, 30).alias('NAME'),         # columns 42-71
    df.value.substr(73, 3).alias('GSN_FLAG'),      # columns 73-75
    df.value.substr(77, 3).alias('HCN_CRN_FLAG'),  # columns 77-79
    df.value.substr(81, 5).alias('WMO_ID'))        # columns 81-85
df.show(3)
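
The slices still carry their fixed-width padding, so as a final step you can trim them and cast the numeric fields. This is a minimal cleanup sketch, not part of the original answer, using the column names aliased above:

from pyspark.sql import functions as F

# Cleanup sketch (not in the original answer): strip the padding spaces
# and cast the numeric fields to double.
df = (df
      .withColumn('LATITUDE', F.trim(F.col('LATITUDE')).cast('double'))
      .withColumn('LONGITUDE', F.trim(F.col('LONGITUDE')).cast('double'))
      .withColumn('ELEVATION', F.trim(F.col('ELEVATION')).cast('double'))
      .withColumn('NAME', F.trim(F.col('NAME'))))
df.show(3)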
