How to read a fixed character length format file in Spark
Problem description

The data looks like this:
[Row(_c0='ACW00011604 17.1167 -61.7833 10.1 ST JOHNS COOLIDGE FLD '),
Row(_c0='ACW00011647 17.1333 -61.7833 19.2 ST JOHNS '),
Row(_c0='AE000041196 25.3330 55.5170 34.0 SHARJAH INTER. AIRP GSN 41196')]
I have defined schema_stn with the correct column widths etc. as per the documentation. My code for reading it into a dataframe using PySpark is as follows:
df = sqlContext.read.csv("hdfs:////data/stn")
df = (sqlContext.read.format("csv")
.schema(schema_stn)
.option("delimiter", " ")
.load("hdfs:////data/stn")
)
df.cache()
df.show(3)
I get the following output:
In [62]: df.show(3)
+-----------+--------+---------+---------+--------+-------+--------+------------+------+
| ID|LATITUDE|LONGITUDE|ELEVATION| STATE| NAME|GSN FLAG|HCN/CRN FLAG|WMO ID|
+-----------+--------+---------+---------+--------+-------+--------+------------+------+
|ACW00011604| null| 17.1167| null|-61.7833| null| null| 10.1| null|
|ACW00011647| null| 17.1333| null|-61.7833| null| null| 19.2| null|
|AE000041196| null| 25.333| null| null|55.5170| null| null| 34.0|
+-----------+--------+---------+---------+--------+-------+--------+------------+------+
I am not able to remove these 'null' values (which represent the whitespace). What am I missing here?
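For context, the nulls come from the delimiter-based read: splitting a fixed-width record on a single space turns every run of padding spaces into empty fields, which the csv reader loads as null and shifts values into the wrong columns. A plain-Python sketch of the same effect (the spacing in the sample line is approximated, not the exact file layout):

```python
# Splitting a fixed-width record on a single space: every extra
# padding space produces an empty field in the result.
line = "ACW00011604  17.1167  -61.7833   10.1"
fields = line.split(" ")
print(fields)
# → ['ACW00011604', '', '17.1167', '', '-61.7833', '', '', '10.1']
```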
Answer
You need to read the file as lines of text; otherwise the delimiter is wrong:
df = spark.read.text("hdfs:////data/stn")
Then parse it. Note that Column.substr(startPos, length) takes a starting position and a length, not a start and end position, so the second argument must be the field width:

df = df.select(
    df.value.substr(1, 11).alias('ID'),            # chars 1-11
    df.value.substr(13, 8).alias('LATITUDE'),      # chars 13-20
    df.value.substr(22, 9).alias('LONGITUDE'),     # chars 22-30
    df.value.substr(32, 6).alias('ELEVATION'),     # chars 32-37
    df.value.substr(39, 2).alias('STATE'),         # chars 39-40
    df.value.substr(42, 30).alias('NAME'),         # chars 42-71
    df.value.substr(73, 3).alias('GSN_FLAG'),      # chars 73-75
    df.value.substr(77, 3).alias('HCN_CRN_FLAG'),  # chars 77-79
    df.value.substr(81, 5).alias('WMO_ID'))        # chars 81-85
df.show(3)
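Since substr keeps the padding inside each field, you will usually still want to trim the columns (and cast the numeric ones). The 1-based (start, length) slicing can also be cross-checked in plain Python; the layout below mirrors the first four fields above, and the sample line's spacing is reconstructed for illustration:

```python
# (name, start, length) with a 1-based start, mirroring
# Column.substr(startPos, length); trailing fields omitted.
LAYOUT = [("ID", 1, 11), ("LATITUDE", 13, 8),
          ("LONGITUDE", 22, 9), ("ELEVATION", 32, 6)]

def parse_line(line):
    # Slice each field out of the fixed-width record and strip the
    # padding, the analogue of F.trim() on the substr columns above.
    return {name: line[start - 1:start - 1 + length].strip()
            for name, start, length in LAYOUT}

row = parse_line("ACW00011604  17.1167  -61.7833   10.1")
print(row["LATITUDE"], row["ELEVATION"])  # → 17.1167 10.1
```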