Reading a text file with multiple headers in Spark

Views: 71 | Posted: 2020/9/4 22:13:42 | Tags: apache-spark pyspark apache-spark-sql

Question

I have a text file with multiple header lines, where the "TEMP" column holds the average temperature for the day, followed by the number of recordings. How can I read this text file properly to create a DataFrame?

STN--- WBAN   YEARMODA    TEMP     
010010 99999  20060101    33.5 23
010010 99999  20060102    35.3 23
010010 99999  20060103    34.4 24
STN--- WBAN   YEARMODA    TEMP     
010010 99999  20060120    35.2 22
010010 99999  20060121    32.2 21
010010 99999  20060122    33.0 22

Recommended answer

You can read the text file as a normal text file into an RDD.
The fields are delimited by a separator; let's assume it is a space.
Then you can take the header from the first line.
Remove every line that is equal to the header, so only data lines remain.
Then convert the RDD to a DataFrame using .toDF(col_names).

Like this:

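# Assumes an existing SparkContext `sc` and an active SparkSession (toDF needs one).
# split() with no argument splits on runs of whitespace; the sample pads fields with several spaces.
rdd = sc.textFile("path/to/file.txt").map(lambda x: x.split())  # steps 1 & 2
headers = rdd.first()  # step 3: the first line is the header
rdd2 = rdd.filter(lambda x: x != headers)  # step 4: drop every repeated header line
# Data rows carry one more field than the header (the recording count after TEMP);
# "TEMP_COUNT" is an assumed name for that extra column.
df = rdd2.toDF(headers + ["TEMP_COUNT"])  # step 5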
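
The same idea can be sketched with small named helper functions, which makes the header filter and the type casts easy to check without a cluster. This is only a sketch under stated assumptions: the helper names (`is_header`, `parse_row`, `parse_lines`, `to_dataframe`) and the column name `TEMP_COUNT` are hypothetical and do not come from the original answer, and it assumes every header line starts with `STN` and every data row has exactly five whitespace-separated fields, as in the sample above.

```python
from typing import List, Optional, Tuple

# Assumed column names; "TEMP_COUNT" is a made-up name for the recording
# count that follows TEMP and has no header of its own in the file.
COLUMNS = ["STN", "WBAN", "YEARMODA", "TEMP", "TEMP_COUNT"]

Row = Tuple[str, str, str, float, int]


def is_header(line: str) -> bool:
    """True for any repeated header line (assumed to start with 'STN')."""
    return line.lstrip().startswith("STN")


def parse_row(line: str) -> Optional[Row]:
    """Split on runs of whitespace and cast; None for blank or malformed lines."""
    parts = line.split()
    if len(parts) != len(COLUMNS):
        return None
    stn, wban, yearmoda, temp, count = parts
    try:
        return (stn, wban, yearmoda, float(temp), int(count))
    except ValueError:
        return None


def parse_lines(lines: List[str]) -> List[Row]:
    """Pure-Python version of the filter/map pipeline, handy for testing."""
    rows = (parse_row(line) for line in lines if not is_header(line))
    return [row for row in rows if row is not None]


def to_dataframe(spark, path: str):
    """Same pipeline on Spark; `spark` is assumed to be an existing SparkSession."""
    rdd = (
        spark.sparkContext.textFile(path)
        .filter(lambda line: not is_header(line))
        .map(parse_row)
        .filter(lambda row: row is not None)
    )
    return rdd.toDF(COLUMNS)


sample = [
    "STN--- WBAN   YEARMODA    TEMP     ",
    "010010 99999  20060101    33.5 23",
    "STN--- WBAN   YEARMODA    TEMP     ",
    "010010 99999  20060120    35.2 22",
]
print(parse_lines(sample))
```

With a running session, `to_dataframe(spark, "path/to/file.txt")` would apply the same functions to the file. Station and WBAN identifiers are kept as strings so that leading zeros such as `010010` are not lost.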
