Creating Spark data structure from multiline record

Question

I'm trying to read a retrosheet event file into Spark. The event file is structured as follows.

id,TEX201403310
version,2
info,visteam,PHI
info,hometeam,TEX
info,site,ARL02
info,date,2014/03/31
info,number,0
info,starttime,1:07PM
info,daynight,day
info,usedh,true
info,umphome,joycj901
info,attendance,49031
start,reveb001,"Ben Revere",0,1,8
start,rollj001,"Jimmy Rollins",0,2,6
start,utlec001,"Chase Utley",0,3,4
start,howar001,"Ryan Howard",0,4,3
start,byrdm001,"Marlon Byrd",0,5,9
id,TEX201404010
version,2
info,visteam,PHI
info,hometeam,TEX

As you can see, the records loop back: each new game starts over with another id line.

I've read the file into an RDD and then, via a second for loop, added a key for each iteration, which appears to work. But I was hoping to get some feedback on whether there is a cleaner way to do this using Spark methods.

logFile = '2014TEX.EVA'
event_data = (sc
              .textFile(logFile)
              .collect())

# tag each line with a counter that increments at every 'id' record,
# so all lines belonging to one game share the same key
idKey = 0
newevent_list = []
for line in event_data:
    if line.startswith('id'):
        idKey += 1
    newevent_list.append((idKey, line))

event_data = sc.parallelize(newevent_list)

Answer

PySpark has supported Hadoop input formats since version 1.1. You can use the textinputformat.record.delimiter option to split records on a custom delimiter, as shown below:

from operator import itemgetter

retrosheet = sc.newAPIHadoopFile(
    '/path/to/retrosheet/file',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\nid,'}
)
(retrosheet
    .filter(itemgetter(1))   # drop (offset, text) pairs with empty text
    .values()                # keep only the text part of each pair
    .filter(lambda x: x)     # drop any remaining empty records
    .map(lambda v: (         # restore the 'id,' prefix eaten by the delimiter
        v if v.startswith('id') else 'id,{0}'.format(v)).splitlines()))
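Each element of the resulting RDD is the list of lines for one game. If you still want the (key, line) pairs the question was building, you can derive them from those lists. A minimal sketch, assuming the pipeline above is bound to a name like games (the zipWithIndex numbering is my own choice, not part of the original answer):

games = (retrosheet
    .filter(itemgetter(1))
    .values()
    .filter(lambda x: x)
    .map(lambda v: (
        v if v.startswith('id') else 'id,{0}'.format(v)).splitlines()))

# number the games, then flatten back to (game_index, line) pairs
keyed_lines = (games
    .zipWithIndex()
    .flatMap(lambda pair: [(pair[1], line) for line in pair[0]]))

keyed_lines.take(3)
# e.g. [(0, 'id,TEX201403310'), (0, 'version,2'), (0, 'info,visteam,PHI')]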

Since Spark 2.4 you can also read the data into a DataFrame using the text reader:

spark.read.option("lineSep", '\nid,').text('/path/to/retrosheet/file')
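As with the RDD version, every record after the first loses the id, prefix consumed by the delimiter, so you will likely want to restore it. A minimal sketch of that cleanup, using the standard pyspark.sql.functions column API (this post-processing step is my addition, not part of the original answer):

from pyspark.sql import functions as F

games_df = spark.read.option("lineSep", "\nid,").text("/path/to/retrosheet/file")

# re-attach the 'id,' prefix on every record that lost it to the delimiter
restored = games_df.select(
    F.when(F.col("value").startswith("id"), F.col("value"))
     .otherwise(F.concat(F.lit("id,"), F.col("value")))
     .alias("game"))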
