Parsing a text file to split at specific positions using pyspark


Question

I have a text file which is not delimited by any character, and I want to split it at specific positions so that I can convert it to a dataframe. Example data in file1.txt below:

1JITENDER33
2VIRENDER28
3BIJENDER37

I want to split the file so that positions 0 to 1 go into the first column, positions 2 to 9 go into the second column, and positions 10 to 11 go into the third column, so that I can finally convert it into a Spark dataframe.

Answer

You can use the Python code below to read your input file and delimit it using the csv writer; you can then read the result into a dataframe or load it into your Hive external table.

vikrant> cat inputfile
1JITENDER33
2VIRENDER28
3BIJENDER37

import csv

fname_in = '/u/user/vikrant/inputfile'
fname_out = '/u/user/vikrant/outputfile.csv'
cols = [(0, 1), (1, 9), (9, 11)]  # (start, end) slice positions for each column

with open(fname_in) as fin, open(fname_out, 'wt') as fout:
    writer = csv.writer(fout, delimiter=",", lineterminator="\n")
    for line in fin:
        line = line.rstrip()  # remove the '\n' and any other trailing whitespace
        data = [line[c[0]:c[1]] for c in cols]
        print("data:", data)
        writer.writerow(data)


vikrant> cat outputfile.csv
1,JITENDER,33
2,VIRENDER,28
3,BIJENDER,37

You can also turn this code into a function on some Python class, then import that class into your pyspark application code to transform your plain text file into CSV format. Let me know in case you need more help on this.
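The function-wrapping idea above might look like the following sketch (the function name and parameter names are mine, not from the answer):

```python
import csv


def fixed_width_to_csv(fname_in, fname_out, cols):
    """Slice each line of a fixed-width file at the given (start, end)
    positions and write the pieces out as one CSV row per input line."""
    with open(fname_in) as fin, open(fname_out, "wt") as fout:
        writer = csv.writer(fout, delimiter=",", lineterminator="\n")
        for line in fin:
            line = line.rstrip()  # drop the newline and trailing whitespace
            writer.writerow([line[start:end] for start, end in cols])
```

A pyspark application could then import this helper, call it to produce the CSV, and hand the resulting file to `spark.read.csv`.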

