How to read whole file in one string


Question

I want to read a json or xml file in pyspark. My file is split across multiple lines in

rdd = sc.textFile("path/to/json_or_xml")

Input

{
 "employees": [
  {
   "firstName": "John",
   "lastName": "Doe"
  },
  {
   "firstName": "Anna"
  }
 ]
}

The input is spread over multiple lines.

Expected output: {"employees":[{"firstName":"John",......]}

How can I get the complete file as a single string using pyspark?

Please help me, I am new to Spark.

Answer

There are 3 ways (I invented the 3rd one; the first two are standard built-in Spark functions). The solutions here are in PySpark:

textFile, wholeTextFiles, and a "labeled" textFile (key = file path, value = one line of the file; this is a kind of mix between the two built-in ways to parse files).

1.) textFile

Input: rdd = sc.textFile('/home/folder_with_text_files/input_file')

Output: an array containing one line of the file per entry, i.e. [line1, line2, ...]
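The limitation this creates for the question above can be seen without a Spark cluster: each record textFile hands back is one raw line, and a lone line of a multi-line JSON file is not parseable on its own. A plain-Python sketch (the sample string below is made up for illustration):

```python
import json

# One record per line is analogous to what sc.textFile returns for this file;
# the sample JSON string here is made up for illustration.
sample = '{\n "employees": [\n  {"firstName": "John"}\n ]\n}'
lines = sample.splitlines()      # per-line records, like textFile's output

try:
    json.loads(lines[0])         # "{" alone is not valid JSON
    parsed = True
except json.JSONDecodeError:
    parsed = False

print(parsed)                    # → False: a single line cannot be parsed
```

Parsing only succeeds once the lines are joined back into one string, which is what the remaining two approaches provide.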

2.) wholeTextFiles

Input: rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')

Output: an array of tuples; the first item is the "key" with the file path, the second item contains one file's entire contents, i.e.

[(u'file:/home/folder_with_text_files/', u'file1_contents'), (u'file:/home/folder_with_text_files/', file2_contents), ...]
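For the original question this is the most direct route: each value is already the whole file as one string. The shape of those (path, contents) pairs can be sketched in plain Python without a Spark cluster (the folder, file names, and contents below are made up for illustration):

```python
import os
import tempfile

# Build a throwaway folder with two small files; names/contents are made up.
folder = tempfile.mkdtemp()
for name, text in [("file1.txt", "line1\nline2\n"), ("file2.txt", "only line\n")]:
    with open(os.path.join(folder, name), "w") as f:
        f.write(text)

# wholeTextFiles('/folder/*') returns pairs shaped like these:
# (file path, entire file contents as one string)
pairs = sorted(
    (os.path.join(folder, name), open(os.path.join(folder, name)).read())
    for name in os.listdir(folder)
)
print(pairs[0][1])  # the whole first file as a single string
```

Note that newlines are preserved inside each value, so a multi-line JSON or XML file stays intact for a later parse.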

3.) "Labeled" textFile

Input:

import glob
from pyspark import SparkContext

sc = SparkContext("local", "example")  # if running locally

Data_File = "/home/folder_with_text_files"  # folder holding the input files
Spark_Full = sc.emptyRDD()                  # start from an empty RDD

for filename in glob.glob(Data_File + "/*"):
    # bind filename via a default argument so each lambda keeps its own value
    Spark_Full += sc.textFile(filename).keyBy(lambda x, f=filename: f)

Output: an array where each entry is a tuple of filename-as-key with value = each line of the file. (Technically, with this method you can also use a different key besides the actual file path, perhaps a hash representation to save memory.) I.e.

[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
 ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
  ...]


You can also recombine it, either as a list of lines:

Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()

[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
 ('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]

Or recombine entire files back into single strings (in this example the result is close to what you get from wholeTextFiles, except that the ' '.join replaces newlines with spaces, and the "file:" prefix is stripped from the file path; use '\n'.join to keep the original line breaks):

Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
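Once a file has been recombined into one string by any of these routes, it can be parsed normally; for the JSON case in the question that means a single json.loads call. A plain-Python sketch (the string below stands in for a collected value and is made up for illustration):

```python
import json

# Suppose the whole file has been collected into one string, e.g. via
# wholeTextFiles or the groupByKey recipe above (contents made up).
file_as_one_string = """
{
 "employees": [
  {"firstName": "John", "lastName": "Doe"},
  {"firstName": "Anna"}
 ]
}
"""
data = json.loads(file_as_one_string)
print(data["employees"][0]["firstName"])  # → John
```

The parse works even though the original file spanned many lines, which is exactly what the per-line textFile read could not do.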
