How to read whole file in one string
Question
I want to read a json or xml file in pyspark. My file is split across multiple lines in

rdd = sc.textFile(json or xml)
Input:
{
"employees":
[
{
"firstName":"John",
"lastName":"Doe"
},
{
"firstName":"Anna"
}
]
}
The input is spread over multiple lines.
Expected output: {"employees":[{"firstName":"John",......]}
How do I get the complete file as a single string using pyspark?
Please help me, I am new to Spark.
Answer
There are 3 ways (I invented the 3rd one, the first two are standard built-in Spark functions); the solutions here are in PySpark:
textFile, wholeTextFiles, and a labeled textFile (key = file, value = 1 line from the file; this is a kind of mix between the two given ways to parse files).
1.) textFile
Input:
rdd = sc.textFile('/home/folder_with_text_files/input_file')
output: an array containing one line of the file per entry, i.e. [line1, line2, ...]
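Since textFile splits the file on newlines, the simplest way to get the whole file back as one string is to join the lines again. A minimal sketch of that step, using a plain Python list standing in for the collected RDD lines (the sample JSON here is a completed version of the question's input, not real Spark output):

```python
import json

# Lines as sc.textFile would yield them, one entry per file line
# (a stand-in list here; with Spark you would use rdd.collect()).
lines = [
    '{',
    '"employees": [',
    '{"firstName": "John", "lastName": "Doe"},',
    '{"firstName": "Anna"}',
    ']',
    '}',
]

# Join everything back into a single string...
whole_file = ''.join(lines)

# ...which can then be parsed as JSON in one go.
data = json.loads(whole_file)
print(data['employees'][0]['firstName'])  # John
```

Note that collect() pulls the whole file to the driver, so this only makes sense for files that fit in driver memory.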
2.) wholeTextFiles
Input:
rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')
output: an array of tuples, where the first item is the "key" with the filepath and the second item contains one file's entire contents, i.e.
[(u'file:/home/folder_with_text_files/', u'file1_contents'), (u'file:/home/folder_with_text_files/', file2_contents), ...]
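Because wholeTextFiles already delivers each file as a single string, the value can be fed straight to a parser. A hedged sketch of that per-file step on one sample tuple (the path and contents below are made-up stand-ins for a wholeTextFiles entry; with Spark you would apply the same parse inside rdd.mapValues(json.loads)):

```python
import json

# A (filepath, contents) pair shaped like one wholeTextFiles entry.
record = (
    u'file:/home/folder_with_text_files/file1.json',   # hypothetical path
    u'{"employees": [{"firstName": "John", "lastName": "Doe"}]}',
)

path, contents = record
data = json.loads(contents)  # the whole file parsed in one call
print(path, data['employees'][0]['lastName'])
```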
3.) "Labeled" textFile
Input:
import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

# SparkContext.stop(sc)  # stop any already-running context first, if one exists
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Data_File = '/home/folder_with_text_files'  # folder containing the input files
Spark_Full = sc.emptyRDD()                  # start from an empty RDD and union each file into it
for filename in glob.glob(Data_File + "/*"):
    # bind filename as a default argument so each lambda keeps its own file's name
    Spark_Full += sc.textFile(filename).keyBy(lambda x, f=filename: f)
output: an array where each entry is a tuple of filename-as-key with value = each line of the file. (Technically, with this method you can also use a different key besides the actual filepath, perhaps a hash representation to save memory.) i.e.
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine these, either as a list of lines per file:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the filepath):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
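The groupByKey-and-join step above can be mimicked in plain Python to see exactly what it does to the (filename, line) pairs; a sketch using itertools.groupby on sample pairs shaped like the output shown earlier (groupby requires the pairs to already be sorted by key, which Spark's groupByKey does not):

```python
from itertools import groupby
from operator import itemgetter

# (filename, line) pairs as the labeled-textFile approach produces them.
pairs = [
    ('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
    ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
    ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
]

# Plain-Python equivalent of
# Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))):
# group by filename, then join each group's lines with spaces.
recombined = [
    (key, ' '.join(line for _, line in group))
    for key, group in groupby(pairs, key=itemgetter(0))
]
print(recombined[0])
# ('/home/folder_with_text_files/file1.txt', 'file1_contents_line1 file1_contents_line2')
```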