How to read whole file in one string
Problem description
I want to read a JSON or XML file in pyspark. If my file is split across multiple lines in

rdd = sc.textFile(json_or_xml)
Input
{
  "employees":
  [
    {
      "firstName": "John",
      "lastName": "Doe"
    },
    {
      "firstName": "Anna"
    }
  ]
}
The input is spread across multiple lines.
Expected output: {"employees:[{"firstName:"John",......]}

How to get the complete file in a single line using pyspark?
Recommended answer
There are 3 ways (I invented the 3rd one; the first two are standard built-in Spark functions). The solutions here are in PySpark:
textFile, wholeTextFiles, and a labeled textFile (key = file, value = 1 line from file; this is a kind of mix between the two standard ways of parsing files).
1.) textFile

input: rdd = sc.textFile('/home/folder_with_text_files/input_file')
output: array containing 1 line of the file as each entry, i.e. [line1, line2, ...]
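Since textFile gives one record per line, you still have to join the collected records to get the whole file as one string. A minimal sketch in plain Python (no Spark), using a hypothetical sample of lines standing in for what rdd.collect() would return:

```python
# Hypothetical lines, as textFile(...).collect() would return them
lines = ['{', '"employees":', '[', '{"firstName":"John"}', ']', '}']

# Join the per-line records back into a single string
whole_file = "".join(lines)
print(whole_file)  # → {"employees":[{"firstName":"John"}]}
```

On a real RDD the equivalent would be joining the result of rdd.collect() on the driver, which only makes sense for files small enough to fit in driver memory.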
2.) wholeTextFiles

input: rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')
output: array of tuples; the first item is the "key" with the filepath, and the second item contains 1 file's entire contents, i.e.
[(u'file:/home/folder_with_text_files/', u'file1_contents'), (u'file:/home/folder_with_text_files/', u'file2_contents'), ...]
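Since wholeTextFiles hands you each file's entire contents as one string, you can parse it directly. A minimal sketch using Python's json module on a hypothetical whole-file string (the second element of each tuple), without Spark:

```python
import json

# Hypothetical whole-file contents, as the value side of a wholeTextFiles tuple
file_contents = '{"employees": [{"firstName": "John", "lastName": "Doe"}]}'

# Parse the single string into a Python dict
data = json.loads(file_contents)
print(data["employees"][0]["firstName"])  # → John
```

In a real job you would apply the same parsing per record, e.g. rdd.mapValues(json.loads), so the parsing runs on the executors rather than the driver.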
3.) "Labeled" textFile

input:
import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

SparkContext.stop(sc)
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Spark_Full = sc.emptyRDD()  # accumulator RDD; must exist before the loop
Data_File = "/home/folder_with_text_files"  # path to your folder of files

for filename in glob.glob(Data_File + "/*"):
    # bind filename as a default argument so each lambda keeps its own file's name
    Spark_Full += sc.textFile(filename).keyBy(lambda x, f=filename: f)
output: array with each entry containing a tuple using the filename as key, with value = each line of the file. (Technically, with this method you can also use a different key besides the actual filepath name, perhaps a hashed representation to save memory.) i.e.
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine as a list of lines:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the filepath):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
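The groupByKey-and-join step above can be sketched in plain Python (no Spark) to show exactly what it produces, using hypothetical (filename, line) pairs like those the labeled-textFile approach yields:

```python
from collections import defaultdict

# Hypothetical (filename, line) pairs, as produced by keyBy in option 3
pairs = [
    ('/data/file1.txt', 'line1'),
    ('/data/file1.txt', 'line2'),
    ('/data/file2.txt', 'lineA'),
]

# Equivalent of groupByKey: collect all lines per filename
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Equivalent of the map(' '.join) step: one whole-file string per filename
recombined = {key: ' '.join(lines) for key, lines in grouped.items()}
print(recombined)  # → {'/data/file1.txt': 'line1 line2', '/data/file2.txt': 'lineA'}
```

Note that joining with ' ' replaces the original newlines with spaces; join with '\n' instead if you need to preserve line boundaries.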
This concludes the article on how to read a whole file into one string. We hope the recommended answer helps, and thank you for supporting IT屋!