How to read whole file in one string


Question

I want to read a JSON or XML file in PySpark, but my file is split across multiple lines when read with

rdd = sc.textFile(json_or_xml_path)

Input

{
  "employees": [
    {
      "firstName": "John",
      "lastName": "Doe"
    },
    {
      "firstName": "Anna"
    }
  ]
}

The input is spread over multiple lines.

Expected output: {"employees":[{"firstName":"John",...]}

How do I get the complete file in a single line using PySpark?

Recommended answer

There are 3 ways (I invented the 3rd one; the first two are standard built-in Spark functions). The solutions here are in PySpark:

textFile, wholeTextFiles, and a "labeled" textFile (key = file, value = one line from the file; this is a mix between the two standard ways to parse files).

1.) textFile

Input: rdd = sc.textFile('/home/folder_with_text_files/input_file')

Output: an array containing one line of the file per entry, i.e. [line1, line2, ...]
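As a minimal sketch of that behavior (the path is a placeholder, and the printed fragments assume the employees JSON from the question), no single element is parseable JSON on its own:

# Sketch: textFile splits on newlines, so multi-line JSON comes back
# as one string fragment per line.
rdd = sc.textFile('/home/folder_with_text_files/input_file')
print(rdd.collect())
# ['{', '  "employees": [', '    {', '      "firstName": "John",', ...]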

2.) wholeTextFiles

Input: rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')

Output: an array of tuples where the first item is the "key" containing the file path and the second item contains one file's entire contents, i.e.

[(u'file:/home/folder_with_text_files/', u'file1_contents'), (u'file:/home/folder_with_text_files/', u'file2_contents'), ...]
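Since the question is about multi-line JSON, here is a hedged sketch of using wholeTextFiles to answer it directly (the glob path is a placeholder, and it assumes the employees JSON above is the only file in that folder):

import json

# wholeTextFiles yields (path, whole_contents) pairs, so the entire
# multi-line JSON arrives as one string and parses in one call.
path, contents = sc.wholeTextFiles('/home/folder_with_text_files/*').first()
data = json.loads(contents)
print(data['employees'][0]['firstName'])   # 'John'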

3.) "Labeled" textFile

Input:

import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

SparkContext.stop(sc)
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Spark_Full = sc.emptyRDD()  # start empty, then union in one keyed RDD per file
for filename in glob.glob(Data_File + "/*"):  # Data_File = folder containing the text files
    # bind filename as a default argument so each lambda keeps its own file's name
    Spark_Full += sc.textFile(filename).keyBy(lambda x, f=filename: f)

Output: an array where each entry contains a tuple using the filename as key and value = each line of the file. (Technically, with this method you can also use a different key besides the actual file path, perhaps a hashed representation to save memory.) i.e.

[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
 ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
  ...]


You can also recombine either as a list of lines:

Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()

[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
 ('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]

Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the file path):

Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
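As a usage sketch (reusing the Spark_Full RDD built above), the recombined pairs can be collected into a plain dict so each file's full contents can be looked up by path:

# Collect (filename, whole_string) pairs into a dict for direct lookup.
whole_files = dict(
    Spark_Full.groupByKey()
              .map(lambda x: (x[0], ' '.join(x[1])))
              .collect()
)
print(whole_files['/home/folder_with_text_files/file1.txt'])
# 'file1_contents_line1 file1_contents_line2 file1_contents_line3'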
