How to read whole file in one string
Question
I want to read a json or xml file in pyspark. My file is split across multiple lines in

rdd = sc.textFile(json or xml)
Input:
{
"employees":
[
{
"firstName":"John",
"lastName":"Doe"
},
{
"firstName":"Anna"
}
]
}
The input is spread over multiple lines.
Expected output: {"employees":[{"firstName":"John",......]}
How do I get the complete file as a single string using pyspark?
Please help me, I am new to Spark.
Answer
There are 3 ways (I invented the 3rd one, the first two are standard built-in Spark functions); the solutions here are in PySpark:
textFile, wholeTextFiles, and a labeled textFile (key = file, value = 1 line from the file; this is a kind of mix between the two given ways to parse files).
1.) textFile
Input:
rdd = sc.textFile('/home/folder_with_text_files/input_file')
output: an array containing one line of the file per entry, i.e. [line1, line2, ...]
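Since textFile splits the file on newlines, the simplest way to get the whole file back as one string is to join the lines again. A minimal sketch of that step, using a plain Python list standing in for the collected RDD lines (the sample JSON here is a completed version of the question's input, not real Spark output):

```python
import json

# Lines as sc.textFile would yield them, one entry per file line
# (a stand-in list here; with Spark you would use rdd.collect()).
lines = [
    '{',
    '"employees": [',
    '{"firstName": "John", "lastName": "Doe"},',
    '{"firstName": "Anna"}',
    ']',
    '}',
]

# Join everything back into a single string...
whole_file = ''.join(lines)

# ...which can then be parsed as JSON in one go.
data = json.loads(whole_file)
print(data['employees'][0]['firstName'])  # John
```

Note that collect() pulls the whole file to the driver, so this only makes sense for files that fit in driver memory.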
2.) wholeTextFiles
Input:
rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')
output: an array of tuples, where the first item is the "key" with the filepath and the second item contains one file's entire contents, i.e.
[(u'file:/home/folder_with_text_files/', u'file1_contents'), (u'file:/home/folder_with_text_files/', file2_contents), ...]
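Because wholeTextFiles already delivers each file as a single string, the value can be fed straight to a parser. A hedged sketch of that per-file step on one sample tuple (the path and contents below are made-up stand-ins for a wholeTextFiles entry; with Spark you would apply the same parse inside rdd.mapValues(json.loads)):

```python
import json

# A (filepath, contents) pair shaped like one wholeTextFiles entry.
record = (
    u'file:/home/folder_with_text_files/file1.json',   # hypothetical path
    u'{"employees": [{"firstName": "John", "lastName": "Doe"}]}',
)

path, contents = record
data = json.loads(contents)  # the whole file parsed in one call
print(path, data['employees'][0]['lastName'])
```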
3.) "Labeled" textFile
Input:
import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

# SparkContext.stop(sc)  # stop any already-running context first, if one exists
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Data_File = '/home/folder_with_text_files'  # folder containing the input files
Spark_Full = sc.emptyRDD()                  # start from an empty RDD and union each file into it
for filename in glob.glob(Data_File + "/*"):
    # bind filename as a default argument so each lambda keeps its own file's name
    Spark_Full += sc.textFile(filename).keyBy(lambda x, f=filename: f)
output: an array where each entry is a tuple of filename-as-key with value = each line of the file. (Technically, with this method you can also use a different key besides the actual filepath, perhaps a hash representation to save memory.) i.e.
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine these, either as a list of lines per file:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the filepath):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
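The groupByKey-and-join step above can be mimicked in plain Python to see exactly what it does to the (filename, line) pairs; a sketch using itertools.groupby on sample pairs shaped like the output shown earlier (groupby requires the pairs to already be sorted by key, which Spark's groupByKey does not):

```python
from itertools import groupby
from operator import itemgetter

# (filename, line) pairs as the labeled-textFile approach produces them.
pairs = [
    ('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
    ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
    ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
]

# Plain-Python equivalent of
# Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))):
# group by filename, then join each group's lines with spaces.
recombined = [
    (key, ' '.join(line for _, line in group))
    for key, group in groupby(pairs, key=itemgetter(0))
]
print(recombined[0])
# ('/home/folder_with_text_files/file1.txt', 'file1_contents_line1 file1_contents_line2')
```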