Spark - Obtaining file name in RDDs
Problem description
I am trying to process 4 directories of text files that keep growing every day. What I need to do is: if somebody searches for an invoice number, I should give them the list of files that contain it.
I was able to map and reduce the values in the text files by loading them as RDDs, but how can I obtain the file name and other file attributes?
Recommended answer
Since Spark 1.6 you can combine the text data source with the input_file_name function as follows:
Scala:

import org.apache.spark.sql.functions.input_file_name

val inputPath: String = ???

spark.read.text(inputPath)
  .select(input_file_name, $"value")
  .as[(String, String)] // Optionally convert to a Dataset
  .rdd                  // or to an RDD
Python (versions before 2.x are buggy and may not preserve the file names when converting to an RDD):

from pyspark.sql.functions import input_file_name

(spark.read.text(input_path)
    .select(input_file_name(), "value")
    .rdd)
This can be used with other input formats as well.