Spark - Obtaining file name in RDDs
Problem description
I am trying to process 4 directories of text files that keep growing every day. What I need to do is: if somebody searches for an invoice number, I should give them the list of files that contain it.
I was able to map and reduce the values in the text files by loading them as RDDs, but how can I obtain the file name and other file attributes?
Recommended answer
Since Spark 1.6 you can combine the text data source with the input_file_name function as follows:
Scala:

import org.apache.spark.sql.functions.input_file_name

val inputPath: String = ???

spark.read.text(inputPath)
  .select(input_file_name, $"value")
  .as[(String, String)] // Optionally convert to a Dataset
  .rdd                  // or to an RDD
Python (versions before 2.x are buggy and may not preserve the file names when converting to an RDD):

from pyspark.sql.functions import input_file_name

(spark.read.text(input_path)
    .select(input_file_name(), "value")
    .rdd)
This can be used with other input formats as well.