Is it possible to read pdf/audio/video files (unstructured data) using Apache Spark?
Is it possible to read pdf/audio/video files (unstructured data) using Apache Spark? For example, I have thousands of PDF invoices, and I want to read data from them and run some analytics on it. What steps must I take to process unstructured data?
Yes, it is. Use sparkContext.binaryFiles to load the files in binary format, then use map to convert each value to some other format, for example by parsing the binary content with Apache Tika or Apache POI.
Pseudocode:
val rawFile = sparkContext.binaryFiles(...)
val ready = rawFile.map(/* parse here with another framework */)
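Importantly, Spark itself does not parse these formats; the parsing must be done with another framework, such as the ones mentioned above. The function passed to map receives each file as a (path, PortableDataStream) pair, and calling open() on the stream returns an InputStream that can be handed to the parser.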
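The pseudocode above could be fleshed out roughly as follows. This is only a sketch, assuming Apache Tika (tika-core plus the standard parsers) is on the executor classpath and that the input location is passed as the first program argument; the object name PdfTextExtraction and the variable names are hypothetical, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.tika.Tika

object PdfTextExtraction {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-text-extraction").getOrCreate()
    val sc = spark.sparkContext

    // Assumption: the directory or glob of input files is given as the first argument.
    val inputPath = args(0)

    // binaryFiles yields an RDD[(String, PortableDataStream)]: one (path, stream) pair per file.
    val rawFiles = sc.binaryFiles(inputPath)

    val texts = rawFiles.mapPartitions { files =>
      // Create the Tika facade on the executor, once per partition,
      // so it never has to be serialized from the driver.
      val tika = new Tika()
      files.map { case (path, stream) =>
        val in = stream.open() // DataInputStream over the file contents
        try {
          (path, tika.parseToString(in))
        } finally {
          in.close()
        }
      }
    }

    // Print a small sample on the driver: file path and length of the extracted text.
    texts.take(5).foreach { case (path, text) =>
      println(s"$path: ${text.length} characters")
    }

    spark.stop()
  }
}
```

Each file is read whole by a single task, so this approach suits many small or medium files such as invoices better than a few very large ones. Note that the Tika facade truncates extracted text at a default maximum length, which may need raising for long documents.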