How to process/extract .pst using Hadoop MapReduce
Question
I am using MAPI tools (a Microsoft library, .NET-based) and then the Apache Tika libraries to process and extract the PST files from an Exchange server, which is not scalable.

How can I process/extract PST files the MapReduce way? Is there any tool or library available in Java that I can use in my MR jobs? Any help would be greatly appreciated.
The JPST lib internally uses: PstFile pstFile = new PstFile(java.io.File)
And the problem is that with the Hadoop APIs we don't have anything close to java.io.File.
The following option is always there, but it is not efficient:
File tempFile = File.createTempFile("myfile", ".tmp");
fs.moveToLocalFile(new Path(<HDFS pst path>), new Path(tempFile.getAbsolutePath()));
PstFile pstFile = new PstFile(tempFile);
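If you do go the temp-file route, the copy-and-clean-up step can at least be made safe. A minimal sketch, assuming the bytes arrive as an InputStream (in a real job this would be the FSDataInputStream returned by fs.open(hdfsPath); the class name PstLocalizer and the in-memory demo stream are illustrative, not from the question):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PstLocalizer {
    /**
     * Spills a PST read from an InputStream (e.g. the stream from
     * fs.open(hdfsPath)) to a local temp file, so libraries that require a
     * java.io.File, such as JPST's PstFile, can open it.
     */
    public static Path spillToLocal(InputStream pstInput) throws IOException {
        Path tempFile = Files.createTempFile("pst-", ".tmp");
        tempFile.toFile().deleteOnExit(); // safety net if the caller forgets
        Files.copy(pstInput, tempFile, StandardCopyOption.REPLACE_EXISTING);
        return tempFile;
    }

    public static void main(String[] args) throws IOException {
        // Demo with an in-memory stream standing in for the HDFS stream.
        byte[] fakePst = "not a real pst".getBytes();
        Path local = spillToLocal(new ByteArrayInputStream(fakePst));
        try {
            // Here you would do: PstFile pstFile = new PstFile(local.toFile());
            System.out.println(Files.size(local)); // prints 14
        } finally {
            Files.deleteIfExists(local); // clean up promptly, not just at JVM exit
        }
    }
}
```

The try/finally matters in a long-running task: each map call that spills a PST and forgets to delete it leaks local disk on the worker node.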
Answer
Take a look at Behemoth (http://digitalpebble.blogspot.com/2011/05/processing-enron-dataset-using-behemoth.html). It combines Tika and Hadoop.
I've also written my own Hadoop + Tika jobs. The pattern is:
- Wrap all the PST files into sequence files or Avro files.
- Write a map-only job that reads the PST files from the Avro file and writes them out to local disk.
- Run Tika on the files.
- Write the output of Tika back into a sequence file.
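The steps above can be sketched as a map-only Mapper. This is an illustrative skeleton, not the answerer's actual code: it assumes the PSTs were packed into a SequenceFile of (file name, raw bytes) pairs, and the class name PstExtractMapper and the choice of AutoDetectParser are my assumptions:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

/** Map-only job: (pst name, pst bytes) in, (pst name, extracted text) out. */
public class PstExtractMapper extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    protected void map(Text fileName, BytesWritable pstBytes, Context context)
            throws IOException, InterruptedException {
        // 1. Spill the PST to the task's local disk. This is the step that
        //    sidesteps the java.io.File problem: anything that needs a File
        //    (JPST, Tika's PST support) can now open the local copy.
        java.nio.file.Path local = Files.createTempFile("pst-", ".pst");
        Files.copy(new java.io.ByteArrayInputStream(pstBytes.copyBytes()),
                   local, StandardCopyOption.REPLACE_EXISTING);
        try (InputStream in = Files.newInputStream(local)) {
            // 2. Run Tika on the local file.
            BodyContentHandler handler = new BodyContentHandler(-1); // no size limit
            new AutoDetectParser().parse(in, handler, new Metadata());
            // 3. Emit the extracted text; with SequenceFileOutputFormat set on
            //    the driver, this writes Tika's output back into a sequence file.
            context.write(fileName, new Text(handler.toString()));
        } catch (Exception e) {
            // A corrupt PST shouldn't kill the whole job; count it and move on.
            context.getCounter("pst", "parse-failures").increment(1);
        } finally {
            Files.deleteIfExists(local); // don't leak worker-node disk
        }
    }
}
```

Note the caveat with this pattern: a whole PST must fit in a single BytesWritable, so very large mailbox files may need splitting before they are packed into the sequence/Avro file.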
Hope that helps.