How to process/extract .pst files using Hadoop MapReduce
Question
I am using MAPI tools (a Microsoft library, in .NET) and then the Apache Tika libraries to process and extract PST files from an Exchange server, which is not scalable.
How can I process/extract PST files the MapReduce way? Is there any tool or library available in Java that I could use in my MR jobs? Any help would be great.
The JPST library internally uses: PstFile pstFile = new PstFile(java.io.File)
And the problem is that for the Hadoop APIs we don't have anything close to java.io.File.
The following option is always there, but it is not efficient:
File tempFile = File.createTempFile("myfile", ".tmp");
fs.moveToLocalFile(new Path(<HDFS pst path>), new Path(tempFile.getAbsolutePath()));
PstFile pstFile = new PstFile(tempFile);
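The spool-to-local-disk workaround above can be sketched in a self-contained form. In a real mapper the InputStream would come from Hadoop's FileSystem#open() on the HDFS path; here the stream source is abstracted so the snippet runs with only the JDK, and the PstSpooler class name is illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class PstSpooler {

    // Copy a stream (in a real job: fs.open(hdfsPath)) to a local temp file,
    // so libraries that require a java.io.File, such as JPST's PstFile,
    // can open it.
    public static File spoolToTempFile(InputStream in) throws IOException {
        File tempFile = File.createTempFile("pst-", ".pst");
        tempFile.deleteOnExit(); // clean up when the task JVM exits
        Files.copy(in, tempFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return tempFile;
        // then: PstFile pstFile = new PstFile(tempFile);  (as in the question)
    }

    public static void main(String[] args) throws IOException {
        byte[] fake = "not a real pst".getBytes();
        File local = spoolToTempFile(new ByteArrayInputStream(fake));
        System.out.println(local.length() == fake.length);
    }
}
```

This is still the "not efficient" path the question describes (every PST is copied out of HDFS), but spooling through a stream at least avoids a separate moveToLocalFile round trip per file.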
Take a look at Behemoth (http://digitalpebble.blogspot.com/2011/05/processing-enron-dataset-using-behemoth.html). It combines Tika and Hadoop.
I've also written my own Hadoop + Tika jobs. The pattern is:
- Wrap all the PST files into sequence or Avro files.
- Write a map-only job that reads the PST files from the Avro files and writes them to the local disk.
- Run Tika across the files.
- Write the output of Tika back into a sequence file.
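The wrapping step exists because HDFS handles a few large files far better than many small ones. As a minimal stand-in for the SequenceFile/Avro container (real jobs would use Hadoop's SequenceFile or Avro writers), here is a hypothetical length-prefixed pack/unpack in plain Java — the PstPacker class and its record format are illustrative only:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class PstPacker {

    // Write each (name, bytes) entry as: UTF name, int length, raw bytes.
    public static byte[] pack(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().length);
            out.write(e.getValue());
        }
        out.flush();
        return buf.toByteArray();
    }

    // Read the container back; in the map-only job each entry would be
    // spooled to local disk and handed to Tika/JPST.
    public static Map<String, byte[]> unpack(byte[] container) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(container));
        while (in.available() > 0) {
            String name = in.readUTF();
            byte[] data = new byte[in.readInt()];
            in.readFully(data);
            files.put(name, data);
        }
        return files;
    }
}
```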
Hope that helps.