PDF input format for Mapreduce Hadoop
Problem description
Hi, I am using the external PDFBox library to parse PDF input files in MapReduce, but I am getting the following error.
Error: java.lang.ClassNotFoundException: org.apache.pdfbox.pdmodel.PDDocument
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at com.nielsen.grfe.processor.mapreduce.Pdfparser$PdfLineRecordReader.initialize(Pdfparser.java:109)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I am using the following dependencies:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.10</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>1.8.5</version>
</dependency>
1) Place the pdfbox jar file in the Hadoop lib folder as well, on every node that runs tasks, so that the library is available to Hadoop at runtime.
2) Restart the Hadoop cluster.
Or
1) Make the pdfbox library available to Hadoop by placing it in the distributed cache (for example with the `-libjars` generic option or `Job.addFileToClassPath`).
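
Whichever option is used, it can help to confirm that the class is actually visible inside the task JVM, since the stack trace shows the failure happening in the map task rather than in the driver. The following is only a minimal sketch of such a check, using nothing but the Java standard library; the class name `ClasspathCheck` and the method `locate` are hypothetical helpers invented for illustration and are not part of Hadoop or PDFBox.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Hadoop or PDFBox).
final class ClasspathCheck {

    /**
     * Tries to load the named class without initializing it.
     * Returns a description of where it was loaded from,
     * or null if the class cannot be found on the classpath.
     */
    static String locate(String className) {
        try {
            Class<?> cls = Class.forName(
                    className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // JDK classes loaded by the bootstrap loader have no code source.
            return src == null ? "(JDK runtime)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace above.
        String name = args.length > 0
                ? args[0]
                : "org.apache.pdfbox.pdmodel.PDDocument";
        String where = locate(name);
        System.out.println(where == null
                ? name + " is NOT on the classpath"
                : name + " loaded from " + where);
    }
}
```

Calling `ClasspathCheck.locate("org.apache.pdfbox.pdmodel.PDDocument")` at the start of the record reader's `initialize` method and logging the result would show whether the jar reached the task: a `null` result means the jar is still missing from the task classpath, while a jar location confirms which copy is being used.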