Apache Tika 1.11 on Spark NoClassDefFoundError

Problem description

I'm trying to use Apache Tika on top of Spark. However, I'm having issues with configuration. My best guess at the moment is that the dependencies (of which Tika has a lot...) are not bundled with the JAR submitted to Spark. If this intuition is correct, I am unsure what the best path forward is. But I am also not certain that this is even my issue.
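
One way to test this guess (assuming the jar was built with a plain mvn package, as the pom below suggests) is to list the application jar's contents and look for the Tika classes; a default Maven build packages only the project's own classes:

jar tf target/tikaTime-1.0.jar | grep -i tika
# Prints nothing for an unshaded build: org/apache/tika/... is absent,
# which matches the NoClassDefFoundError seen at runtime.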

The following is a pretty simple Spark job which compiles, but hits a runtime error when it gets to the Tika instantiation.

My pom.xml is as follows:

<project>
  <groupId>tika.test</groupId>
  <artifactId>tikaTime</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>TikaTime</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.11</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.11</version>
    </dependency>
  </dependencies>
</project>

My sample code is here:

/* TikaTime.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.File;
import java.io.IOException;
import org.apache.tika.*;

public class TikaTime {
  public static void main(String[] args) throws IOException {

    String logFile = "file.txt";
    File logfile = new File("/home/file.txt");
    SparkConf conf = new SparkConf().setAppName("TikaTime");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

    //Tika facade class.
    Tika tika = new Tika();
  }
}
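
For reference, once the class is actually on the runtime classpath, the facade can be exercised like this (a minimal sketch using the standard Tika facade methods detect() and parseToString(); TikaProbe is a hypothetical standalone class, not part of the job above):

/* TikaProbe.java - hypothetical standalone check, not part of the Spark job */
import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaProbe {
  public static void main(String[] args) throws IOException, TikaException {
    Tika tika = new Tika();
    File logfile = new File("/home/file.txt");
    // detect() returns the MIME type, e.g. "text/plain" for a plain-text file
    System.out.println("Type: " + tika.detect(logfile));
    // parseToString() auto-detects the format and extracts the text content
    System.out.println(tika.parseToString(logfile));
  }
}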

The stack trace of the error is as follows:

    Lines with a: 2, lines with b: 1
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/tika/Tika
    at TikaTime.main(TikaTime.java:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.tika.Tika
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 10 more

Curious if others have encountered this issue before. I rarely use Maven and am also somewhat new to Spark, so I'm not confident my intuition is correct on this.

Edit: Including my spark-submit syntax in case it is of interest.

~/spark151/spark-1.5.1/bin/spark-submit --class "TikaTime" --master local[4] target/tikaTime-1.0.jar

Solution

Per Gagravarr's response and my original suspicion, the issue was that an uber-jar (with Tika and its dependencies bundled in) needed to be provided to Spark. This was accomplished using the maven-shade-plugin. The new pom.xml is shown below.

<project>
  <groupId>tika.test</groupId>
  <artifactId>tikaTime</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>TikaTime</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.11</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.11</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.2</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
          <finalName>uber-${project.artifactId}-${project.version}</finalName>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Note: you must also submit the uber-jar produced by this build to Spark, instead of the original jar.
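
With the finalName above, running mvn package produces the shaded artifact at target/uber-tikaTime-1.0.jar, so the earlier submit command becomes:

~/spark151/spark-1.5.1/bin/spark-submit --class "TikaTime" --master local[4] target/uber-tikaTime-1.0.jar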
