Apache Tika 1.11 on Spark NoClassDefFoundError

Question

I'm trying to use Apache Tika on top of Spark. However, I'm having issues with configuration. My best guess at the moment is that the dependencies (of which Tika has a lot...) are not bundled with the JAR submitted to Spark. If this intuition is correct, I am unsure what the best path forward is. But I am also not certain that this is even my issue.
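A quick way to test that guess (assuming the built artifact is target/tikaTime-1.0.jar, as in the spark-submit line further down) is to list the jar's contents and look for the Tika classes:

jar tf target/tikaTime-1.0.jar | grep org/apache/tika

If nothing comes back, the Tika classes were never packaged into the jar, which would line up with a failure that only appears at runtime.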

The following is a pretty simple Spark job which compiles, but hits a runtime error when it gets to the Tika instantiation.

My pom.xml is as follows:

<project>
  <groupId>tika.test</groupId>
  <artifactId>tikaTime</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>TikaTime</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.11</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.11</version>
    </dependency>
  </dependencies>
</project>

My sample code is here:

/* TikaTime.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.File;
import java.io.IOException;
import org.apache.tika.*;

public class TikaTime {
  public static void main(String[] args) throws IOException {

    String logFile = "file.txt";
    File logfile = new File("/home/file.txt");
    SparkConf conf = new SparkConf().setAppName("TikaTime");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

    //Tika facade class.
    Tika tika = new Tika();
  }
}

The error stack trace is as follows:

    Lines with a: 2, lines with b: 1
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/tika/Tika
    at TikaTime.main(TikaTime.java:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.tika.Tika
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 10 more

Curious if others have encountered this issue before. I rarely use Maven and am also somewhat new to Spark, so I'm not confident my intuition is correct on this.

Including my spark-submit syntax in case it is of interest:

~/spark151/spark-1.5.1/bin/spark-submit --class "TikaTime" --master local[4] target/tikaTime-1.0.jar
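For reference, spark-submit can also ship extra dependency jars to the driver and executors via its --jars flag, which takes a comma-separated list. With Tika's large transitive dependency tree, though, enumerating every jar by hand is impractical, which is why the accepted fix below bundles everything into one jar instead. A hypothetical invocation (the paths here are made up) would look like:

~/spark151/spark-1.5.1/bin/spark-submit --class "TikaTime" --master local[4] --jars /path/to/tika-core-1.11.jar,/path/to/tika-parsers-1.11.jar target/tikaTime-1.0.jar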

Answer

Per Gagravarr's response and my original suspicion, the issue was needing to provide an uber-jar to Spark. This was accomplished using the maven-shade plugin. The new pom.xml is shown below.

<project>
  <groupId>tika.test</groupId>
  <artifactId>tikaTime</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>TikaTime</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.11</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.11</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.2</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
          <finalName>uber-${project.artifactId}-${project.version}</finalName>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
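A common refinement, not used in the pom above: since spark-submit already supplies Spark's own classes on the classpath at runtime, spark-core can be given provided scope. The shade plugin omits provided dependencies from the shaded artifact, which keeps the uber-jar considerably smaller:

  <dependency> <!-- Spark dependency; supplied by spark-submit at runtime -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.5.2</version>
    <scope>provided</scope>
  </dependency>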

Note: you must also submit the uber-jar created by this build to Spark, instead of the original jar.
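Concretely, with the finalName configured above the shaded jar lands at target/uber-tikaTime-1.0.jar, so the build-and-submit sequence becomes:

mvn clean package
~/spark151/spark-1.5.1/bin/spark-submit --class "TikaTime" --master local[4] target/uber-tikaTime-1.0.jar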
