Importing an LZO file into Java Spark as a Dataset


Problem description

I have some data in TSV format, compressed with LZO. Now I would like to use these data in a Java Spark program.

At the moment, I am able to decompress the files and then import them into Java as text files using

    SparkSession spark = SparkSession.builder()
            .master("local[2]")
            .appName("MyName")
            .getOrCreate();

    Dataset<Row> input = spark.read()
            .option("sep", "\t")
            .csv(args[0]);

    input.show(5);   // visually check if data were imported correctly

where I have passed the path to the decompressed file as the first argument. If I pass the LZO file instead, the output of show is illegible garbage.

Is there a way to make it work? I use IntelliJ as my IDE, and the project is set up with Maven.

Answer

I found a solution. It consists of two parts: installing the hadoop-lzo package and configuring it. After doing this, the code remains the same as in the question, provided you are OK with the LZO file being imported into a single partition.

Below I explain how to do this for a Maven project set up in IntelliJ.

  • Installing the hadoop-lzo package: you need to modify the pom.xml file in your Maven project folder so that it contains the following excerpt:

<repositories>
    <repository>
        <id>twitter-twttr</id>
        <url>http://maven.twttr.com</url>
    </repository>
</repositories>

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>

    <dependency>
        <!-- Apache Spark main library -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>

    <dependency>
        <!-- Packages for datasets and dataframes -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/com.hadoop.gplcompression/hadoop-lzo -->
    <dependency>
        <groupId>com.hadoop.gplcompression</groupId>
        <artifactId>hadoop-lzo</artifactId>
        <version>0.4.20</version>
    </dependency>

</dependencies>

This activates the Twitter Maven repository that hosts the hadoop-lzo package and makes hadoop-lzo available to the project.

  • The second step is to create a core-site.xml file that tells Hadoop you have installed a new codec. It must end up on the classpath at run time; I put it under src/main/resources/core-site.xml and marked that folder as a resource root (right-click the folder in the IntelliJ Project panel -> Mark Directory as -> Resources Root). The core-site.xml file should contain:

<configuration>
    <property>
        <name>io.compression.codecs</name>
        <value>org.apache.hadoop.io.compress.DefaultCodec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec,
            org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.BZip2Codec</value>
    </property>
    <property>
        <name>io.compression.codec.lzo.class</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
</configuration>
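Why this registration matters: Hadoop picks a decompression codec by matching the file-name suffix against the registered codec list, and without hadoop-lzo no codec claims the .lzo suffix, so the compressed bytes are parsed as plain text, which is the "illegible garbage" seen in the question. A minimal plain-Java sketch of that suffix lookup (the class and method names here are illustrative, not Hadoop's actual API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: mimics how a suffix -> codec table resolves a file name.
// LzopCodec claims ".lzo" (files produced by the lzop tool); LzoCodec uses
// the ".lzo_deflate" suffix for the raw LZO stream format.
public class CodecLookupSketch {
    static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
        CODECS.put(".lzo_deflate", "com.hadoop.compression.lzo.LzoCodec");
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
    }

    /** Returns the codec class for a file name, or null for "read as plain text". */
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> e : CODECS.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null; // no codec matches: the bytes are treated as uncompressed text
    }

    public static void main(String[] args) {
        System.out.println(codecFor("data.tsv.lzo")); // resolved by suffix
        System.out.println(codecFor("data.tsv"));     // null: plain text
    }
}
```

Once the codecs are registered, the same suffix match happens inside Hadoop's input path handling, which is why the original csv-reading code needs no changes.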

And that's it! Run your program again and it should work.
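One note on the IntelliJ-specific step above: marking src/main/resources as a resources root mirrors Maven's own convention, since that directory is Maven's default resource location and its contents are copied onto the classpath at build time. If your layout differs, the resource directory can be declared explicitly in pom.xml instead (a sketch; adjust the path to your project):

```xml
<build>
    <resources>
        <resource>
            <!-- directory whose contents (e.g. core-site.xml) are copied onto the classpath -->
            <directory>src/main/resources</directory>
        </resource>
    </resources>
</build>
```

Declaring it in the POM keeps the build IDE-independent, so the same project also works from the mvn command line or another IDE.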
