File formats that can be read using PIG

Views: 161 Published: 2018/5/31 19:31:03 hadoop apache-pig

Question

What kinds of file formats can be read using PIG?

How can I store them in different formats? Say we have a CSV file and I want to store it as an MXL file; how can this be done? Whenever we use the STORE command, it creates a directory and stores the file as part-m-00000. How can I change the name of the file and overwrite the directory?
Answer

What kinds of file formats can be read using PIG? How can I store them in different formats?

There are a few built-in loading and storing methods, but they are limited:

BinStorage - "binary" storage

PigStorage - loads and stores data that is delimited by something (such as tab or comma)

TextLoader - loads data line by line (i.e., delimited by the newline character)

piggybank is a library of community-contributed user-defined functions. It has a number of loading and storing methods, including an XML loader but not an XML storer.

Say we have a CSV file and I want to store it as an MXL file; how can this be done?

I assume you mean XML here. Storing XML is a bit rough in Hadoop because the output is split into one file per task (mapper or reducer), so how do you know where to put the root tag? Producing well-formed XML will likely take some sort of post-processing.
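One possible shape for that post-processing step, sketched in Python. It assumes the part files have already been copied out of HDFS into a local directory (for instance with hadoop fs -get), and that every line in them is already a complete XML fragment. The function name wrap_parts, the root tag "items", and the two-space indent are illustrative choices, not anything Pig or Hadoop prescribes.

```python
import glob
import os


def wrap_parts(part_dir, out_path, root="items"):
    """Merge all part-* files in part_dir into one file with a single root tag.

    Returns the number of part files that were merged.
    """
    # Sort so that part-m-00000, part-m-00001, ... are merged in order.
    part_files = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("<{}>\n".format(root))
        for path in part_files:
            with open(path, encoding="utf-8") as part:
                for line in part:
                    line = line.strip()
                    if line:  # skip blank lines
                        out.write("  {}\n".format(line))
        out.write("</{}>\n".format(root))
    return len(part_files)
```

Calling wrap_parts("output_local", "merged.xml") would then produce a single file whose fragments all sit under one root element; both paths here are placeholders.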

One thing you can do is to write a UDF that converts your columns into an XML string:
B = FOREACH A GENERATE customudfs.DataToXML(col1, col2, col3);
For example, say col1, col2, col3 are "foo", 37, "lemons", respectively. Your UDF can output the string "<item><name>foo</name><num>37</num><fruit>lemons</fruit></item>".
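The string-building part of such a UDF could look roughly like the Python sketch below. It only illustrates the conversion logic: the function name data_to_xml and its three parameters are hypothetical, the tag names are taken from the example above, and the wiring into Pig (Pig UDFs are commonly written in Java) is left out. Escaping the values is an addition not mentioned in the answer; without it, a value containing & or < would break the markup.

```python
from xml.sax.saxutils import escape


def data_to_xml(name, num, fruit):
    """Build one <item> element from three column values."""
    fields = (("name", name), ("num", num), ("fruit", fruit))
    # escape() replaces &, < and > so the values cannot break the markup.
    body = "".join(
        "<{0}>{1}</{0}>".format(tag, escape(str(value))) for tag, value in fields
    )
    return "<item>{}</item>".format(body)


print(data_to_xml("foo", 37, "lemons"))
# prints <item><name>foo</name><num>37</num><fruit>lemons</fruit></item>
```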

Whenever we use the STORE command, it creates a directory and stores the file as part-m-00000. How can I change the name of the file and overwrite the directory?

You can't make the output file be named anything other than part-m-00000; that's just how Hadoop works. If you want a different name, rename the file after the fact with something like hadoop fs -mv output/part-m-00000 newoutput/myoutputfile. This could be done with a bash script that runs the Pig script and then executes this command.
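The answer suggests a bash wrapper; the same idea is sketched here in Python, which makes the command list easy to inspect before anything runs. The script name myscript.pig is a placeholder, and the optional cleanup step (removing an existing output directory so that STORE does not fail on it) is an assumption added for the "overwrite" part of the question, not something the answer itself proposes. The exact spelling of the remove flags can differ between Hadoop versions.

```python
import subprocess


def build_commands(script, part_file, target, output_dir=None):
    """Return the command lines to run, in order."""
    commands = []
    if output_dir is not None:
        # Optional, not from the original answer: delete a stale output
        # directory first, since STORE fails if the directory already exists.
        commands.append(["hadoop", "fs", "-rm", "-r", "-f", output_dir])
    commands.append(["pig", "-f", script])
    commands.append(["hadoop", "fs", "-mv", part_file, target])
    return commands


def run_all(commands, runner=subprocess.check_call):
    """Run each command in turn; check_call raises if one exits non-zero."""
    for command in commands:
        runner(command)


commands = build_commands(
    "myscript.pig", "output/part-m-00000", "newoutput/myoutputfile"
)
for command in commands:
    print(" ".join(command))
# run_all(commands)  # uncomment on a machine with pig and hadoop on PATH
```

Keeping command construction separate from execution means the sequence can be printed or checked without a cluster; run_all stops at the first failing command, so the rename is never attempted if the Pig job fails.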
