Entry delimiter of JSON files for Hive table


Problem Description


We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS, then abstract a Hive table on top of it using a SerDe. I wonder though what would be the appropriate delimiter per JSON entry in a file? Is it a newline ("\n")? So it would look like this:

{ "id": "entry1", ... "post": ... }
{ "id": "entry2", ... "post": ... }
...
{ "id": "entryn", ... "post": ... }

What if we encounter a newline character within the JSON data itself, for example in the post field?

Solution

The best way would be one record per line, separated by "\n", exactly as you guessed. This also means that you should be careful to escape any "\n" that may appear inside the JSON elements.

Indented (pretty-printed) JSON won't work well with hadoop/hive, since to distribute processing, hadoop must be able to tell when a record ends, so it can split the processing of a file of N bytes among W workers into W chunks of size roughly N/W. The splitting is done by the particular InputFormat that's being used; in the case of text, that's TextInputFormat. TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1). For this reason, having other "\n" characters around would confuse Hadoop, and it would give you incomplete records.
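As a minimal sketch of the escaping point (assuming Jackson as the JSON library, which is my choice for illustration and not something the answer prescribes), a standard JSON serializer already escapes embedded newlines, so each record naturally stays on one line:

    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class OneRecordPerLine {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();

            // A post whose body contains a literal newline.
            Map<String, String> record = new LinkedHashMap<>();
            record.put("id", "entry1");
            record.put("post", "first line\nsecond line");

            // writeValueAsString() escapes the embedded newline as the
            // two characters '\' and 'n', so the whole record is emitted
            // on a single line:
            // {"id":"entry1","post":"first line\nsecond line"}
            System.out.println(mapper.writeValueAsString(record));
        }
    }

Any serializer that produces spec-compliant JSON gives the same guarantee, since literal control characters are not allowed inside JSON strings.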

As an alternative (I wouldn't recommend it, but if you really wanted to), you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through hadoop/hive, picking a character that won't appear in the JSON (for instance, \001 or CTRL-A, which is commonly used by Hive as a field delimiter). That can be tricky, though, since the delimiter also has to be supported by the SerDe. Also, if you change the record delimiter, anybody who copies or uses the files on HDFS must be aware of it, or they won't be able to parse them correctly and will need special code to do so. By keeping "\n" as the delimiter, the files remain normal text files that can be read by other tools.
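For illustration, here is roughly how that property could be set on a plain MapReduce job; this is a sketch, and the job name and the CTRL-A delimiter are assumptions for the example, not values from the answer:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CustomDelimiterJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Make TextInputFormat split records on CTRL-A (\u0001)
            // instead of "\n". Every reader of these files must then
            // agree on this delimiter, and the SerDe must support it.
            conf.set("textinputformat.record.delimiter", "\u0001");

            Job job = Job.getInstance(conf, "custom-delimiter-example");
            // ... set mapper, reducer, input/output paths as usual ...
        }
    }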

As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :) https://github.com/rcongiu/Hive-JSON-Serde
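To tie it together, a hypothetical sketch of creating a table over such newline-delimited files with that SerDe, issued through the Hive JDBC driver; the table name, columns, HiveServer2 endpoint, and HDFS location are made up for the example, and the SerDe jar is assumed to already be on Hive's classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateJsonTable {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint; adjust host, port and
            // database to your environment.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default");
                 Statement stmt = conn.createStatement()) {
                // Table name, columns and HDFS location are made up for
                // the example; the SerDe class is the one from the
                // repository linked above.
                stmt.execute(
                    "CREATE EXTERNAL TABLE posts (id STRING, post STRING) "
                  + "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' "
                  + "LOCATION '/data/social/posts'");
            }
        }
    }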
