Using the Icelandic Thorn character as a delimiter in Hive


Question

I'm currently trying to import some DoubleClick advertising logs into Hadoop.

These logs are stored in gzipped, delimited files that are encoded in code page 1252 (Windows-ANSI?) and use the Icelandic Thorn character as the delimiter.

I can happily import these logs into a single column, but I can't seem to find a way to get Hive to understand the Thorn character - I think maybe because it doesn't understand the 1252 encoding?

I've looked at the Create Table documentation - http://hive.apache.org/docs/r0.9.0/language_manual/data-manipulation-statements.html - but can't seem to find any way to get this encoding/delimiter working.

I've also seen a suggestion at https://karmasphere.com/karmasphere-analyst-faq that the encoding for these files is ISO-8859-1, but I don't see how to use that information in Hive or HDFS.

I know I can do a map job after import to split these rows into multiple records.

But is there an easier way to use this delimiter directly?

Thanks

Stuart

Recommended answer

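Use '\-2' as the delimiter. The char is a signed byte: the lowercase Thorn (þ) is byte 0xFE (254 unsigned) in code page 1252, which is -2 when read as a signed byte.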
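The signed-byte arithmetic behind '\-2' can be checked outside Hive. The following is a minimal Python sketch, not part of the original answer. It assumes the delimiter is the lowercase Thorn (þ), which is what -2 implies, and the sample row and the `split_fields` helper are made up for illustration.

```python
import struct

# Assumption: the delimiter is the lowercase Thorn, as implied by '\-2'.
THORN = "þ"

# Code page 1252 and ISO-8859-1 both map this character to the single byte 0xFE.
raw = THORN.encode("cp1252")
assert raw == THORN.encode("latin-1") == b"\xfe"

unsigned_value = raw[0]                    # the byte read as unsigned
signed_value = struct.unpack("b", raw)[0]  # the same byte read as signed, like a Java byte


def split_fields(line):
    """Hypothetical helper: split one raw log line on the Thorn byte and decode each field."""
    return [field.decode("cp1252") for field in line.split(raw)]


# Made-up sample row; real DoubleClick logs will have different fields.
sample = "2012-01-01þ12345þclick".encode("cp1252")

print(unsigned_value, signed_value, split_fields(sample))
# prints: 254 -2 ['2012-01-01', '12345', 'click']
```

Because the Thorn is a single byte in both encodings, splitting on the raw byte works whether the files are really code page 1252 or ISO-8859-1.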

Apparently the Hive developers don't consider this a problem: https://issues.apache.org/jira/browse/HIVE-237
