Selectively loading IIS log files into Hive


Question

I am just getting started with Hadoop/Pig/Hive on the Cloudera platform and have questions on how to effectively load data for querying.

I currently have ~50GB of IIS logs loaded into HDFS with the following directory structure:

/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/
/user/oi/raw_iis/Webserver2/Org/SubOrg/W3SVC1888303555/
/user/oi/raw_iis/Webserver3/Org/SubOrg/W3SVC1056245683/

I would like to load all the logs into a Hive table.

I have two issues:

My first issue is that some of the web servers may not have been configured correctly and will have IIS logs without all columns. These incorrect logs need additional processing to map the available columns in the log to the schema that contains all columns.

The data is space-delimited; the issue is that when not all columns are enabled, the log only includes the enabled columns. Hive can't automatically insert nulls, since the data does not include the columns that are empty. I need to be able to map the available columns in the log to the full schema.

Example of a good log:

#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 GET /common/viewFile/1232 Mozilla/5.0+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/27.0.1453.116+Safari/537.36

Example log with missing columns (cs-method and useragent):

#Fields: date time s-ip cs-uri-stem 
2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232

The log with missing columns needs to be mapped to the full schema like this:

#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 null /common/viewFile/1232 null

How can I map these enabled fields to a schema that includes all possible columns, inserting a blank/null/- token for the fields that are missing? Is this something I could handle with a Pig script?

How can I define my Hive tables to include information from the HDFS path, namely Org and SubOrg in my directory structure example, so that it is queryable in Hive? I am also unsure how to properly import data from the many directories into a single Hive table.

Answer

I was able to solve both of my issues with Pig UDFs (user-defined functions).

  1. Mapping columns to the proper schema: see this answer and this one.

All I really had to do was add some logic to handle the IIS headers that start with #. Below is the snippet from getNext() that I used; everything else is the same as mr2ert's example code.

See the values[0].equals("#Fields:") parts.

        @Override
        public Tuple getNext() throws IOException {
            ...

            Tuple t =  mTupleFactory.newTuple(1);

            // ignore header lines except the field definitions
            if(values[0].startsWith("#") && !values[0].equals("#Fields:")) {
                return t;
            }
            ArrayList<String> tf = new ArrayList<String>();
            int pos = 0;

            for (int i = 0; i < values.length; i++) {
                if (fieldHeaders == null || values[0].equals("#Fields:")) {
                    // grab field headers ignoring the #Fields: token at values[0]
                    if(i > 0) {
                        tf.add(values[i]);
                    }
                    fieldHeaders = tf;
                } else {
                    readField(values[i], pos);
                    pos = pos + 1;
                }
            }
            ...
         }
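As a rough, self-contained sketch of the column-mapping idea that readField() implements (the class name, the full-schema list, and the method names here are illustrative assumptions, not the original code):

import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: place each enabled column's value at its slot in the
// full schema and leave never-enabled columns as null, so every log file ends up
// with the same uniform set of columns.
public class ColumnMapper {
    private static final List<String> FULL_SCHEMA = Arrays.asList(
            "date", "time", "s-ip", "cs-method", "cs-uri-stem", "useragent");

    public static String[] mapToFullSchema(List<String> enabledHeaders, String[] values) {
        String[] row = new String[FULL_SCHEMA.size()];      // starts out all null
        for (int i = 0; i < enabledHeaders.size() && i < values.length; i++) {
            int target = FULL_SCHEMA.indexOf(enabledHeaders.get(i));
            if (target >= 0) {
                row[target] = values[i];
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Log with cs-method and useragent missing, as in the question:
        List<String> headers = Arrays.asList("date", "time", "s-ip", "cs-uri-stem");
        String[] values = { "2013-07-16", "00:00:00", "10.1.15.8", "/common/viewFile/1232" };
        System.out.println(Arrays.toString(mapToFullSchema(headers, values)));
        // -> [2013-07-16, 00:00:00, 10.1.15.8, null, /common/viewFile/1232, null]
    }
}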

  2. To include information from the file path, I added the following to the same LoadFunc UDF I used to solve issue 1. In the prepareToRead override, grab the file path and store it in a member variable.

public class IISLoader extends LoadFunc {
    ...
    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        in = reader;
        // remember which file this split came from so getNext() can add path info to the tuple
        filePath = ((FileSplit)split.getWrappedSplit()).getPath().toString();
    }

Then within getNext() I could add the path to the output tuple.
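
As a rough illustration of that last step, here is a hypothetical helper (not part of the original answer) that pulls Org and SubOrg back out of the stored filePath; the path layout assumed here follows the directory structure from the question, and the class and method names are made up for this sketch:

// Hypothetical helper: derive Org and SubOrg from the file path captured in
// prepareToRead(). The path returned by FileSplit.getPath().toString() may
// include a scheme/authority prefix (e.g. hdfs://namenode/user/oi/raw_iis/...),
// so anchor on the raw_iis directory instead of using absolute positions.
public class PathInfo {
    public static String[] orgAndSubOrg(String filePath) {
        String[] parts = filePath.split("/");
        for (int i = 0; i < parts.length; i++) {
            // layout assumed: .../raw_iis/<Webserver>/<Org>/<SubOrg>/<W3SVC...>/<file>
            if (parts[i].equals("raw_iis") && i + 3 < parts.length) {
                return new String[] { parts[i + 2], parts[i + 3] };
            }
        }
        return new String[] { null, null };  // layout not recognised
    }

    public static void main(String[] args) {
        String path = "hdfs://namenode/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/ex130716.log";
        String[] info = orgAndSubOrg(path);
        System.out.println(info[0] + " / " + info[1]);  // prints: Org / SubOrg
    }
}

The two values could then be appended to the output tuple (Pig's Tuple interface provides an append(Object) method), so Org and SubOrg come through as ordinary columns that the Hive table can expose or filter on.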
