Selectively loading IIS log files into Hive

Problem description

I am just getting started with Hadoop/Pig/Hive on the Cloudera platform and have questions about how to effectively load data for querying.

I currently have ~50 GB of IIS logs loaded into HDFS with the following directory structure:

/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/
/user/oi/raw_iis/Webserver2/Org/SubOrg/W3SVC1888303555/
/user/oi/raw_iis/Webserver3/Org/SubOrg/W3SVC1056245683/

etc

I would like to load all the logs into a Hive table.

I have two issues/questions:

1.

My first issue is that some of the web servers may not have been configured correctly, so their IIS logs will not contain all of the columns. These incomplete logs need additional processing to map the columns that are present in the log onto the schema that contains all columns.

The data is space-delimited, and when not all columns are enabled, the log only includes the columns that are enabled. Hive can't automatically insert nulls, since the data does not contain the empty columns at all. I need to be able to map the available columns in each log to the full schema.

Example of a correct log:

#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 GET /common/viewFile/1232 Mozilla/5.0+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/27.0.1453.116+Safari/537.36

Example log with missing columns (cs-method and useragent):

#Fields: date time s-ip cs-uri-stem 
2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232

The log with missing columns needs to be mapped to the full schema like this:

#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 null /common/viewFile/1232 null

How can I map these enabled fields to a schema that includes all possible columns, inserting a blank/null/- token for any fields that are missing? Is this something I could handle with a Pig script?

2.

How can I define my Hive tables to include information from the HDFS path, namely the Org and SubOrg parts of my directory structure, so that they are queryable in Hive? I am also unsure how to properly import the data from the many directories into a single Hive table.

Solution

I was able to solve both of my issues with a Pig UDF (user-defined function).

  1. Mapping columns to proper schema: See this answer and this one.

All I really had to do was add some logic to handle the IIS headers that start with #. Below is the snippet from getNext() that I used; everything else is the same as mr2ert's example code.

See the values[0].equals("#Fields:") parts.

        @Override
        public Tuple getNext() throws IOException {
            ...

            Tuple t =  mTupleFactory.newTuple(1);

            // ignore header lines except the field definitions
            if(values[0].startsWith("#") && !values[0].equals("#Fields:")) {
                return t;
            }
            ArrayList<String> tf = new ArrayList<String>();
            int pos = 0;

            for (int i = 0; i < values.length; i++) {
                if (fieldHeaders == null || values[0].equals("#Fields:")) {
                    // grab field headers ignoring the #Fields: token at values[0]
                    if(i > 0) {
                        tf.add(values[i]);
                    }
                    fieldHeaders = tf;
                } else {
                    readField(values[i], pos);
                    pos = pos + 1;
                }
            }
            ...
         }

  2. To include information from the file path, I added the following to the LoadFunc UDF that I used to solve issue 1. In the prepareToRead override, grab the file path and store it in a member variable.

    public class IISLoader extends LoadFunc {
        // member variables referenced by the overrides below
        private RecordReader in;
        private String filePath;
        ...
        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) {
            in = reader;
            // path of the file this split was read from
            filePath = ((FileSplit) split.getWrappedSplit()).getPath().toString();
        }
        ...
    }

Then within getNext() I could add the path to the output tuple.
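For example, here is a minimal sketch of that step (an illustration only, not the exact code from my UDF: the "raw_iis" anchor segment, the +2/+3 offsets, and the tuple variable t are assumptions based on the directory layout shown above):

        // Sketch: pull Org and SubOrg out of the filePath captured in
        // prepareToRead() and append them to the output tuple built in getNext().
        String[] segments = filePath.split("/");
        int base = java.util.Arrays.asList(segments).indexOf("raw_iis");
        String org    = (base >= 0 && segments.length > base + 2) ? segments[base + 2] : null;
        String subOrg = (base >= 0 && segments.length > base + 3) ? segments[base + 3] : null;
        t.append(org);     // Org column in the output tuple
        t.append(subOrg);  // SubOrg column in the output tuple

Emitted this way, Org and SubOrg become ordinary columns in the loaded relation and can be queried like any other field.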
