Is there any way to compare two Avro files to see what differences exist in the data?


Question

Ideally, I'd like something packaged like SAS proc compare that can give me:

  • The count of rows for each dataset

  • The count of rows that exist in one dataset, but not the other

  • The variables that exist in one dataset, but not the other

  • The variables that do not have the same format in the two files (I realize this would be rare for Avro files, but it would be helpful to know quickly without deciphering errors)

  • The total number of mismatching rows for each column, and a presentation of all the mismatches for a column or any 20 mismatches (whichever is smaller)

I've worked out one way to make sure the datasets are equivalent, but it is pretty inefficient. Let's assume we have two Avro files with 100 rows and 5 columns (one key and four float features). If we join the tables and create new variables that are the differences between the matching features from the two datasets, then any non-zero difference indicates a mismatch in the data. From there it would be fairly easy to determine the entire list of requirements above, but there may well be more efficient approaches.
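The join-and-diff idea above can be sketched in plain Python. The row dicts below are hypothetical stand-ins for the deserialized Avro records (one `key` column plus float features); the point is just the per-column mismatch bookkeeping, not Avro I/O:

```python
# Hypothetical rows deserialized from the two Avro files, indexed by key.
table1 = {r["key"]: r for r in [
    {"key": 1, "f1": 0.5, "f2": 1.0},
    {"key": 2, "f1": 2.0, "f2": 3.0},
]}
table2 = {r["key"]: r for r in [
    {"key": 1, "f1": 0.5, "f2": 1.5},   # f2 differs from table1
    {"key": 3, "f1": 9.0, "f2": 0.0},   # key missing from table1
]}

print("rows in table1:", len(table1))
print("rows in table2:", len(table2))
print("keys only in table1:", sorted(table1.keys() - table2.keys()))
print("keys only in table2:", sorted(table2.keys() - table1.keys()))

# For keys present in both tables, record which keys mismatch per column.
mismatches = {}
for k in table1.keys() & table2.keys():
    for col in table1[k]:
        if table1[k][col] != table2[k].get(col):
            mismatches.setdefault(col, []).append(k)
print("mismatching rows per column:",
      {col: len(keys) for col, keys in mismatches.items()})
```

Capping the report at 20 examples per column would just be a slice of each `mismatches[col]` list.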

Answer

Avro files store the schema and the data separately. This means that besides the Avro file with the data you should have a schema file, usually something like *.avsc. This way your task can be split into three parts:

  1. Compare the schemas. This way you can find the fields that have different data types in the two files, the fields that exist in only one of them, and so on. This task is very easy and can be done outside of Hadoop, for instance in Python:

import json

schema1 = json.load(open('schema1.avsc'))
schema2 = json.load(open('schema2.avsc'))

def print_cross(s1set, s2set, message):
    # Report field names present in the first set but missing from the second.
    for s in s1set:
        if s not in s2set:
            print(message % s)

s1names = set(field['name'] for field in schema1['fields'])
s2names = set(field['name'] for field in schema2['fields'])
print_cross(s1names, s2names, 'Field "%s" exists in table1 and does not exist in table2')
print_cross(s2names, s1names, 'Field "%s" exists in table2 and does not exist in table1')

def print_cross2(s1dict, s2dict, message):
    # Report field names present in both schemas whose types differ.
    for s in s1dict:
        if s in s2dict and s1dict[s] != s2dict[s]:
            print(message % (s, s1dict[s], s2dict[s]))

s1types = {field['name']: str(field['type']) for field in schema1['fields']}
s2types = {field['name']: str(field['type']) for field in schema2['fields']}
print_cross2(s1types, s2types, 'Field "%s" has type "%s" in table1 and type "%s" in table2')

Here is an example of the two schemas:

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int"]},
     {"name": "favorite_color", "type": ["string", "null"]},
     {"name": "test", "type": "int"}
 ]
}

And here is the output:

[localhost:temp]$ python compare.py 
Field "test" exists in table2 and does not exist in table1
Field "favorite_number" has type "['int', 'null']" in table1 and type "['int']" in table2

  2. If the schemas are equal (and if they are not, you probably don't need to compare the data), you can compare the data as follows. A simple approach that works in any case: calculate an md5 hash for each row and join the two tables on the value of this md5 hash. This will give you the number of rows that are the same in both tables, the number of rows specific to table1, and the number of rows specific to table2. It can easily be done in Hive; here is the code of an MD5 UDF: https://gist.github.com/dataminelab/1050002
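The row-hash matching described above can be sketched in plain Python (hypothetical in-memory rows stand in for the deserialized Avro records; in practice the hashing and join would run in Hive as the answer suggests):

```python
import hashlib
import json
from collections import Counter

def row_md5(row):
    # Serialize deterministically (sorted keys) before hashing,
    # so equal rows always produce equal digests.
    return hashlib.md5(json.dumps(row, sort_keys=True).encode()).hexdigest()

# Hypothetical rows deserialized from the two Avro files.
rows1 = [{"name": "a", "n": 1}, {"name": "b", "n": 2}]
rows2 = [{"n": 1, "name": "a"}, {"name": "c", "n": 3}]

h1 = Counter(row_md5(r) for r in rows1)
h2 = Counter(row_md5(r) for r in rows2)

common = sum((h1 & h2).values())   # rows identical in both tables
only1 = sum((h1 - h2).values())    # rows specific to table1
only2 = sum((h2 - h1).values())    # rows specific to table2
print(common, only1, only2)
```

Using a `Counter` rather than a `set` keeps duplicate rows counted correctly on both sides.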

  3. For comparing field-to-field, you have to know the primary key of the table, join the two tables on the primary key, and compare the fields.

Previously I've developed comparison functions for tables, and they usually looked like this:

  1. Check that both tables exist and are available
  2. Compare their schemas. If there are mismatches in the schemas, stop
  3. If a primary key is specified:
     1. Join the two tables on the primary key with a full outer join
     2. Calculate the md5 hash of each row
     3. Output the primary keys with a diagnosis (PK exists only in table1, PK exists only in table2, PK exists in both tables but the data mismatches)
     4. Take 100 rows of each problematic class, join them with both tables, and output them into a "mismatch examples" table

  4. If the primary key is not specified:
     1. Calculate the md5 hash of each row
     2. Full outer join table1 with table2 on the md5 hash value
     3. Count the number of matching rows, the rows that exist only in table1, and the rows that exist only in table2
     4. Take a 100-row sample of each mismatch type and output it into a "mismatch examples" table

Usually developing and debugging such a function takes 4-5 business days.
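The primary-key branch of that procedure can be sketched in Python. This is a minimal in-memory sketch, assuming row dicts and a single-column primary key; the real thing would run as Hive joins:

```python
import hashlib
import json

def row_md5(row):
    # Deterministic serialization so identical rows hash identically.
    return hashlib.md5(json.dumps(row, sort_keys=True).encode()).hexdigest()

def classify(table1, table2, pk):
    """Full outer join on the primary key, tagging each key with a diagnosis:
    PK only in one table, or PK in both with matching/mismatching data."""
    t1 = {r[pk]: r for r in table1}
    t2 = {r[pk]: r for r in table2}
    result = {}
    for k in t1.keys() | t2.keys():
        if k not in t2:
            result[k] = "PK only in table1"
        elif k not in t1:
            result[k] = "PK only in table2"
        elif row_md5(t1[k]) != row_md5(t2[k]):
            result[k] = "PK in both tables, data mismatch"
        else:
            result[k] = "match"
    return result

diag = classify(
    [{"id": 1, "v": 10}, {"id": 2, "v": 20}],
    [{"id": 2, "v": 99}, {"id": 3, "v": 30}],
    pk="id",
)
print(diag)
```

Sampling 100 rows per diagnosis class for the "mismatch examples" table is then just a filtered slice over `diag`.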
