Identifying Differences Efficiently

Problem Description

Every day, we receive huge files from various vendors in different formats (CSV, XML, custom) which we need to upload into a database for further processing.

The problem is that these vendors send a full dump of their data rather than just the updates. We have some applications where we need to send only the updates (that is, only the changed records). What we currently do is load the data into a staging table and then compare it against the previous data. This is painfully slow because the data set is huge, and we occasionally miss SLAs.

Is there a quicker way to resolve this issue? Any suggestions or help would be greatly appreciated. Our programmers are running out of ideas.

Recommended Answer

There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.

One of the more efficient ways I've seen is to create hash values of the rows of data you already have, create hashes of the import once it's in the database, and then compare the existing hashes to the incoming hashes (a small sketch of the comparison follows the list below).

Primary key match + hash match = Unchanged row

Primary key match + hash mismatch = Updated row

Primary key in incoming data but missing from existing data set = New row

Primary key not in incoming data but in existing data set = Deleted row
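
To make the classification concrete, here is a minimal sketch in Python. It assumes the row hashes for the existing data and for the incoming dump have already been pulled out of the database into two {primary_key: row_hash} mappings; all names are illustrative and not tied to any particular product.

```python
def classify_deltas(existing, incoming):
    """Split incoming rows into unchanged / updated / new, and detect deleted rows.

    Both arguments are {primary_key: row_hash} dictionaries.
    """
    unchanged, updated, new = [], [], []
    for pk, row_hash in incoming.items():
        if pk not in existing:
            new.append(pk)            # PK only in incoming data  -> new row
        elif existing[pk] == row_hash:
            unchanged.append(pk)      # PK match + hash match     -> unchanged row
        else:
            updated.append(pk)        # PK match + hash mismatch  -> updated row
    deleted = [pk for pk in existing if pk not in incoming]  # PK missing from incoming -> deleted row
    return unchanged, updated, new, deleted


if __name__ == "__main__":
    existing = {1: "aaa", 2: "bbb", 3: "ccc"}
    incoming = {1: "aaa", 2: "xyz", 4: "ddd"}
    print(classify_deltas(existing, incoming))   # -> ([1], [2], [4], [3])
```

In practice the same logic is usually expressed as a join between the staging table and the existing table on the primary key, but the decision table is identical.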

How to hash varies by database product, but all of the major providers have some sort of hashing available.
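
If it is more convenient to compute the hashes in application code while the file is being loaded into the staging table, rather than with the database's built-in hash functions, something like the following works; the column list and delimiter here are assumptions and only need to be identical on both sides of the comparison.

```python
import hashlib

def row_hash(row, columns):
    """Hash the non-key columns of a row in a fixed column order.

    `row` is a dict of column name -> value; `columns` is the ordered list of
    columns to include in the hash.
    """
    # Normalize NULLs to empty strings and join with a delimiter that is
    # unlikely to occur in the data, so ("ab", "c") and ("a", "bc") differ.
    payload = "\x1f".join("" if row.get(c) is None else str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Example with hypothetical vendor columns:
print(row_hash({"price": 10.5, "qty": 3, "desc": "widget"}, ["price", "qty", "desc"]))
```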

The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field-by-field analysis. Even fairly long hashes can be compared quickly.

It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.
