Detecting when data has changed


Problem description

OK, so here's the story:

- I have lots of files (pretty big, around 25 GB) that are in a particular format and need to be imported into a datastore

- these files are continuously updated with data, sometimes new, sometimes the same data

- I am trying to figure out an algorithm to detect whether something has changed for a particular line in a file, in order to minimize the time spent updating the database

- the way it currently works is that I drop all the data in the database each time and then reimport it, but this won't work anymore since I'll need a timestamp for when an item has changed

- the files contain strings and numbers (titles, orders, prices etc.)

The only solutions I could think of are:

- compute a hash for each row in the database, compare it against the hash of the corresponding row from the file, and if they differ, update the database

- keep two copies of the files, the previous one and the current one, and diff them (which is probably faster than updating the db), then update the db based on those diffs

Since the amount of data is very big to huge, I am kind of out of options for now. In the long run, I'll get rid of the files and the data will be pushed straight into the database, but the problem still remains.
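The first option above (hashing each row) can be sketched as follows; the `RowHash` class and `hashLine` method names are illustrative, and SHA-256 is just one reasonable choice of digest, not something prescribed by the question:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowHash {
    // Hex-encoded SHA-256 digest of one line of the file.
    public static String hashLine(String line) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(line.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Comparing `hashLine(fileRow)` against the stored hash for the same key tells you whether the row needs an update, without comparing every column individually.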

Any advice will be appreciated.

Recommended answer

Problem definition as understood:

Suppose your file contains

ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40

As you stated, rows can be added/updated, hence the file becomes

ID,Name,Age
1,Jim,20    -- to be discarded
2,Tim,35    -- to be updated
3,Kim,40    -- to be discarded
4,Zim,30    -- to be inserted

Now the requirement is to update the database by inserting/updating only the above 2 records, either in two sql queries or in 1 batch query containing two sql statements.

I am making the following assumptions here:

  • You cannot modify the existing process that creates the files.
  • You are using some batch processing [reading from file - processing in memory - writing to DB] to upload the data into the database.
  • Store the hash values of Record [Name, Age] against ID in an in-memory Map, where ID is the key and the value is the hash [if you require scalability, use hazelcast].

    Your batch framework that loads the data [again assuming it treats one line of the file as one record] needs to check the computed hash value against the ID in the in-memory Map. First-time creation can also be done using your batch framework for reading the files.

     If (ID present)
        --- compare hash
        --- if found same, discard it
        --- if found different, create an update sql
     If (ID not present in the in-memory map)
        --- create an insert sql and insert the hash value
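A minimal sketch of the decision logic above, assuming `ID,Name,Age` lines; the `ChangeDetector` class, the placeholder use of `String.hashCode()`, and the SQL strings are illustrative assumptions, not part of the original answer (which suggests a Hazelcast map instead of `HashMap` when you need scalability):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChangeDetector {
    // ID -> hash from the previous import. In production this could be a
    // Hazelcast distributed map; a plain HashMap is used for illustration.
    private final Map<String, String> knownHashes = new HashMap<>();
    // SQL statements to be executed as one batch at the end of the run.
    public final List<String> pendingSql = new ArrayList<>();

    // Process one CSV line of the form "ID,Name,Age".
    public void processLine(String line) {
        int comma = line.indexOf(',');
        String id = line.substring(0, comma);
        String payload = line.substring(comma + 1);            // "Name,Age"
        String hash = Integer.toHexString(payload.hashCode()); // placeholder hash

        String known = knownHashes.get(id);
        if (known == null) {
            pendingSql.add("INSERT ... id=" + id);   // new row
            knownHashes.put(id, hash);
        } else if (!known.equals(hash)) {
            pendingSql.add("UPDATE ... id=" + id);   // changed row
            knownHashes.put(id, hash);
        }
        // hashes equal: same data, discard the record
    }
}
```

On the sample file above, a second pass over the updated data would queue exactly one UPDATE (for Tim) and one INSERT (for Zim), which is the minimal set of statements the answer aims for.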
    

    You might go for parallel processing, chunk processing, and in-memory data partitioning using spring-batch and hazelcast.

    http://www.hazelcast.com/

    http://static.springframework.org/spring-batch/

    Hope this helps.
