The ultimate MySQL legacy database nightmare

Question

Table1: Everything including the kitchen sink. Dates in the wrong format (year last, so you cannot sort on that column), numbers stored as VARCHAR, complete addresses in the 'street' column, first name and last name in the firstname column, city in the lastname column, incomplete addresses, rows that update preceding rows by moving data from one field to another based on a set of rules that has changed over the years, duplicate records, incomplete records, garbage records... you name it... oh, and of course not a TIMESTAMP or PRIMARY KEY column in sight.
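Since the dates only fail to sort because the year comes last, one fix is to rewrite them into ISO year-first order during the cleanup pass. A minimal sketch, assuming the legacy column holds strings like 'DD-MM-YYYY' (the real format would need to be confirmed against the data):

```python
from datetime import datetime

def to_sortable(legacy_date: str) -> str:
    """Rewrite an assumed 'DD-MM-YYYY' string as ISO 'YYYY-MM-DD'.

    Returns '' for values that do not parse, so garbage rows can be
    filtered out instead of aborting the whole load.
    """
    try:
        parsed = datetime.strptime(legacy_date.strip(), "%d-%m-%Y")
    except ValueError:
        return ""
    return parsed.strftime("%Y-%m-%d")
```

Once the column holds ISO strings, a plain lexicographic sort is also a chronological sort.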

Table2: Any hope of normalization went out the window upon cracking this baby open. We have a row for each entry AND each update of a row in Table1. So duplicates like there is no tomorrow (800MB worth), and columns like Phone1, Phone2, Phone3, Phone4... Phone15 (they are not actually called "phone"; I use this for illustration). The foreign key is... well, take a guess. There are three candidates, depending on what kind of data was in the row in Table1.

Table3: Can it get any worse? Oh yes. The "foreign key" is a VARCHAR column combining dashes, dots, numbers and letters! If that doesn't provide the match (which it often doesn't), then a second column holding a similar product code should. There are columns whose names bear NO correlation to the data within them, the obligatory Phone1, Phone2, Phone3, Phone4... Phone15, columns duplicated from Table1, and not a TIMESTAMP or PRIMARY KEY column in sight.
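One way to get any mileage out of such a free-form key is to normalize both sides before comparing, and fall back to the product-code column when the primary match fails. A hypothetical sketch — `messy_key` and `product_code` are invented names for illustration, not the real columns:

```python
import re

def normalize_key(raw: str) -> str:
    """Strip dashes, dots and whitespace and upper-case the rest,
    so '12-34.ab' and '1234AB' compare equal."""
    return re.sub(r"[-.\s]", "", raw).upper()

def match_row(row: dict, index_by_key: dict, index_by_product: dict):
    """Try the normalized VARCHAR key first, then the product code.

    Both index dicts are assumed to be keyed on already-normalized
    values. Returns None when neither column matches anything.
    """
    key = normalize_key(row.get("messy_key", ""))
    if key in index_by_key:
        return index_by_key[key]
    code = normalize_key(row.get("product_code", ""))
    return index_by_product.get(code)
```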

Table4: Was described as a work in progress and subject to change at any moment. It is essentially similar to the others.

At close to 1m rows this is a BIG mess. Luckily it is not my big mess. Unluckily, I have to pull out of it a composite record for each "customer".

Initially I devised a four-step translation of Table1, adding a PRIMARY KEY and converting all the dates into a sortable format. Then came a couple more steps of queries that returned filtered data, until I had Table1 to the point where I could use it to pull from the other tables to form the composite. After weeks of work I got this down to one step using some tricks. So now I can point my app at the mess and pull out a nice clean table of composited data. Luckily I only need one of the phone numbers for my purposes, so normalizing my table is not an issue.
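One way to collapse a multi-step translation into a single pass is to chain per-row transforms as generators, so every row flows through all the fix-ups in one scan of the table. A sketch of that idea; the individual rules below are placeholders, not the actual tricks used:

```python
def add_key(rows):
    """Assign a surrogate PRIMARY KEY, since the source has none.
    Keys are assigned before filtering, so survivors may have gaps."""
    for i, row in enumerate(rows):
        row["pk"] = i + 1
        yield row

def fix_dates(rows):
    """Flip an assumed 'DD-MM-YYYY' value into sortable ISO order;
    anything that does not split into three parts is blanked out."""
    for row in rows:
        parts = row.get("date", "").split("-")
        if len(parts) == 3:
            d, m, y = parts
            row["date"] = f"{y}-{m}-{d}"
        else:
            row["date"] = ""
        yield row

def drop_garbage(rows):
    """Placeholder filter rule: discard rows left without a date."""
    for row in rows:
        if row.get("date"):
            yield row

def one_pass(rows):
    """Chain the generators: each row flows through every step
    before the next row is read, i.e. one scan of the table."""
    return drop_garbage(fix_dates(add_key(rows)))
```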

However this is where the real task begins, because every day hundreds of employees add/update/delete this database in ways you don't want to imagine and every night I must retrieve the new rows.

Since existing rows in any of the tables can be changed, and since there are no TIMESTAMP ON UPDATE columns, I will have to resort to the logs to know what has happened. Of course this assumes that there is a binary log, which there is not!

Introducing the concept went down like a lead balloon. I might as well have told them that their children are going to have to undergo experimental surgery. They are not exactly hi-tech... in case you hadn't gathered...

The situation is a little delicate as they have some valuable information that my company wants badly. I have been sent down by senior management of a large corporation (you know how they are) to "make it happen".

I can't think of any other way to handle the nightly updates than parsing the bin log file with yet another application, to figure out what they have done to that database during the day, and then compositing my table accordingly. I really only need to look at their Table1 to figure out what to do to my table. The other tables just provide fields to flesh out the record. (Using MASTER/SLAVE replication won't help, because I would just have a replica of the mess.)

The alternative is to create a unique hash for every row of their Table1 and build a hash table. Then I would go through the ENTIRE database every night checking to see if the hashes match. If they do not, I would read that record and check whether it exists in my database; if it does, I would update it in my database; if it doesn't, it's a new record and I would INSERT it. This is ugly and not fast, but parsing a binary log file is not pretty either.
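The hash approach above can be sketched as follows: digest each row's values in a fixed column order, keep the digests from the previous night, and treat any row whose digest is unseen as new-or-changed. The INSERT-vs-UPDATE decision then needs a lookup against the composite table, which is out of scope here:

```python
import hashlib

def row_hash(row: dict, columns: list) -> str:
    """Digest the row's values in a fixed column order, so identical
    content always produces the same hash."""
    joined = "\x1f".join(str(row.get(c, "")) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def changed_rows(rows, columns, seen_hashes):
    """Yield rows whose digest is not in seen_hashes (new or modified).

    seen_hashes is the set of digests saved from the previous run;
    unchanged rows are skipped without touching the target database.
    """
    for row in rows:
        if row_hash(row, columns) not in seen_hashes:
            yield row
```

The `\x1f` unit separator keeps adjacent fields from concatenating into the same string for different value splits, which would produce false hash matches.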

I have written this to help get clear about the problem. Often, telling it to someone else helps clarify the problem, making a solution more obvious. In this case I just have a bigger headache!

Your thoughts would be greatly appreciated.

Answer

The log files (binary logs) were my first thought too. If you knew how they did things, you would shudder. For every row there are many, many entries in the log as pieces are added and changed. It's just HUGE! For now I have settled upon the hash approach. With some clever file memory paging, this is quite fast.
