Importing new database table


Problem description

Where I'm at, there is a main system that runs on a big AIX mainframe. To facilitate reporting and operations there is a nightly dump from the mainframe into SQL Server, such that each of our 50-ish clients is in their own database with identical schemas. This dump takes about 7 hours to finish each night, and there's not really anything we can do about it: we're stuck with what the application vendor has provided.

After the dump into SQL Server we use that data to run a number of other daily procedures. One of those procedures imports data into a kind of management reporting sandbox table, which combines records from a particularly important table across the different databases into one table that managers who don't know SQL can use to run ad-hoc reports without hosing up the rest of the system. This, again, is a business thing: the managers want it, and they have the power to see that we implement it.
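The consolidation step described above is essentially a cross-database INSERT ... SELECT per client. As a rough sketch (all names here, including ClientDB01, dbo.ReportSandbox, the column list, and the date filter, are hypothetical placeholders, not the actual vendor schema):

```sql
-- Hypothetical sketch of one client's import; real table and column
-- names come from the vendor schema and will differ.
INSERT INTO ReportingDB.dbo.ReportSandbox (ClientId, AccountId, TxnDate, Amount)
SELECT 1 AS ClientId,           -- tag each row with its source client
       AccountId, TxnDate, Amount
FROM ClientDB01.dbo.ImportantTable
WHERE TxnDate >= DATEADD(YEAR, -1, GETDATE());  -- placeholder for whatever
                                                -- filter cuts 40M rows to ~4M
```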

The import process for this table takes a couple of hours on its own. It filters down about 40 million records spread across 50 databases into about 4 million records, and then indexes them on certain columns for searching. Even at a couple of hours it's still less than a third as long as the initial load, but we're running out of time for overnight processing; we don't control the mainframe dump, and we do control this. So I've been tasked with looking for ways to improve the existing procedure.

Currently, the philosophy is that it's faster to load all the data from each client database and then index it afterwards in one step. Also, in the interest of avoiding bogging down other important systems in case it runs long, a couple of the larger clients are set to always run first (the main index on the table is by a clientid field). One other thing we're starting to do is load data from a few clients at a time in parallel, rather than each client sequentially.
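The load-then-index philosophy can be sketched like this (the table and index names are made up for illustration):

```sql
-- Sketch of the current approach: clear the sandbox, run the per-client
-- loads with no indexes in place, then build the index once at the end.
TRUNCATE TABLE dbo.ReportSandbox;

IF EXISTS (SELECT 1 FROM sys.indexes
           WHERE name = 'IX_ReportSandbox_ClientId'
             AND object_id = OBJECT_ID('dbo.ReportSandbox'))
    DROP INDEX IX_ReportSandbox_ClientId ON dbo.ReportSandbox;

-- ... per-client INSERT ... SELECT statements run here,
--     big clients first, a few clients at a time in parallel ...

CREATE INDEX IX_ReportSandbox_ClientId
    ON dbo.ReportSandbox (ClientId);
```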

So my question is: what would be the most efficient way to load this table? Are we right in thinking that indexing later is better? Or should we create the indexes before importing data? Should we be loading the table in index order, to avoid massive re-ordering of pages, rather than the big clients first? Could loading in parallel make things worse, by causing too much disk access all at once or removing our ability to control the order? Any other ideas?

Update

Well, something is up. I was able to do some benchmarking during the day, and there is no difference at all in the load time whether the indexes are created at the beginning or at the end of the operation, but we save the time of building the index itself (it of course builds nearly instantly with no data in the table).

Recommended answer

Index at the end, yes. Also consider setting the database recovery model to BULK_LOGGED to minimize writes to the transaction log. Just remember to set it back to FULL after you've finished.
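Assuming a database named ReportingDB (a placeholder), the switch looks like this; note that after flipping back to FULL, taking a log backup is needed to restart the log backup chain:

```sql
-- Minimize transaction-log writes during the bulk load window.
ALTER DATABASE ReportingDB SET RECOVERY BULK_LOGGED;

-- ... run the nightly import here ...

-- Restore full logging once the load is done.
ALTER DATABASE ReportingDB SET RECOVERY FULL;
```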
