Issues with wikipedia dump table pagelinks


Question

I downloaded the enwiki-latest-pagelinks.sql.gz dump from dumps.wikimedia.org/enwiki/latest/.

I unpacked the file; its uncompressed size is 37 GB.

The table structure is:

SHOW CREATE TABLE wp_dump.pagelinks;

CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',
  UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
  KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
  KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary

I imported the table into a new, empty database:

mysql -D wp_dump -u root -p < enwiki-latest-pagelinks.sql

The computer I am running the task on has 16 GB of RAM and the MySQL database is located on an SSD, so I assumed that despite the table's size the import would not take too long.

However, the task has now been running for over a day and is still not finished. No other processes are accessing MySQL and there is no other workload on the machine.

The database file itself is now 79 GB.

ls -lh

-rw-r----- 1 mysql mysql   65 May 11 17:40 db.opt
-rw-r----- 1 mysql mysql 8,6K May 12 07:06 pagelinks.frm
-rw-r----- 1 mysql mysql  79G May 13 16:59 pagelinks.ibd

The table now has over 500 million rows.

SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'wp_dump';

+------------+------------+
| table_name | table_rows |
+------------+------------+
| pagelinks  |  520919860 |
+------------+------------+

I would like to know:

Is the enwiki-latest-pagelinks.sql data really over 79 GB once imported?

Does pagelinks really contain over 500 million rows?

Does it really take that long to import the pagelinks table?

Can you provide some metrics for the expected table size and row count, please?

Update: 14th May, 2017:

The insert is still running; the pagelinks.ibd file is now 130 GB; the row count is now almost 700 million.

Update: 16th May, 2017:

The insert is still running; the pagelinks.ibd file is now 204 GB; the row count is now over 1.2 billion.

I calculated the rows inserted per second over the last two days:

rows/sec = 3236

And: there are many thousands of rows per INSERT statement in the SQL script (head -41 enwiki-latest-pagelinks.sql | tail -1 | grep -o "(" | wc -l gives 30471).

So, my follow-up / modified questions:

Are the number of rows and the .ibd file size to be expected, given the 37 GB SQL file and the table structure listed above?

Is rows/sec = 3236 a good value (meaning that it will take a few days to insert the table)?

What may be the limiting speed factor / how can I speed up the import?

  • Disable the indexes (and calculate them after the insert)? (a combined sketch follows this list)
  • Optimize transactions (commit (nothing set in the script) / autocommit (now ON))?
  • Optimize variable settings (e.g. innodb_buffer_pool_size, now 134217728)?
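
For concreteness, here is a minimal sketch of what the transaction and unique-check options could look like when sourcing the dump from the mysql client. This is not from the original post, and the exact gain depends on the server configuration:

-- Run in the mysql client before loading the dump (illustrative values):
SET autocommit = 0;           -- commit once at the end instead of after every statement
SET unique_checks = 0;        -- defer uniqueness checks on secondary index entries
SET foreign_key_checks = 0;   -- no foreign keys here, but harmless to disable

source enwiki-latest-pagelinks.sql

COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;

Note that ALTER TABLE ... DISABLE KEYS only affects MyISAM tables; for this InnoDB table, "disabling the indexes" would mean removing the two secondary KEY definitions from the dump's CREATE TABLE and adding them back with ALTER TABLE ... ADD KEY after the load.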

Answer

37 GB of data --> a 79 GB InnoDB table seems reasonable...

  • Title: 2 quotes and 1 comma --> 1 byte for length
  • Ints: several bytes, plus comma --> 4 bytes for INT (regardless of the (...) after INT; see MEDIUMINT)
  • 20-30 bytes of overhead per row
  • 20-40% overhead for BTrees
  • The UNIQUE index becomes the PRIMARY KEY and is clustered with the data --> very little overhead
  • The other two indexes: each is virtually the same size as the data. That more than accounts for the increased size.

Adding it all together, I would expect the table to end up larger than 120 GB. So there are probably some details missing. A guess: the dump is one row per INSERT, instead of the less verbose many-rows-per-INSERT.
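
One way to sanity-check this estimate against the actual table is to compare the clustered data size with the secondary index size that MySQL reports. The query below is not part of the original answer, but it only uses standard information_schema columns:

SELECT table_name,
       ROUND(data_length  / POWER(1024, 3), 1) AS data_gb,   -- clustered index (data + PK)
       ROUND(index_length / POWER(1024, 3), 1) AS index_gb   -- the two secondary indexes
FROM   information_schema.TABLES
WHERE  table_schema = 'wp_dump' AND table_name = 'pagelinks';

If each of the two secondary indexes really is about as large as the data, index_gb should come out at roughly twice data_gb.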

As for performance, it all depends on the SELECTs. Set innodb_buffer_pool_size to somewhere around 11G. That may work efficiently enough for caching the 79 GB.
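
A sketch of how that could be applied; the 11G figure is the answer's suggestion, and the online resize below assumes MySQL 5.7 or later (on older versions, change my.cnf and restart):

-- Current value (134217728 bytes = 128 MB):
SELECT @@innodb_buffer_pool_size;

-- Raise it to about 11 GB; MySQL rounds this up to a multiple of the chunk size:
SET GLOBAL innodb_buffer_pool_size = 11 * 1024 * 1024 * 1024;

-- To keep the setting across restarts, also put innodb_buffer_pool_size = 11G in my.cnf.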

More

Change UNIQUE to PRIMARY, for clarity and because InnoDB really needs a PK.
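
A sketch of that change, assuming the CREATE TABLE statement is edited in the dump before importing (altering the already loaded 79 GB table would rebuild it in full):

CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`pl_from`,`pl_namespace`,`pl_title`),  -- was: UNIQUE KEY `pl_from` (...)
  KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
  KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;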

Check the source data. Is it in (pl_from, pl_namespace, pl_title) order? If not, can you sort the file before loading? If so, that alone should significantly help the speed.

128 MB for the buffer_pool is also significantly hampering progress.
