Issues with wikipedia dump table pagelinks


Question

I downloaded the enwiki-latest-pagelinks.sql.gz dump from dumps.wikimedia.org/enwiki/latest/.

I unpacked the file; its uncompressed size is 37 GB.

The table structure is:

SHOW CREATE TABLE wp_dump.pagelinks;

CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',
  UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
  KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
  KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary

I imported the table into a new, empty database:

mysql -D wp_dump -u root -p < enwiki-latest-pagelinks.sql

The computer I am running the task on has 16 GB of RAM and the MySQL database is located on an SSD, so I assumed that despite the table's size the import would not take too long.

However, the task has now been running for over a day and is still not finished. No other processes are accessing MySQL and there is no other workload on the machine.

The database file itself is now 79 GB.

ls -lh

-rw-r----- 1 mysql mysql   65 May 11 17:40 db.opt
-rw-r----- 1 mysql mysql 8,6K May 12 07:06 pagelinks.frm
-rw-r----- 1 mysql mysql  79G May 13 16:59 pagelinks.ibd

The table now has over 500 million rows.

SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'wp_dump';

+------------+------------+
| table_name | table_rows |
+------------+------------+
| pagelinks  |  520919860 |
+------------+------------+

I would like to know:

Is the enwiki-latest-pagelinks.sql data really over 79 GB once imported?

Does pagelinks really contain over 500 million rows?

Does it really take that long to import the pagelinks table?

Can you provide some metrics for the expected table size and row count, please?

Update: 14th May, 2017:

The insert is still running; the pagelinks.ibd file is now 130 GB; the row count is now almost 700 million.

Update: 16th May, 2017:

The insert is still running; the pagelinks.ibd file is now 204 GB; the row count is now over 1.2 billion.

I calculated the rows inserted per second over the last two days:

rows/sec = 3236

And: there are many thousands of rows per INSERT statement in the SQL script (head -41 enwiki-latest-pagelinks.sql | tail -1 | grep -o "(" | wc -l gives 30471).

So, my follow-up / modified questions:

Are the number of rows and the .ibd file size to be expected, given the 37 GB SQL file and the table structure listed above?

Is rows/sec = 3236 a good value (meaning that it will take a few days to insert the table)?

What may be the limiting speed factor / how can I speed up the import?

  • Disable the indexes (and calculate them after the insert)? (a combined sketch follows this list)
  • Optimize transactions (commit (nothing set in the script) / autocommit (now ON))?
  • Optimize variable settings (e.g. innodb_buffer_pool_size, now 134217728)?
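
For concreteness, here is a minimal sketch of what the transaction and unique-check options could look like when sourcing the dump from the mysql client. This is not from the original post, and the exact gain depends on the server configuration:

-- Run in the mysql client before loading the dump (illustrative values):
SET autocommit = 0;           -- commit once at the end instead of after every statement
SET unique_checks = 0;        -- defer uniqueness checks on secondary index entries
SET foreign_key_checks = 0;   -- no foreign keys here, but harmless to disable

source enwiki-latest-pagelinks.sql

COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;

Note that ALTER TABLE ... DISABLE KEYS only affects MyISAM tables; for this InnoDB table, "disabling the indexes" would mean removing the two secondary KEY definitions from the dump's CREATE TABLE and adding them back with ALTER TABLE ... ADD KEY after the load.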

Answer

37 GB of data --> a 79 GB InnoDB table seems reasonable...

  • Title: 2 quotes and 1 comma --> 1 byte for length
  • Ints: several bytes, plus comma --> 4 bytes for INT (regardless of the (...) after INT; see MEDIUMINT)
  • 20-30 bytes of overhead per row
  • 20-40% overhead for BTrees
  • The UNIQUE index becomes the PRIMARY KEY and is clustered with the data --> very little overhead
  • The other two indexes: each is virtually the same size as the data. That more than accounts for the increased size.

Adding it all together, I would expect the table to end up larger than 120 GB. So there are probably some details missing. A guess: the dump is one row per INSERT, instead of the less verbose many-rows-per-INSERT.
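
One way to sanity-check this estimate against the actual table is to compare the clustered data size with the secondary index size that MySQL reports. The query below is not part of the original answer, but it only uses standard information_schema columns:

SELECT table_name,
       ROUND(data_length  / POWER(1024, 3), 1) AS data_gb,   -- clustered index (data + PK)
       ROUND(index_length / POWER(1024, 3), 1) AS index_gb   -- the two secondary indexes
FROM   information_schema.TABLES
WHERE  table_schema = 'wp_dump' AND table_name = 'pagelinks';

If each of the two secondary indexes really is about as large as the data, index_gb should come out at roughly twice data_gb.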

As for performance, it all depends on the SELECTs. Set innodb_buffer_pool_size to somewhere around 11G. That may work efficiently enough for caching the 79 GB.
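
A sketch of how that could be applied; the 11G figure is the answer's suggestion, and the online resize below assumes MySQL 5.7 or later (on older versions, change my.cnf and restart):

-- Current value (134217728 bytes = 128 MB):
SELECT @@innodb_buffer_pool_size;

-- Raise it to about 11 GB; MySQL rounds this up to a multiple of the chunk size:
SET GLOBAL innodb_buffer_pool_size = 11 * 1024 * 1024 * 1024;

-- To keep the setting across restarts, also put innodb_buffer_pool_size = 11G in my.cnf.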

More

Change UNIQUE to PRIMARY, for clarity and because InnoDB really needs a PK.
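
A sketch of that change, assuming the CREATE TABLE statement is edited in the dump before importing (altering the already loaded 79 GB table would rebuild it in full):

CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`pl_from`,`pl_namespace`,`pl_title`),  -- was: UNIQUE KEY `pl_from` (...)
  KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
  KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;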

Check the source data. Is it in (pl_from, pl_namespace, pl_title) order? If not, can you sort the file before loading? If so, that alone should significantly help the speed.

128 MB for the buffer_pool is also significantly hampering progress.
