Skip rows in MySQL LOAD DATA INFILE statement when row has value 'x'


Question

Background: I have a fixed-width flat file with about 94 million rows of data. The file is from the HCUP Nationwide Inpatient Sample (NIS http://www.hcup-us.ahrq.gov/nisoverview.jsp), which provides information about hospitalizations over the past 12 years, each row a separate hospitalization. For my analyses, I will be querying diagnostic codes (ICD9-CM) to identify patients with various diagnoses.

The fixed-width file contains information on up to 15 diagnostic codes, which are provided as columns dx1 through dx15.

create table `core` (`key` char (14),
`dx1` char (5),
`dx10` char (5),
`dx11` char (5),
`dx12` char (5),
`dx13` char (5),
`dx14` char (5),
`dx15` char (5),
`dx2` char (5),
`dx3` char (5),
`dx4` char (5),
`dx5` char (5),
`dx6` char (5),
`dx7` char (5),
`dx8` char (5),
`dx9` char (5),
-- plus various other columns of patient demographics...
);

I loaded all of the data into a MySQL table named core and can index the 15 columns. However, it seems advantageous to normalize the dx* columns into a separate dx table, such as:

create table `dx` (
`key` char (14),
`icd9` char (5)
);

where key is a foreign key to the main core table. To load the data quickly into dx, I use:

LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 74, 5);
LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 79, 5);
LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 84, 5);
-- etc. for all 15 columns...
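The fifteen near-identical statements could be generated rather than typed by hand. The following is a minimal sketch in Python, not part of the original question. It assumes the dx fields are contiguous 5-character columns starting at position 74, which matches the three offsets shown above (74, 79, 84) but is not confirmed for the remaining twelve. The function name build_load_statements is hypothetical.

```python
# Offsets are 1-based because MySQL's SUBSTR counts characters from 1.
FIRST_DX_OFFSET = 74  # offset of dx1, taken from the statements in the question
DX_WIDTH = 5          # each diagnosis code is 5 characters wide
DX_COUNT = 15         # ASSUMPTION: all 15 dx fields are contiguous in the file


def build_load_statements(path="data.ASC"):
    """Return one LOAD DATA statement per dx column."""
    statements = []
    for i in range(DX_COUNT):
        offset = FIRST_DX_OFFSET + i * DX_WIDTH
        statements.append(
            f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE `dx` (@var1) "
            f"SET `key` = substr(@var1, 1, 14), "
            f"`icd9` = substr(@var1, {offset}, {DX_WIDTH});"
        )
    return statements


if __name__ == "__main__":
    for statement in build_load_statements():
        print(statement)
```

The printed statements can be saved to a .sql file and run with the mysql client. The offsets should be checked against the file's record layout before use.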

The catch is that each row in the fixed-width file has a median of only 3 diagnosis codes, so most of the dx* columns are just blank ('     ' [five blank characters]). So, while the dx table has 1.41 billion (94 million * 15) rows after loading data, about 1.13 billion (94 million * 12) are blank diagnostic codes.

I've been simply removing them afterwards and optimizing, prior to indexing:

DELETE FROM `dx` WHERE `icd9` = "     ";
OPTIMIZE TABLE `dx`;
CREATE INDEX `icd9` ON `dx` (`icd9`);

But this takes a lot of time.

Question: Is it possible to modify the LOAD DATA INFILE statement to skip the row if ICD9 = '     ' [five blank characters], and would this be significantly faster than my current DELETE and OPTIMIZE method? If so, I would like to pass this information on to future researchers working with these data.

Recommended answer


Is it possible to modify the LOAD DATA INFILE statement to skip the row if

No. There is an IGNORE option, but it skips a fixed number of lines at the start of the file (IGNORE number LINES) rather than evaluating an inline logical comparison for each row.


would this be significantly faster than my current DELETE and OPTIMIZE method

Likely. But, as it's not an option, it doesn't matter.
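Since the statement itself cannot filter rows, one possible workaround is to do the filtering outside MySQL, before the load. This is not part of the answer above. The sketch below uses Python to read the fixed-width lines and emit tab-separated key/icd9 pairs, skipping blank codes. It assumes the key occupies characters 1-14 and that the 15 dx fields are contiguous 5-character columns starting at character 74; only the first three offsets appear in the question. The names extract_dx_rows and write_tsv are hypothetical.

```python
import io

# Python slices are 0-based, so character 74 (1-based) is index 73.
KEY_START, KEY_LEN = 0, 14  # key: characters 1-14, as in the question
DX_START = 73               # dx1 starts at character 74
DX_LEN = 5                  # each diagnosis code is 5 characters wide
DX_COUNT = 15               # ASSUMPTION: the 15 dx fields are contiguous


def extract_dx_rows(lines):
    """Yield (key, icd9) pairs, skipping blank or missing diagnosis fields."""
    for line in lines:
        line = line.rstrip("\r\n")
        key = line[KEY_START:KEY_START + KEY_LEN]
        for i in range(DX_COUNT):
            start = DX_START + i * DX_LEN
            code = line[start:start + DX_LEN]
            if code.strip():  # an all-blank (or empty) field is skipped
                yield key, code


def write_tsv(lines, out):
    """Write the non-blank pairs as tab-separated lines; return the row count."""
    count = 0
    for key, code in extract_dx_rows(lines):
        out.write(f"{key}\t{code}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory demo: one record with two codes and thirteen blank fields.
    record = "K" * 14 + " " * 59 + "4019 " + "25000" + " " * 65
    buffer = io.StringIO()
    print(write_tsv([record], buffer), "rows written")
    print(buffer.getvalue(), end="")
```

For the real file, the same functions can be given open file objects in place of the in-memory list and buffer. The resulting file can then be loaded in a single pass with LOAD DATA LOCAL INFILE 'dx.tsv' INTO TABLE `dx` (`key`, `icd9`); because tab-separated fields and newline-terminated lines are the LOAD DATA defaults. Whether this is faster overall than DELETE plus OPTIMIZE would need to be measured on the actual data.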
