Importing a CSV file (with empty strings and duplicates) into DynamoDB


Problem description

I have a CSV file that I'm trying to import to Amazon DynamoDB. So I upload it to S3, set up an EMR cluster, and create an external table like this:

hive> CREATE EXTERNAL TABLE s3_table_myitems (colA BIGINT, colB STRING, colC STRING, colD DOUBLE, colE DOUBLE, colF STRING, colG STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ('serialization.null.format'='""')
    STORED AS TEXTFILE
    LOCATION 's3://bucketname/dirname/'
    TBLPROPERTIES ('skip.header.line.count'='1');

Any of the columns in the CSV may be empty, but DynamoDB can't deal with empty strings ("com.amazonaws.AmazonServiceException: One or more parameter values were invalid: An AttributeValue may not contain an empty string").

Here's what Amazon says about it:


We will consider this optional "ignore empty string" behavior in a future release. … As a workaround, you could … transform empty attribute values into NULLs. For example, you can … use a more complex SELECT expression to turn empty strings into something else, including setting them to NULL.

So this is what I came up with, but it looks ugly:

hive> INSERT INTO TABLE ddb_tbl_ingredients
    SELECT
    regexp_replace(colA, '^$', 'NULL'),
    regexp_replace(colB, '^$', 'NULL'),
    regexp_replace(colC, '^$', 'NULL'),
    regexp_replace(colD, '^$', 'NULL'),
    regexp_replace(colE, '^$', 'NULL'),
    regexp_replace(colF, '^$', 'NULL'),
    regexp_replace(colG, '^$', 'NULL')
    FROM s3_table_ingredients;
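
For illustration, the per-row logic of that SELECT can be sketched as a plain Python transformation (a minimal sketch with a hypothetical three-column layout; the question deliberately avoids pre-processing the CSV, but this shows what the regexp_replace expressions do):

```python
import csv
import io

def blank_to_placeholder(row, placeholder="NULL"):
    """Replace empty-string fields with a placeholder, mirroring
    the regexp_replace(col, '^$', 'NULL') expressions above."""
    return [placeholder if field == "" else field for field in row]

# Hypothetical example: a small CSV with some empty fields.
raw = "colA,colB,colC\n1,,x\n2,y,\n"
reader = csv.reader(io.StringIO(raw))
header = next(reader)  # skip the header, like skip.header.line.count
cleaned = [blank_to_placeholder(row) for row in reader]
print(cleaned)  # [['1', 'NULL', 'x'], ['2', 'y', 'NULL']]
```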

Is there a better solution to the overall problem (short of pre-processing the CSV), or at least a better SELECT syntax?

Edit: I ended up having to deal with duplicates as well ("com.amazonaws.AmazonServiceException: Provided list of item keys contains duplicates").

For posterity, here's my complete flow. I'd love to hear of a better way of doing this, both for aesthetics and for performance. The task is seemingly simple ("importing a CSV file into DynamoDB") but coming up with this has taken hours so far :P

# source
hive> CREATE EXTERNAL TABLE s3_table_myitems (colA STRING, colB STRING, colC DOUBLE, colD DOUBLE, colE STRING, colF STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ('serialization.null.format'='""')
    STORED AS TEXTFILE
    LOCATION 's3://bucketname/dirname/'
    TBLPROPERTIES ('skip.header.line.count'='1');

# destination
hive> CREATE EXTERNAL TABLE ddb_tbl_myitems (colA STRING, colB STRING, colC DOUBLE, colD DOUBLE, colE STRING, colF STRING)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES ("dynamodb.table.name" = "myitems",
        "dynamodb.column.mapping" = "colA:colA,colB:colB,colC:colC,colD:colD,colE:colE,colF:colF");

# remove dupes - http://stackoverflow.com/a/34165762/594211
hive> CREATE TABLE tbl_myitems_deduped AS
    SELECT colA, min(colB) AS colB, min(colC) AS colC, min(colD) AS colD, min(colE) AS colE, min(colF) AS colF
    FROM (SELECT colA, colB, colC, colD, colE, colF, rank() OVER
        (PARTITION BY colA ORDER BY colB, colC, colD, colE, colF)
        AS col_rank FROM s3_table_myitems) t
    WHERE t.col_rank = 1
    GROUP BY colA;
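
The dedup query's logic (rank rows within each colA partition by the remaining columns, keep the rows tied for rank 1, then take the column-wise minimum per group) can be sketched in Python; this is a rough stand-in for the Hive semantics, not how Hive executes it:

```python
from itertools import groupby

def dedupe(rows):
    """Mimic the Hive query: within each colA group, keep only rows
    tied for the smallest ordering tuple (rank() = 1), then take the
    column-wise min across those rows."""
    rows = sorted(rows)  # sort by colA, then the remaining columns
    result = []
    for key, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        best = min(g[1:] for g in group)             # smallest ordering tuple
        tied = [g for g in group if g[1:] == best]   # the rank() = 1 rows
        # column-wise min over the tied rows (a no-op unless rows are identical)
        mins = [min(col) for col in zip(*(g[1:] for g in tied))]
        result.append((key, *mins))
    return result

rows = [("a", "2", "x"), ("a", "1", "y"), ("b", "3", "z")]
print(dedupe(rows))  # [('a', '1', 'y'), ('b', '3', 'z')]
```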

# replace empty strings with placeholder 'NULL'
hive> CREATE TABLE tbl_myitems_noempty AS
    SELECT colA,
    regexp_replace(colB, '^$', 'NULL') AS colB,
    regexp_replace(colC, '^$', 'NULL') AS colC,
    regexp_replace(colD, '^$', 'NULL') AS colD,
    regexp_replace(colE, '^$', 'NULL') AS colE,
    regexp_replace(colF, '^$', 'NULL') AS colF
    FROM tbl_myitems_deduped
    WHERE LENGTH(colA) > 0;

# ...other preprocessing here...

# insert to DB
hive> INSERT INTO TABLE ddb_tbl_myitems
    SELECT * FROM tbl_myitems_noempty;

Note: colA is the partition key.

Recommended answer

You can add additional table properties to your create table statement that will treat any specified character as a null value.

TBLPROPERTIES('serialization.null.format'='');
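
For context, here is a sketch of how the question's source-table DDL might look with that property applied (untested; whether OpenCSVSerde honors serialization.null.format may depend on the Hive version in use):

```sql
CREATE EXTERNAL TABLE s3_table_myitems (colA STRING, colB STRING, colC DOUBLE, colD DOUBLE, colE STRING, colF STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    STORED AS TEXTFILE
    LOCATION 's3://bucketname/dirname/'
    TBLPROPERTIES ('skip.header.line.count'='1',
        'serialization.null.format'='');
```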
