Loading large XML files into Snowflake and flattening by tag

Question

I have some extremely large XML files that I need to process. I used to process them with Spark, but I am moving away from SQLDW and onto Snowflake, so I can no longer use Spark. In Spark, you can flatten an XML file by passing a "rowTag" option to the reader. Say we have this persons.xml file:

<persons>
    <person id="1">
        <firstname>James</firstname>
        <lastname>Smith</lastname>
        <middlename></middlename>
        <dob_year>1980</dob_year>
        <dob_month>1</dob_month>
        <gender>M</gender>
        <salary currency="Euro">10000</salary>
        <addresses>
            <address>
                <street>123 ABC street</street>
                <city>NewJersy</city>
                <state>NJ</state>    
            </address>
            <address>
                <street>456 apple street</street>
                <city>newark</city>
                <state>DE</state>    
            </address>    
        </addresses>    
    </person>
    <person id="2">
        <firstname>Michael</firstname>
        <lastname></lastname>
        <middlename>Rose</middlename>
        <dob_year>1990</dob_year>
        <dob_month>6</dob_month>
        <gender>M</gender>
        <salary currency="Dollor">10000</salary>
        <addresses>
            <address>
                <street>4512 main st</street>
                <city>new york</city>
                <state>NY</state>    
            </address>
            <address>
                <street>4367 orange st</street>
                <city>sandiago</city>
                <state>CA</state>    
            </address>    
        </addresses>            
    </person>
</persons>

If I wanted to flatten this XML file to look like a CSV with headers firstname, lastname, middlename, dob_year, dob_month, etc., I would run something like this:

val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .load("persons.xml");
display(df);

By giving Spark the rowTag person in the .option() call, we get a DataFrame that looks like this:

_id addresses   dob_month   dob_year    firstname   gender  lastname    middlename  salary                          
1   {"address":[{"city":"NewJersy","state":"NJ","street":"123 ABC street"},{"city":"newark","state":"DE","street":"456 apple street"}]} 1   1980    James   M   Smith       {"_VALUE":10000,"_currency":"Euro"}
2   {"address":[{"city":"new york","state":"NY","street":"4512 main st"},{"city":"sandiago","state":"CA","street":"4367 orange st"}]}   6   1990    Michael M       Rose    {"_VALUE":10000,"_currency":"Dollor"}

Anyway, I was wondering how I could do this with Snowflake, if it is possible at all. I would like to avoid pre-processing my XML files if I can.

Remember, these files are large: 1 GB+. There is also no guarantee that the rowTag will appear at or near the beginning of a file; it could be several hundred lines down.

Recommended answer

A few ideas for you:

  1. On load, use STRIP_OUTER_ELEMENT = TRUE to eliminate the PERSONS tag and have each PERSON element land in its own row. This simplifies the data and allows you to load larger files, since no single row has to hold the whole document.

  2. Flatten the XML to find all the paths. For example: select * from my_table a, lateral flatten(input=>a.data, recursive=>true) b;

  3. Translate the paths from the flatten notation into the field notation and build your query.

For example (assuming the PERSONS outer tag has been removed):

select 
  data:"@id"::number id,
  data:"$"[0]."$"::text first_name,
  data:"$"[1]."$"::text last_name
from my_table; 

Here, data is your XML column.
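
Going back to the first two ideas, the load and the path discovery might look roughly like the sketch below. The names my_xml_format and my_stage are hypothetical placeholders, and the sketch assumes persons.xml has already been uploaded to that stage. Note that STRIP_OUTER_ELEMENT only removes the outermost element, so if your row tag is nested more deeply than the second level, some additional flattening would still be needed after the load.

```sql
-- Hypothetical object names (my_xml_format, my_stage); adjust to your environment.

-- File format that drops the outer <persons> element so that
-- each <person> element is loaded as its own row.
create or replace file format my_xml_format
  type = xml
  strip_outer_element = true;

-- Single VARIANT column to hold one parsed <person> per row.
create or replace table my_table (data variant);

-- Assumes persons.xml has already been staged in @my_stage.
copy into my_table
from @my_stage/persons.xml
file_format = (format_name = 'my_xml_format');

-- List the distinct paths found in the loaded XML,
-- together with the type of value stored at each path.
select distinct
  b.path,
  typeof(b.value) as value_type
from my_table a,
     lateral flatten(input => a.data, recursive => true) b
order by b.path;
```

The paths returned by the last query are what you would then translate into the field notation used in the select above.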

Hope that helps.

UPDATE -- Sample XML to use with the query above:

create or replace table my_table as
select parse_xml($1) as data 
from values ('
    <person id="1">
        <firstname>James</firstname>
        <lastname>Smith</lastname>
        <middlename></middlename>
        <dob_year>1980</dob_year>
        <dob_month>1</dob_month>
        <gender>M</gender>
        <salary currency="Euro">10000</salary>
        <addresses>
            <address>
                <street>123 ABC street</street>
                <city>NewJersy</city>
                <state>NJ</state>    
            </address>
            <address>
                <street>456 apple street</street>
                <city>newark</city>
                <state>DE</state>    
            </address>    
        </addresses>    
    </person>'),('
    <person id="2">
        <firstname>Michael</firstname>
        <lastname></lastname>
        <middlename>Rose</middlename>
        <dob_year>1990</dob_year>
        <dob_month>6</dob_month>
        <gender>M</gender>
        <salary currency="Dollor">10000</salary>
        <addresses>
            <address>
                <street>4512 main st</street>
                <city>new york</city>
                <state>NY</state>    
            </address>
            <address>
                <street>4367 orange st</street>
                <city>sandiago</city>
                <state>CA</state>    
            </address>    
        </addresses>            
    </person>
');
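
As a possible variation on the positional query, the sketch below looks elements up by tag name with XMLGET instead of by array index, and flattens the nested addresses so that each address gets its own row. It is an illustration against the sample my_table above, not part of the original answer, and the column aliases are arbitrary.

```sql
-- Sketch against the sample my_table defined above.
-- XMLGET finds a child element by tag name; "$" is the element's content
-- and "@name" is an attribute, as in the positional query.
select
  data:"@id"::number                          as id,
  xmlget(data, 'firstname'):"$"::string       as first_name,
  xmlget(data, 'lastname'):"$"::string        as last_name,
  xmlget(data, 'salary'):"$"::number          as salary,
  xmlget(data, 'salary'):"@currency"::string  as currency,
  xmlget(addr.value, 'street'):"$"::string    as street,
  xmlget(addr.value, 'city'):"$"::string      as city,
  xmlget(addr.value, 'state'):"$"::string     as state
from my_table,
     -- one output row per <address> under <addresses>
     lateral flatten(input => xmlget(data, 'addresses'):"$") addr;
```

Looking elements up by name should keep the query working if the order of the child elements differs between records. One caveat to check on your own data: when an element has only a single child, its "$" may come back as a single object rather than an array, so a person with exactly one address might need separate handling.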
