Loading large XML files into Snowflake and flattening by tag
Question
I have some extremely large XML files that I need to process. I used to process them with Spark, but I am moving from SQLDW to Snowflake, so I can no longer use Spark. In Spark, you can flatten an XML file by passing a "rowTag" option to the reader. Say we have this persons.xml file:
<persons>
<person id="1">
<firstname>James</firstname>
<lastname>Smith</lastname>
<middlename></middlename>
<dob_year>1980</dob_year>
<dob_month>1</dob_month>
<gender>M</gender>
<salary currency="Euro">10000</salary>
<addresses>
<address>
<street>123 ABC street</street>
<city>NewJersy</city>
<state>NJ</state>
</address>
<address>
<street>456 apple street</street>
<city>newark</city>
<state>DE</state>
</address>
</addresses>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname></lastname>
<middlename>Rose</middlename>
<dob_year>1990</dob_year>
<dob_month>6</dob_month>
<gender>M</gender>
<salary currency="Dollor">10000</salary>
<addresses>
<address>
<street>4512 main st</street>
<city>new york</city>
<state>NY</state>
</address>
<address>
<street>4367 orange st</street>
<city>sandiago</city>
<state>CA</state>
</address>
</addresses>
</person>
</persons>
If I want to flatten this XML file so that it looks like a CSV with headers firstname, lastname, middlename, dob_year, dob_month, etc., I would run something like this:
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.load("persons.xml");
display(df);
By giving Spark the rowTag person in the .option() call, we get a dataframe that looks like this:
_id | addresses | dob_month | dob_year | firstname | gender | lastname | middlename | salary
1 | {"address":[{"city":"NewJersy","state":"NJ","street":"123 ABC street"},{"city":"newark","state":"DE","street":"456 apple street"}]} | 1 | 1980 | James | M | Smith | | {"_VALUE":10000,"_currency":"Euro"}
2 | {"address":[{"city":"new york","state":"NY","street":"4512 main st"},{"city":"sandiago","state":"CA","street":"4367 orange st"}]} | 6 | 1990 | Michael | M | | Rose | {"_VALUE":10000,"_currency":"Dollor"}
It's a little difficult to read, so here is an image to help...
Anyway, I was wondering how I could do this with Snowflake, if it is possible at all. I would like to avoid pre-processing my XML files if I can.
Remember, these files are large: 1 GB+. There is also no guarantee that the rowTag appears at or near the beginning of a file; it could be several hundred lines down.
Recommended answer
A few ideas for you:
On load, use STRIP_OUTER_ELEMENT = TRUE to eliminate the PERSONS tag and have each PERSON object land in its own row. This simplifies the data and allows you to load larger files, because each row then holds a single PERSON element rather than the whole document.
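As a rough sketch of that load step, assuming the file has already been uploaded to a stage: the names my_stage and the file path below are placeholders that do not come from the question, and my_table is simply reused from the queries in this answer.

```sql
-- Hypothetical names: my_stage and persons.xml are placeholders.
-- A single VARIANT column receives one XML element per row.
create or replace table my_table (data variant);

copy into my_table
  from @my_stage/persons.xml
  file_format = (
    type = 'XML'
    strip_outer_element = true  -- drop <persons>; each <person> becomes its own row
  );
```

With the outer element stripped, the queries further down can address each PERSON directly through the data column.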
Flatten the XML to find all the paths. For example:
select *
from my_table a, lateral flatten(input=>a.data, recursive=>true) b;
Translate the paths from the flatten notation into the field notation and build your query.
For example (assuming the PERSONS outer tag has been removed):
select
data:"@id"::number id,
data:"$"[0]."$"::text first_name,
data:"$"[1]."$"::text last_name
from my_table;
where data is your XML column.
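The positional paths above depend on the order of the child elements. A variant that looks children up by tag name instead can be sketched with XMLGET; this is an alternative to the positional query, not part of the original answer, and the column aliases are chosen only to mirror the Spark output in the question.

```sql
-- Sketch: address child elements by tag name rather than by position.
-- "$" holds an element's content and "@name" holds an attribute.
select
  data:"@id"::number                        as id,
  xmlget(data, 'firstname'):"$"::text       as firstname,
  xmlget(data, 'lastname'):"$"::text        as lastname,
  xmlget(data, 'middlename'):"$"::text      as middlename,
  xmlget(data, 'dob_year'):"$"::number      as dob_year,
  xmlget(data, 'dob_month'):"$"::number     as dob_month,
  xmlget(data, 'gender'):"$"::text          as gender,
  xmlget(data, 'salary'):"$"::number        as salary,
  xmlget(data, 'salary'):"@currency"::text  as salary_currency
from my_table;
```

Lookup by name is slightly more verbose, but it keeps working if some PERSON elements omit a child or list the children in a different order.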
Hope that helps.
UPDATE -- Sample XML to use with the query above:
create or replace table my_table as
select parse_xml($1) as data
from values ('
<person id="1">
<firstname>James</firstname>
<lastname>Smith</lastname>
<middlename></middlename>
<dob_year>1980</dob_year>
<dob_month>1</dob_month>
<gender>M</gender>
<salary currency="Euro">10000</salary>
<addresses>
<address>
<street>123 ABC street</street>
<city>NewJersy</city>
<state>NJ</state>
</address>
<address>
<street>456 apple street</street>
<city>newark</city>
<state>DE</state>
</address>
</addresses>
</person>'),('
<person id="2">
<firstname>Michael</firstname>
<lastname></lastname>
<middlename>Rose</middlename>
<dob_year>1990</dob_year>
<dob_month>6</dob_month>
<gender>M</gender>
<salary currency="Dollor">10000</salary>
<addresses>
<address>
<street>4512 main st</street>
<city>new york</city>
<state>NY</state>
</address>
<address>
<street>4367 orange st</street>
<city>sandiago</city>
<state>CA</state>
</address>
</addresses>
</person>
');
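With that sample table in place, the nested addresses can be unrolled into one row per address. The following is only a sketch under the assumption that wrapping the content in TO_ARRAY is acceptable; it is used because an element with a single child yields an object rather than an array, and the aliases p and a are arbitrary.

```sql
-- Sketch: one output row per <address>, joined back to the person's id.
-- to_array() guards against a person with only one <address>, where the
-- content of <addresses> is a single object instead of an array.
select
  p.data:"@id"::number                  as person_id,
  xmlget(a.value, 'street'):"$"::text   as street,
  xmlget(a.value, 'city'):"$"::text     as city,
  xmlget(a.value, 'state'):"$"::text    as state
from my_table p,
  lateral flatten(input => to_array(xmlget(p.data, 'addresses'):"$")) a;
```

This is roughly the counterpart of exploding the addresses column of the Spark dataframe shown in the question.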