Proc Groovy to parse larger XML into SAS
Problem description
We tried reading 3-4 GB XML files using the SAS XML Mapper, but when we PROC COPY the data from the XML engine to a SAS dataset it takes almost 5 to 6 minutes, which is too much time for us since we have to process 3000 files a day. We run almost 10 files in parallel. One table has almost 230 columns.
Is there any other, faster way to process the XML? Can we use PROC GROOVY? Would it be efficient? If yes, can anyone provide me sample code? I tried searching online but was not able to find any.
The XML contains PII data and is huge, around 3 GB.
The code being run is very simple and straightforward:
filename NHL "/path/ODM.xml";
filename map "/path/odm_map.map";
libname NHL xmlv2 xmlmap=map;
proc copy in=nhl out=work;
run;
Total tables created: 54, of which more than 14 tables have ~18,000 records and the remaining tables have ~1,000 records.
Log window:
NOTE: PROCEDURE COPY used (Total process time):
real time 4:03.72
user cpu time 4:00.68
system cpu time 1.17 seconds
memory 32842.37k
OS Memory 52888.00k
Timestamp 19/05/2020 03:14:43 PM
Step Count 4 Switch Count 802
Page Faults 3
Page Reclaims 17172
Page Swaps 0
Voluntary Context Switches 3662
Involuntary Context Switches 27536
Block Input Operations 504
Block Output Operations 56512
SAS Version : 9.4_M2
The total memory size on our server is MEMSIZE=3221225472 (3 GB).
Of the 3000 files in total, 1000 will be 3 to 4 GB, some will be around 1 GB, and 1000 will be in the KB range. The smaller files are processed quickly; the problem is only with the big files. Processing them uses almost the entire CPU.
The copy time from the XML engine varies when we reduce the number of tables, but for that to happen we would have to change the map file or the input XML.
We have already raised SAS tracks and asked the same question in the SAS communities, still no luck. It looks like a limitation of the parser itself.
Any idea about the shredder in Teradata? Would it be efficient?
Recommended answer
I would do this in two pieces: first convert the XML to ASCII, then read that into SAS. SAS is not going to be very fast at converting XML into SAS datasets; it's just not something SAS is optimized for. Your time is nearly all CPU time, so you're not disk limited - you're limited by SAS's ability to parse the XML file.
Write a program in a more optimized language that can parse the XML much faster, and then read its output into SAS. Python might be one option - it's not super optimized either, but I suspect it's better optimized for this sort of thing than SAS - or an even lower-level language (like C/C++) might be your best bet.
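As a rough illustration of that two-step approach, here is a minimal Python sketch that streams a large XML file with `xml.etree.ElementTree.iterparse` and writes flat CSV rows that SAS can then read with a simple DATA step. The tag and attribute names (`Subject`, `id`, `value`) are placeholders, not the actual ODM structure - you would map them to the elements your `.map` file extracts.

```python
# Sketch: stream-convert a large XML file to CSV without loading
# the whole document into memory. Tag names are hypothetical.
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE_XML = """<odm>
  <Subject id="S1"><value>10</value></Subject>
  <Subject id="S2"><value>20</value></Subject>
</odm>"""


def xml_to_csv(xml_stream, csv_stream):
    writer = csv.writer(csv_stream)
    writer.writerow(["id", "value"])
    # iterparse yields each element as its closing tag is reached,
    # so memory use stays bounded even for multi-GB inputs.
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "Subject":
            writer.writerow([elem.get("id"), elem.findtext("value")])
            elem.clear()  # release the subtree we just wrote


out = io.StringIO()
xml_to_csv(io.StringIO(SAMPLE_XML), out)
print(out.getvalue())
```

For real 3-4 GB files you would pass file paths instead of `StringIO`, and the resulting CSV imports into SAS far faster than the XMLV2 engine parses the original XML.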