Proc Groovy to parse larger XML into SAS


Problem Description

We tried reading 3-4 GB XML files using the SAS XML Mapper, but when we PROC COPY the data from the XML engine to a SAS dataset it takes almost 5 to 6 minutes, which is too much time for us since we have to process 3,000 files a day. We run almost 10 files in parallel. One table has almost 230 columns.

Is there any other faster way to process the XML? Can we use PROC GROOVY? Would it be efficient? If yes, can anyone provide sample code? I tried searching online but was not able to find one.

The XML contains PII data and is huge, around 3 GB.

The code being run is very simple and straightforward:

filename NHL "/path/ODM.xml";      /* input XML file */
filename map "/path/odm_map.map";  /* XML map definition */
libname NHL xmlv2 xmlmap=map;      /* XMLV2 engine library driven by the map */
proc copy in=nhl out=work;         /* copy all member tables into WORK */
run;

Total tables created: 54, of which more than 14 tables have ~18,000 records and the remaining tables have ~1,000 records.

The log window shows:

NOTE: PROCEDURE COPY used (Total process time): 
      real time           4:03.72 
      user cpu time       4:00.68 
      system cpu time        1.17 seconds 
      memory              32842.37k 
      OS Memory           52888.00k 
      Timestamp           19/05/2020 03:14:43 PM 
      Step Count          4 Switch Count 802
      Page Faults 3 
      Page Reclaims 17172 
      Page Swaps 0 
      Voluntary Context Switches 3662 
      Involuntary Context Switches 27536 
      Block Input Operations 504 
      Block Output Operations 56512 

      SAS Version : 9.4_M2 

The total memory size on our server is MEMSIZE=3221225472.

Of the 3,000 files in total, 1,000 will be 3 to 4 GB, some will be 1 GB, and 1,000 will be in the KB range. The smaller files are processed quickly; the problem is only with the big files. Processing them uses almost the entire CPU.

The copy time from the XML engine varies when we reduce the number of files, but for that to happen we would have to change the map file or the input XML.

We have already raised SAS tracks and asked the same question in the SAS communities, still with no luck. It looks like a limitation of the parser itself.

Any idea about the shredder in Teradata? Would it be efficient?

Answer

I would do this in two pieces: first convert the XML to ASCII, then read that into SAS. SAS isn't going to be very fast at converting XML into SAS datasets; that's just not something SAS is optimized for. You're using nearly entirely CPU time, so you're not disk-limited - you're limited by SAS's ability to parse the XML file.

Write a program in a more optimized language that can parse the XML much faster, and then read the results of that into SAS. Python might be one option - it's not super optimized either, but I suspect it's more optimized for this sort of thing than SAS - or an even lower-level language (like C/C++) might be your best bet.
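As a minimal sketch of that two-step approach, the Python standard library's `iterparse` can stream a multi-GB XML file with flat memory use and emit a CSV that SAS can then import quickly. The `record_tag` and `fields` names here are assumptions - they would need to be adapted to the actual ODM structure described by the map file:

```python
import csv
import xml.etree.ElementTree as ET

def xml_to_csv(xml_path, csv_path, record_tag, fields):
    """Stream-parse a large XML file, writing one CSV row per record element.

    iterparse keeps memory flat regardless of file size because each
    record element is cleared as soon as its row has been written.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(fields)  # header row for the SAS import step
        for event, elem in ET.iterparse(xml_path, events=("end",)):
            # strip any XML namespace from the tag name before comparing
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag == record_tag:
                writer.writerow([elem.findtext(f, default="") for f in fields])
                elem.clear()  # free the finished record's subtree
```

The resulting CSV can then be loaded into SAS with PROC IMPORT or a plain DATA step, which is typically much faster than pulling the same rows through the XMLV2 engine.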

