Proc SQL:SAS 如何/何时移动数据 [英] Proc SQL: How / When does SAS Move the Data

查看:39
本文介绍了Proc SQL:SAS 如何/何时移动数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是 DBA/R 用户.我刚刚在一个充满 SAS 用户的办公室找到了一份工作,我正试图更好地了解 SAS 的 proc sql 是如何工作的.我知道 SAS 包括一个关系数据库,它包括针对 Oracle 等外部服务器运行 proc sql 的能力.我试图更好地了解它何时/如何决定使用数据库服务器而不是其内部数据库系统.

I am a DBA / R user. I just took a job in an office full of SAS users and I am trying to understand better how SAS' proc sql works. I understand that SAS includes a relational database and it includes the ability to run proc sql against external servers like Oracle. I am trying to better understand when / how it decides to use the database server rather than its internal database system.

我见过一些真正的 S.L.O.W. SAS 代码,我的同事在其中运行了一系列 proc sql 命令.这些程序通常包括 3 - 5 个 proc sql 步骤.每个 proc sql 命令创建一个本地 SAS 表.他们没有使用直通 sql.数据集很大(100 万行+),这些 proc sql 步骤运行缓慢.大多数数据都存在于服务器上.通常有一个小表来定义我们要查看的人口,它位于 SAS 数据文件中,但其他所有内容都位于服务器上.

I have seen some really S. L. O. W. SAS code where my coworkers running a series of proc sql commands. These programs typically include 3 - 5 proc sql steps. Each proc sql command creates a local SAS table. They are not using passthrough sql. The data sets are large (1 million rows +) and these proc sql steps run slowly. Most of the data lives on the server. There is usually a small table that defines the population that we want to look at and it is in a SAS data file, but everything else lives on the server.

通过直接在服务器上运行所有查询,我已经证明了速度的显着提高.(在这种情况下是 Oracle,但我认为这并不重要.)通常,我必须首先将一个表上传到我的个人模式,该模式定义了我们要检查的客户群.其他一切都在服务器上.有时我会将他们的查询合并在一起,因为它们可以在一个步骤中完成,但我不相信这就是为什么我的他们程序版本的速度如此之快.

I have demonstrated dramatic improvements in speed by simply running all of the queries directly on the server. (Oracle in this case, but I don't think that is important.) Usually, I have to first upload a table to my personal schema that defines the population of clients we want to examine. Everything else is on the server. Sometimes I collapse their queries together because they can be done in a single step, but I do not believe that is why my version of their program is so much faster.

我认为 proc sql 上传初始数据集,然后在服务器上运行第一个查询.然后将输出下载到本地计算机,创建本地 SAS 数据集.对于第二个 proc sql 步骤,它将在第一步中创建的表上传回服务器,然后在服务器上运行查询.更糟糕的是,本地"SAS 数据集实际上存储在远程服务器上,而不是实际的本地机器上.这对 SAS 来说是不可见的,但这确实意味着我们正在再次通过网络复制数据.我认为 SAS 运行缓慢是因为大量不必要的网络流量.

I think proc sql uploads the initial data set and then runs the first query on the server. It then downloads the output to the local computer, creating the local SAS data set. For the second proc sql step, it uploads the table created in step one back to the server and then runs the query on the server. To make this all even worse, the "local" SAS data sets are actually stored on a remote server, not the actual local machine. This is invisible to SAS, but it does mean we are copying data across the network yet again. I believe SAS is running slowly because of a large amount of unnecessary network traffic.

问题 #1 - 我对 proc sql 的理解是否正确?我们真的浪费了我认为通过网络上传和下载大型表格/数据集的时间吗?

Question #1 - Is my understanding of what proc sql is doing correct? Are we really wasting as much time as I think we are uploading and downloading large tables / data sets across our network?

问题 #2 - 有什么方法可以控制 proc sql 何时针对服务器运行,而何时针对本地数据库运行?在某些情况下,如果我们可以阻止上传/下载步骤,查询会更有效地运行.

Qeustion #2 - Is there some way to control when proc sql runs against a server versus when it runs against the local database? In some cases, if we could prevent the upload / download step, the query would run more efficiently.

推荐答案

简短回答

您的理解并不完全正确,但它在正确的范围内.SQL 可能不会将 SAS 数据集发送到服务器,它更有可能将服务器数据下载到 SAS - 但它可能正在下载整个表,不受连接条件的限制.您的解决方案正是我所建议的 - 希望您的同事能够加入.

Short answer

Your understanding is not exactly correct, but it's in the right ballpark. SQL is probably not sending the SAS dataset to the server, it is more likely downloading the server data to SAS - but it's probably downloading the entire table, not limited by the join criteria. Your solution is exactly what I would suggest doing - hopefully your colleagues will get on board.

就处理的工作方式而言,这取决于您的代码.PROC SQL 将在本地执行代码(如在 SAS 服务器/桌面上),除非它决定将查询传递给服务器并且未被告知不允许这样做.这称为隐式直通.您无法真正控制它,除非将其完全关闭(在 PROC SQL 语句中使用 noipassthru).您有时可以使用 options msglevel=i;(系统选项)和 _METHOD_TREE 来查看 SQL 决定做什么(类似于解释计划).

In terms of how the processing works, it depends on your code. PROC SQL will execute code locally (as in, on the SAS server/desktop), unless it decides to pass the query up to the server and hasn't been told it's not allowed to. That's called implicit passthrough. You can't really control it except to turn it entirely off (with noipassthru on the PROC SQL statement). You can look at it sometimes using options msglevel=i; (a system option), and _METHOD or _TREE to see what SQL decided to do (similar to explain plan).

我遇到过它造成危害的情况:SQL Server 不区分大小写地运行字符比较,而 SAS 不这样做,而且我有一个特定的查询,有时会发送到服务器,有时不取决于数据的详细信息.我在检查大小写时不够小心,所以当它确实不正确时它似乎可以工作(将 Propcase 与 UPCASE 进行比较).

I've had cases where it caused harm: SQL Server runs character comparisons case-insensitively while SAS does not, and I had a particular query that sometimes was sent up to the server and sometimes not depending on details of the data. I wasn't careful enough with checking case, and so it appeared to work when it really wasn't correct (comparing Propcase to UPCASE).

一般规则是 SAS 会在以下情况下尝试将查询发送到服务器:

The general rule is that SAS will try to send the query to the server if:

  • 查询中的数据已经完全驻留在服务器上
  • 查询非常简单,SAS 可以轻松找出如何用其母语告诉服务器执行此操作

如果您使用本地 SAS 数据集运行查询(例如,将服务器表在本地连接到 SAS 数据集),则它不会(至少据我所知)不会转到服务器.它应该始终在本地运行,这意味着从服务器下载贡献表中的所有数据(如果查询中有逻辑过滤器,则可能会被过滤).IE(这些示例不一定是好的 SQL 代码,只是概念示例):

If you're running a query with local SAS dataset (say, joining a server table to a SAS dataset locally), it won't (at least as far as I know) go to the server. It should always run it locally, which would mean downloading from the server all data in the contributing tables (possibly filtered if there is a logical filter in the query). IE (these examples aren't necessarily good SQL code, just examples of concept):

libname oralib oracle [connection info];
proc sql;
*Will pass through likely;
select tableA.*, tableB.cost 
  from oralib.tableA inner join oralib.tableB 
  on tableA.id=tableB.id;
*Will probably not pass through;
select tableA.*, tableB.cost 
  from oralib.tableA inner join work.tableB
  on tableA.id=tableB.id;
*Might pass through, might not;
select tableA.*, tableB.cost, tableC.productID
  from oralib.tableA inner join oralib.tableB 
  on tableA.id=tableB.id
  left join oralib.tableC
  on tableA.id=tableC.id;
*This downloads the data but probably applies the where statement server side;
select tableA.*, tableB.cost 
  from oralib.tableA inner join work.tableB
  on tableA.id=tableB.id
  where tableA.date < '01JAN2010'd;
quit;

在第二个查询的情况下,它可能会拉下所有 tableA.在第四个查询中,它可能会将 where 子句传递给服务器(假设日期不会导致问题,但不应该,SAS 知道如何将日期转换为 oracle 类型的日期).

In the case of the second query, it probably pulls all of tableA down. In the fourth query, it likely will pass the where clause to the server (assuming the date doesn't cause a problem, but it shouldn't, SAS knows how to convert dates to oracle type dates).

请注意,SAS 过程也可以生成直通.PROC MEANS等,将指令发送给Oracle做means/sums/etc.如果它可以轻松做到.

Note that SAS procs can also generate passthrough. PROC MEANS, etc., will send the instructions to Oracle to do the means/sums/etc. if it can easily do so.

最好的办法是:

  1. 尝试做所有可以通过的事情(这是有道理的).确保它进入服务器的唯一方法是使用直通.
  2. 如果您在服务器上有一张大表,而在 SAS 中有一张小表,请将 SAS 中的表上传到服务器.直通会话和 libname 会话无法看到彼此的特定于会话的临时表,因此您必须使用 GTT 或类似的(所有用户都可以看到的).同样,如果您在 SAS 中有一个大表,而在 SQL 中有一个小表(或小查询结果),请在本地将其关闭(必要时通过直通).
  3. 当你不得不减少事情的时候,尽可能地限制.当我在这种环境中工作时,只需加入服务器上的表以限制结果集,然后再关闭它们,我就节省了大量时间.

归根结底,无论您做什么,都会受到网络流量的限制;尽你所能尝试优化它.听起来您已经了解如何执行此操作,因此只需执行您在非 SAS 环境中通常会执行的操作即可.

At the end of the day, you will be constrained by network traffic no matter what you do; just try to optimize it as best you can. It sounds like you understand how to do that already, so just do what you normally would do in non-SAS environments.

这篇关于Proc SQL:SAS 如何/何时移动数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆