Writing Efficient Queries in SAS Using Proc sql with Teradata

Question

Here is a more complete set of code that shows exactly what's going on per the answer below.

libname output '/data/files/jeff';
%let DateStart = '01Jan2013'd;
%let DateEnd = '01Jun2013'd;
proc sql;
/* Step 1: IDs with activity in the categories of interest (stored as a local SAS table) */
CREATE TABLE output.id AS
  SELECT DISTINCT sv.id
  FROM mydb.sale_volume AS sv
  WHERE sv.category IN ('a', 'b', 'c') AND
    sv.trans_date BETWEEN &DateStart AND &DateEnd
;
/* Step 2: total sales per ID, joining the Teradata table to the local SAS table */
CREATE TABLE output.sums AS
  SELECT sv.id, SUM(sv.sales) AS total_sales
  FROM mydb.sale_volume AS sv
  INNER JOIN output.id AS ids
    ON ids.id = sv.id
  WHERE sv.trans_date BETWEEN &DateStart AND &DateEnd
  GROUP BY sv.id
;
quit;

The goal is to simply query the table for some id's based on category membership. Then I sum these members' activity across all categories.

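The approach above is far slower than:

  1. Running the first query to get the subset
  2. Running a second query that sums the values for every ID
  3. Running a third query that inner joins the two result sets.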

If I'm understanding correctly, it may be more efficient to make sure that all of my code is completely passed through rather than cross-loading.

After posting a question yesterday, a member suggested I might benefit from asking a separate question on performance that was more specific to my situation.

I'm using SAS Enterprise Guide to write some programs/data queries. I don't have permissions to modify the underlying data, which is stored in 'Teradata'.

My basic problem is writing efficient SQL queries in this environment. For example, I query a large table (with tens of millions of records) for a small subset of ID's. Then, I use this subset to query the larger table again:

proc sql;
/* x, a and b are placeholders for the real filter values */
CREATE TABLE subset AS
  SELECT
    id
  FROM
    bigTable
  WHERE
    someValue = x AND
    date BETWEEN a AND b
;
quit;

This works in a matter of seconds and returns 90k ID's. Next, I want to query this set of ID's against the big table, and problems ensue. I'm wanting to sum values over time for the ID's:

proc sql;
/* Sum the values for the subset of IDs over the same date range */
CREATE TABLE subset_data AS
  SELECT
    bigTable.id,
    SUM(bigTable.value) AS total
  FROM
    bigTable
  INNER JOIN subset
    ON subset.id = bigTable.id
  WHERE
    bigTable.date BETWEEN a AND b
  GROUP BY
    bigTable.id
;
quit;

For whatever reason, this takes a really long time. The difference is that the first query flags 'someValue'. The second looks at all activity, regardless of what's in 'someValue'. For example, I could flag every customer who orders a pizza. Then I would look at every purchase for all customers who ordered pizza.

I'm not overly familiar with SAS so I'm looking for any advice on how to do this more efficiently or speed things up. I'm open to any thoughts or suggestions and please let me know if I can offer more detail. I guess I'm just surprised the second query takes so long to process.

Recommended Answer

The most critical thing to understand when using SAS to access data in Teradata (or any other external database for that matter) is that the SAS software prepares SQL and submits it to the database. The idea is to try and relieve you (the user) from all the database-specific details. SAS does this using a concept called "implicit pass-through", which just means that SAS does the translation from SAS code into DBMS code. Among the many things that occur is data type conversion: SAS has only two data types, numeric and character.
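To see what implicit pass-through actually sends to the database, one option is to turn on SAS tracing. The following is a minimal sketch, not a definitive setup: the libref `tdata`, the connection values, and the output table `work.unpaid` are assumptions for illustration, and the table and column names are simply borrowed from the invoice example in this answer.

```sas
/* Write the SQL that SAS generates for the DBMS to the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Hypothetical libref; user, password and database are placeholders */
libname tdata teradata user=userid password=password database=base;

proc sql;
   /* Ordinary PROC SQL against the libref: SAS decides how much of this
      it can translate and hand to Teradata (implicit pass-through) */
   create table work.unpaid as
   select customer_id
        , last_payment_amt
   from tdata.invoices
   where paid_flag = 'N';
quit;
```

The log then shows the statement SAS prepared for Teradata, which makes it easier to tell whether a WHERE clause or join was passed to the database or handled by SAS after the rows were pulled back.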

SAS deals with translating things for you, but it can be confusing. For example, I've seen "lazy" database tables defined with VARCHAR(400) columns having values that never exceed some smaller length (like a column for a person's name). In the database this isn't much of a problem, but since SAS does not have a VARCHAR data type, it creates a variable 400 characters wide for each row. Even with data set compression, this can really make the resulting SAS dataset unnecessarily large.
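One way to keep such a column from inflating the SAS dataset is to set the width yourself when you pull the data. This is only a sketch under assumptions: the libref `tdata`, the connection values, the table `customers`, and the width of 50 are all illustrative, and any value longer than the chosen length would be truncated.

```sas
/* Hypothetical libref; user, password and database are placeholders */
libname tdata teradata user=userid password=password database=base;

proc sql;
   create table work.customer_names as
   select customer_id
          /* LENGTH= sets the width of the SAS variable instead of
             inheriting the 400 characters declared in the database */
        , customer_name length=50
   from tdata.customers;
quit;
```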

The alternative is to use "explicit pass-through", where you write native queries using the actual syntax of the DBMS in question. These queries execute entirely on the DBMS and return results back to SAS (which still does the data type conversion for you). For example, here is a "pass-through" query that joins two tables and creates a SAS dataset as a result:

proc sql;
   connect to teradata (user=userid password=password mode=teradata);
   create table mydata as
   select * from connection to teradata (
      select a.customer_id
           , a.customer_name
           , b.last_payment_date
           , b.last_payment_amt
      from base.customers a
      join base.invoices b
      on a.customer_id=b.customer_id
      where b.bill_month = date '2013-07-01'
        and b.paid_flag = 'N'
      );
quit;

Notice that everything inside the pair of parentheses is native Teradata SQL and that the join operation itself is running inside the database.

The example code you have shown in your question is NOT a complete, working example of a SAS/Teradata program. To better assist, you need to show the real program, including any library references. For example, suppose your real program looks like this:

proc sql;
   CREATE TABLE subset_data AS
   SELECT bigTable.id,
          SUM(bigTable.value) AS total
   FROM   TDATA.bigTable bigTable
   JOIN   TDATA.subset subset
   ON     subset.id = bigTable.id
   WHERE  bigTable.date BETWEEN a AND b
   GROUP BY bigTable.id
   ;
quit;

That would indicate a previously assigned LIBNAME statement through which SAS was connecting to Teradata. The syntax of that WHERE clause would be very relevant to whether SAS is even able to pass the complete query to Teradata. (Your example doesn't show what "a" and "b" refer to.) It is very possible that the only way SAS can perform the join is to drag both tables back into a local work session and perform the join on your SAS server.
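Applied to the code in the question, an explicit pass-through version might look like the sketch below. It is written under assumptions rather than taken from the real environment: `mydb` is treated as the Teradata database name (in the question it is a SAS libref), the connection values are placeholders, `total_sales` is an illustrative column name, the SAS date macro variables are replaced with Teradata date literals, and the `output` libref from the question is assumed to be assigned already. The subset of IDs is expressed as a subquery so that nothing has to be copied into SAS before the join.

```sas
proc sql;
   connect to teradata (user=userid password=password mode=teradata);

   create table output.sums as
   select * from connection to teradata (
      /* Everything in these parentheses runs inside Teradata */
      select sv.id
           , sum(sv.sales) as total_sales
      from mydb.sale_volume sv
      where sv.trans_date between date '2013-01-01' and date '2013-06-01'
        and sv.id in (
              /* IDs with activity in the categories of interest */
              select id
              from mydb.sale_volume
              where category in ('a', 'b', 'c')
                and trans_date between date '2013-01-01' and date '2013-06-01'
            )
      group by sv.id
      );

   disconnect from teradata;
quit;
```

Only the summarized rows, one per ID, travel back to SAS, instead of every transaction in the date range.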

One thing I can strongly suggest is that you try to convince your Teradata administrators to allow you to create "driver" tables in some utility database. The idea is that you would create a relatively small table inside Teradata containing the ID's you want to extract, then use that table to perform explicit joins. I'm sure you would need a bit more formal database training to do that (like how to define a proper index and how to "collect statistics"), but with that knowledge and ability, your work will just fly.
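If you do get permission to create tables, the driver-table idea could be sketched roughly as follows. Every name here is an assumption: `util_db` stands in for whatever utility database your administrators grant you, `driver_ids` and `total_sales` are illustrative names, `mydb` is treated as the Teradata database holding `sale_volume`, the `output` libref is assumed to be assigned, and the choice of primary index and statistics is only a starting point to discuss with them.

```sas
proc sql;
   connect to teradata (user=userid password=password mode=teradata);

   /* Build the small driver table inside Teradata */
   execute (
      create table util_db.driver_ids as (
         select distinct id
         from mydb.sale_volume
         where category in ('a', 'b', 'c')
           and trans_date between date '2013-01-01' and date '2013-06-01'
      ) with data primary index (id)
   ) by teradata;

   /* Give the optimizer statistics on the join column */
   execute (
      collect statistics on util_db.driver_ids column (id)
   ) by teradata;

   /* Join against the driver table entirely inside the database */
   create table output.sums as
   select * from connection to teradata (
      select sv.id
           , sum(sv.sales) as total_sales
      from mydb.sale_volume sv
      join util_db.driver_ids d
        on d.id = sv.id
      where sv.trans_date between date '2013-01-01' and date '2013-06-01'
      group by sv.id
      );

   disconnect from teradata;
quit;
```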

I could go on and on but I'll stop here. I use SAS with Teradata extensively every day against what I'm told is one of the largest Teradata environments on the planet. I enjoy programming in both.
