SAS Hash Tables: Is there a way to find/join on different keys or have optional keys


Problem description

I frequently work with data whose keys are imperfect, and I need to join in data from a different source. I want to keep using hash objects for the speed advantage, but when I am using a lot of data I can run into crashes (memory constraints).

As a simplified overview: I have 2 different keys, each of which is unique but not present on every record. We will call them Key1 and Key2.

My current solution, which is not very elegant (but it works), is to do the following:

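data want;
   /* Define Value in the PDV so DefineData can bind to it (this SET never executes). */
   /* Assumes DataSet2 does not already carry a variable named Value.                */
   if 0 then set DataSet1(keep=Value);

   if _N_ = 1 then do;
      /* The same lookup data is loaded twice, once per key. */
      declare hash h1(Dataset:"DataSet1");
                   h1.DefineKey("key1");
                   h1.DefineData("Value");
                   h1.DefineDone();
      declare hash h2(Dataset:"DataSet1");
                   h2.DefineKey("key2");
                   h2.DefineData("Value");
                   h2.DefineDone();
   end;

   set DataSet2;

   /* Try key1 first; fall back to key2 if key1 is not found. */
   rc = h1.find();
   if rc NE 0 then do;
       rc = h2.find();
   end;

   /* Value is retained because it comes from a SET statement, so clear it on a miss. */
   if rc NE 0 then call missing(Value);
run;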

So I have exactly the same dataset in two hash tables, but with 2 different keys defined. If the first key is not found, I then try to find the second key.

Does anyone know of a way to make this more efficient/easier to read/less memory intensive?

Apologies if this seems a bad way to accomplish the task. I absolutely welcome criticism so I can learn!

Thanks in advance,

Adam.

Recommended answer

I have a fairly similar problem that I approached slightly differently.

First: all of what Stu says is good to keep in mind, regardless of the issue.

If, though, you are in a situation where you can't really reduce the character variable size (remember, all numerics take 8 bytes in RAM no matter what length they have in the dataset, so don't try to shrink them for this reason), you can approach it this way.

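  1. Build a hash table with key1 as the key, and with key2 as data alongside your actual data. Make sure key1 is the "better" key - the one that is more fully populated. Rename key2 to some other variable name so that it does not overwrite your real key2.
  2. Search on key1. If key1 is found, great! Move on.
  3. If key1 is missing, use a hiter object (hash iterator) to iterate over all of the records, searching for your key2.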

This is not very efficient if key2 is used a lot. Step 3 might also be better done some way other than with a hiter - you could do a keyed set or something else for those records, for example. In my particular case, both the table and the lookup were missing key1, so it was possible to simply iterate over the much smaller subset missing key1. If that's not true in your case, and your master table is fully populated for both keys, then this is going to be a lot slower.
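The three steps above could be sketched roughly as follows. This is only an illustration under some assumptions, not the code I actually ran: it reuses the DataSet1 (lookup) and DataSet2 (table being enriched) names from the question, lk_key2 and found are made-up names, and it assumes multidata:"y" so that lookup rows with a missing key1 are not collapsed into one hash entry.

```sas
/* Sketch only. lk_key2 and found are hypothetical names for illustration. */
data want;
  /* Define host variables for the hash data items; this SET never executes. */
  if 0 then set DataSet1(keep=key2 Value rename=(key2=lk_key2));

  if _n_ = 1 then do;
    /* Step 1: one hash keyed on key1, carrying key2 (renamed) as data.      */
    /* multidata:"y" is an assumption so rows with missing key1 all load.    */
    declare hash h(dataset: "DataSet1(rename=(key2=lk_key2))", multidata: "y");
    h.defineKey("key1");
    h.defineData("lk_key2", "Value");
    h.defineDone();
    declare hiter hi("h");
  end;

  set DataSet2;

  /* Step 2: direct lookup on key1, skipped when key1 itself is missing. */
  if not missing(key1) then rc = h.find();
  else rc = -1;

  /* Step 3: fall back to scanning the hash for a matching key2. */
  if rc ne 0 then do;
    found = 0;
    if not missing(key2) then do;
      rc = hi.first();
      do while (rc = 0 and not found);
        if lk_key2 = key2 then found = 1;
        else rc = hi.next();
      end;
    end;
    /* Value is retained across iterations, so clear it when nothing matched. */
    if not found then call missing(Value);
  end;

  drop rc found lk_key2;
run;
```

Only one copy of the lookup sits in memory, at the cost of a linear scan for every record that falls through to key2.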

The other thing I'd consider is abandoning hash tables and using a keyed set, or a format, or something else that doesn't use RAM.
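As a rough sketch of the keyed-set idea, assuming the DataSet1/DataSet2 names from the question and that you are allowed to add indexes to the lookup table in the WORK library (the NOMISS option is my assumption for leaving the missing key values out of each unique index):

```sas
/* Sketch only: adds two indexes to DataSet1 in place. */
proc datasets library=work nolist;
  modify DataSet1;
  index create key1 / unique nomiss;
  index create key2 / unique nomiss;
quit;

data want;
  set DataSet2;

  /* Indexed lookup on key1; _IORC_ is 0 when a match is read. */
  set DataSet1(keep=key1 Value) key=key1 / unique;
  if _iorc_ ne 0 then do;
    _error_ = 0;                       /* a failed KEY= read sets _ERROR_ */

    /* Fall back to an indexed lookup on key2. */
    set DataSet1(keep=key2 Value) key=key2 / unique;
    if _iorc_ ne 0 then do;
      _error_ = 0;
      call missing(Value);             /* no match on either key */
    end;
  end;
run;
```

The lookups are served through the index instead of an in-memory table, so this trades speed for a much smaller memory footprint.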

Or split your dataset:

data haskey1 nokey1;
  set yourdata;
  if missing(key1) then output nokey1;
  else output haskey1;
run;

Then run two data steps, one with a hash keyed on key1 and one with a hash keyed on key2, and combine the two results back together.
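A minimal sketch of those two steps, assuming the lookup table is the DataSet1 from the question and that haskey1 and nokey1 come from the split above (part1, part2 and want are made-up output names):

```sas
/* Records that have key1: look up on key1 only. */
data part1;
  if 0 then set DataSet1(keep=Value);      /* define Value; never executes */
  if _n_ = 1 then do;
    declare hash h1(dataset: "DataSet1(where=(not missing(key1)))");
    h1.defineKey("key1");
    h1.defineData("Value");
    h1.defineDone();
  end;
  set haskey1;
  if h1.find() ne 0 then call missing(Value);
run;

/* Records without key1: look up on key2 only. */
data part2;
  if 0 then set DataSet1(keep=Value);
  if _n_ = 1 then do;
    declare hash h2(dataset: "DataSet1(where=(not missing(key2)))");
    h2.defineKey("key2");
    h2.defineData("Value");
    h2.defineDone();
  end;
  set nokey1;
  if h2.find() ne 0 then call missing(Value);
run;

/* Put the two halves back together. */
data want;
  set part1 part2;
run;
```

Each step holds only one hash in memory at a time. Note that, unlike the code in the question, a record whose key1 is present but not found in the lookup does not fall back to key2 here.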

Which of these is the most efficient depends heavily on your dataset sizes (both master and lookup) and on the missingness of key1.
