SAS Hash Tables: Is there a way to find/join on different keys or have optional keys

Question

I frequently work with data for which the keys are not perfect, and I need to join data from a different source. I want to continue using hash objects for the speed advantage; however, when I am using a lot of data I can run into crashes (memory constraints).

A simplified overview: I have two different keys, both unique but not present for every record. We will call them Key1 and Key2.

My current solution, which is not very elegant (but it works), is to do the following:

if _N_ = 1 then do;
   declare hash h1(Dataset:"DataSet1");
                h1.DefineKey("key1");
                h1.DefineData("Value");
                h1.DefineDone();
   declare hash h2(Dataset:"DataSet1");
                h2.DefineKey("key2");
                h2.DefineData("Value");
                h2.DefineDone();
end;

set DataSet2;

rc = h1.find();
if rc NE 0 then do;
    rc = h2.find();
end;

So I have exactly the same dataset in two hash tables, each with a different key defined. If the first key is not found, I try to find the second key.

Does anyone know of a way to make this more efficient/easier to read/less memory intensive?

Apologies if this seems a bad way to accomplish the task. I absolutely welcome criticism so I can learn!

Thanks in advance,

Adam.

Recommended answer

I have a fairly similar problem that I approached slightly differently.

First: all of what Stu says is good to keep in mind, regardless of the issue.

If, though, you are in a situation where you can't really reduce the character variable sizes (remember, all numerics are 8 bytes in RAM no matter what their length in the dataset, so don't try to shrink them for this reason), you can approach it this way.

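  1. Build a hash table with key1 as the key, and key2 as data along with your actual data. Make sure key1 is the "better" key, the one that is more fully populated. Rename key2 to some other variable name so that you don't overwrite your real key2.
  2. Search on key1. If key1 is found, great! Move on.
  3. If key1 is missing, use a hiter object (hash iterator) to iterate over all of the records, searching for your key2.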
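
A minimal sketch of those three steps, not tested against real data. It borrows DataSet1 (lookup), DataSet2 (master) and Value from the question; lookup_key2, found and want are names made up for illustration. It assumes key1, key2 and Value have matching types in both tables and that DataSet2 has no Value column of its own. The multidata option is my assumption too: it keeps lookup rows whose key1 is missing, which would otherwise collapse into a single hash entry.

```sas
data want;
  /* Declare the hash's data variables in the PDV without reading any rows. */
  if 0 then set DataSet1(keep=key2 Value rename=(key2=lookup_key2));

  if _N_ = 1 then do;
    /* Step 1: key1 is the hash key; key2 rides along as data under another name. */
    declare hash h(dataset: "DataSet1(rename=(key2=lookup_key2))", multidata: "y");
    h.defineKey("key1");
    h.defineData("lookup_key2", "Value");
    h.defineDone();
    declare hiter hi("h");
  end;

  set DataSet2;
  /* Clear values retained from the previous row. */
  call missing(lookup_key2, Value);
  found = 0;

  /* Step 2: direct lookup on key1 when it is populated. */
  if not missing(key1) then do;
    if h.find() = 0 then found = 1;
  end;

  /* Step 3: otherwise scan every hash item, comparing the stored key2. */
  if not found and not missing(key2) then do;
    rc = hi.first();
    do while (rc = 0 and not found);
      if lookup_key2 = key2 then found = 1;
      else rc = hi.next();
    end;
    /* A failed scan leaves the last item's Value in the PDV, so reset it. */
    if not found then call missing(Value);
  end;

  drop rc found lookup_key2;
run;
```

Only one copy of the lookup sits in memory, at the cost of a full scan for every row that falls through to key2.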

This is not very efficient if key2 is used a lot. Step 3 also might be better done in a different way than using a hiter: you could do a keyed set or something else for those records, for example. In my particular case, both the table and the lookup were missing key1, so it was possible to simply iterate over the much smaller subset missing key1. If that's not true in your case, and your master table is fully populated for both keys, then this is going to be a lot slower.

The other thing I'd consider is abandoning hash tables and using a keyed set, or a format, or something else that doesn't use RAM.
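
For the keyed set option, a rough sketch might look like the following. It reuses DataSet1, DataSet2 and Value from the question and assumes DataSet1 lives in WORK, has no indexes yet, and that DataSet2 has no Value column; found and want are illustrative names. Treat it as a starting point rather than a tuned solution.

```sas
/* One-time setup: simple indexes so SET ... KEY= can read rows directly. */
proc datasets library=work nolist;
  modify DataSet1;
  index create key1;
  index create key2;
quit;

data want;
  set DataSet2;
  found = 0;

  /* Try key1 first, but only when it is populated. */
  if not missing(key1) then do;
    set DataSet1(keep=key1 Value) key=key1 / unique;
    if _IORC_ = 0 then found = 1;
  end;

  /* Fall back to key2. */
  if not found and not missing(key2) then do;
    set DataSet1(keep=key2 Value) key=key2 / unique;
    if _IORC_ = 0 then found = 1;
  end;

  /* Value is retained across rows, so clear it when nothing matched. */
  if not found then call missing(Value);

  /* A failed keyed read sets _ERROR_; reset it to avoid log dumps. */
  _ERROR_ = 0;
  drop found;
run;
```

The lookup stays on disk and is read through the index, so memory use is small, but each lookup is an I/O operation and will usually be slower than a hash find.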

Or split your dataset:

data haskey1 nokey1;
  set yourdata;
  if missing(key1) then output nokey1;
  else output haskey1;
run;

Then two data steps, one with a hash with key1 and one with a hash with key2, then combine the two back together.
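
Roughly, and assuming the lookup table is the question's DataSet1 with a Value column (joined1, joined2 and want are made-up names), the two passes and the recombination could be sketched as:

```sas
/* Pass 1: rows that have key1, looked up by key1. */
data joined1;
  if 0 then set DataSet1(keep=Value);   /* declare Value in the PDV */
  if _N_ = 1 then do;
    declare hash h1(dataset: "DataSet1(where=(not missing(key1)))");
    h1.defineKey("key1");
    h1.defineData("Value");
    h1.defineDone();
  end;
  set haskey1;
  /* Value is retained across rows, so clear it on a miss. */
  if h1.find() ne 0 then call missing(Value);
run;

/* Pass 2: rows without key1, looked up by key2. */
data joined2;
  if 0 then set DataSet1(keep=Value);
  if _N_ = 1 then do;
    declare hash h2(dataset: "DataSet1(where=(not missing(key2)))");
    h2.defineKey("key2");
    h2.defineData("Value");
    h2.defineDone();
  end;
  set nokey1;
  if h2.find() ne 0 then call missing(Value);
run;

/* Stack the two results back together (original row order is not preserved). */
data want;
  set joined1 joined2;
run;
```

Only one hash is in memory at a time. Note that, unlike the two-hash version in the question, a row whose key1 is present but not found in the lookup does not fall back to key2 here.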

Which of these is the most efficient depends heavily on your dataset sizes (both master and lookup) and on the missingness of key1.
