How do I generate an (almost) unique hash ID for objects?


Problem description

How can I get an ID for my objects that makes it easy to distinguish them from one another?

class MyClass {
    private String s;
    private MySecondClass c;
    private Collection<someInterface> coll;
    // ... many more fields

    public Result calculate() {
        /* use all field values recursively to calculate the result */
        /* takes a considerable amount of time; already implemented */
        return result;
    }

    public String hash() {
        /* use all field values recursively to generate a unique identifier */
        // ?????
    }
}

calculate() usually takes ~40 seconds to complete. Thus, I do not want to call it multiple times.

MyClass objects are quite huge (~60 MB). The Result value of the calculation will only be ~100 KB.

Whenever I am about to run the calculation on an object, my program should first check whether it has already been done earlier with exactly the same values, recursively. If so, it should look up the result in (e.g.) a HashMap instead. In principle, the MyClass objects themselves could be used as keys, but the HashMap will hold 30-200 elements, and I obviously don't want to store all of those objects at full size. That's why I want to store 30-200 hash/result pairs instead.

So, I thought I'd generate an ID (hash) over all values inside my MyClass object. How do I do that? This way, I can use that very hash to look up the result. I am aware that a hash code like MD5 does not guarantee 100% uniqueness, because multiple objects might have the same hash. However, if I store (at most) 200 elements via MD5, the chance of the same hash occurring twice will be negligible, I think. There are 16^32 = 3.4e38 different hash codes possible. I'll be happy to hear anybody's comments about it, or see other approaches.
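The collision estimate above can be made concrete with the birthday bound. This is only a rough sketch, assuming the hash behaves like a uniformly random 128-bit value and that nobody is deliberately crafting colliding inputs. For n = 200 stored hashes:

```latex
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{128}}
  = \frac{200 \cdot 199}{2 \cdot 3.4 \times 10^{38}}
  \approx 5.8 \times 10^{-35}
```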

Once the hash is generated, I don't need that object anymore, just its respective result value.

Two separate objects with exactly the same values have to return the same hash code. This is much like the original hashCode(), except that I am also trying to maintain uniqueness. The probability of two different objects having the same hash code should be absolutely negligible.

I don't know how else to describe the problem. If further clarification is needed, please ask.

So how can I generate my MyClass.hash()?

The problem isn't really about how or where to store the hashes, because I don't even know how to generate an (almost) unique hash for an entire object, one that will always be the same for the same values.


Clarification:

When talking of size, I mean the serialized size on the hard drive.

I don't think putting the objects in a HashMap would decrease their size. That's why I want to store some hash String instead: HashMap<hashStringOfMyClassObject, resultValue>

When you put an object in a HashMap (either as a key or as a value), you don't create a copy of it. So storing 200 large objects in a HashMap consumes little more memory than the 200 objects themselves.

I do not store 200 large objects themselves. I only keep 200 different results (as values) which are small, and 200 respective hashCodes of MyClass objects which are also very small. The point of "hashing" the objects is to be able to work with the hash instead of with the object values themselves.
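The lookup described here could be sketched roughly as follows. `ResultCache` and `getOrCompute` are hypothetical names used only for illustration, and the hash string is assumed to come from whatever `hash()` implementation is eventually chosen.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper: keeps only small hash/result pairs, never the large objects. */
class ResultCache<R> {
    private final Map<String, R> resultsByHash = new HashMap<>();

    /** Returns the stored result for this hash, running the calculation only on a miss. */
    R getOrCompute(String hash, Supplier<R> calculation) {
        R cached = resultsByHash.get(hash);
        if (cached == null) {
            cached = calculation.get();      // the expensive ~40 s call happens here
            resultsByHash.put(hash, cached);
        }
        return cached;
    }

    /** Number of hash/result pairs currently stored. */
    int size() {
        return resultsByHash.size();
    }
}
```

With such a helper, a call site might look like `cache.getOrCompute(obj.hash(), obj::calculate)`, after which the large object itself is no longer needed.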

Solution

If you want to create a hash of all of your data, you'll need to make sure that you can get all the values in byte format from them.

To do this, it's best if you have control of all the classes (except the Java built-in ones, perhaps), so that you can add a method to them to do this.

Given that your object is very large, it will probably not be a good idea to just collect it into one big byte array recursively and then calculate the digest. It's probably better to create the MessageDigest object, and add a method such as:

void updateDigest( MessageDigest md );

to each of them. You can declare an interface for this if you wish. Each such method will collect the class's own data that participates in the "big calculation" and update the md object with that data. After updating all its own data, it should recursively call the updateDigest method of any classes in it that have that method defined.

For example, if you have a class with fields:

int myNumber;
String myString;
MyClass myObj;  // MyClass has the updateDigest method
Set<MyClass> otherObjects;

Then its updateDigest method should be doing something like this:

// Update the "plain" values that are in the current object
byte[] myStringBytes = myString.getBytes(StandardCharsets.UTF_8);
ByteBuffer buff = ByteBuffer.allocate(
                        Integer.SIZE / 8    // For myNumber
                        + Integer.SIZE / 8  // For myString's length
                        + myStringBytes.length
                  );
buff.putInt( myNumber );
buff.putInt( myStringBytes.length );
buff.put( myStringBytes );
buff.flip();
md.update(buff);

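// Recurse
myObj.updateDigest(md);

// Note: the iteration order must be the same for equal sets (for example,
// by using a sorted set); otherwise equal data can produce different digests.
for ( MyClass obj : otherObjects ) {
    obj.updateDigest(md);
}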

The reason I added the string's length (actually, its byte representation's length) to the digest is to avoid situations where you have two String fields:

String field1 = "ABCD";
String field2 = "EF";

If you just put their bytes directly into the digest one after the other, it will have the same effect on the digest as:

String field1 = "ABC";
String field2 = "DEF";

And this may cause an identical digest to be generated for two different sets of data. So adding the length will disambiguate it.
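The ambiguity is easy to reproduce. The following is a small sketch, not part of the original answer: `LengthPrefixDemo` and its method names are made up for illustration, and SHA-256 is simply an assumed choice of digest.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class LengthPrefixDemo {
    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Digests the strings back to back, with no length or separator. */
    static byte[] digestPlain(String... fields) {
        MessageDigest md = newDigest();
        for (String field : fields) {
            md.update(field.getBytes(StandardCharsets.UTF_8));
        }
        return md.digest();
    }

    /** Digests each string preceded by the length of its UTF-8 bytes. */
    static byte[] digestWithLength(String... fields) {
        MessageDigest md = newDigest();
        for (String field : fields) {
            byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
            md.update(ByteBuffer.allocate(Integer.BYTES).putInt(bytes.length).array());
            md.update(bytes);
        }
        return md.digest();
    }

    public static void main(String[] args) {
        // Same byte stream "ABCDEF" in both cases, so the digests are equal
        System.out.println(Arrays.equals(
                digestPlain("ABCD", "EF"), digestPlain("ABC", "DEF")));
        // The length prefixes (4, 2 versus 3, 3) make the byte streams differ
        System.out.println(Arrays.equals(
                digestWithLength("ABCD", "EF"), digestWithLength("ABC", "DEF")));
    }
}
```

Running `main` prints `true` for the plain variant and `false` for the length-prefixed one.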

I used a ByteBuffer because it's relatively convenient to add things to it like int and double.
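As a small illustration of that convenience, here is a sketch that packs an int and a double into bytes. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

class PrimitiveBytesDemo {
    /** Packs an int and a double into a fixed-size byte array (big-endian by default). */
    static byte[] pack(int count, double ratio) {
        ByteBuffer buff = ByteBuffer.allocate(Integer.BYTES + Double.BYTES);
        buff.putInt(count);     // 4 bytes
        buff.putDouble(ratio);  // 8 bytes
        return buff.array();
    }
}
```

One thing to keep in mind with floating-point fields is that 0.0 and -0.0 have different byte representations, so they would contribute differently to a digest even though they compare equal with `==`.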

If you have classes that you don't control and cannot add a method to, you'll have to be creative. After all, you do get the values from every such class for the calculation, so you may call the same methods and digest their results. Or you could digest their serialized form if they are serializable.
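For the serialized-form option, one possible sketch streams the object through a `DigestOutputStream`, so the serialized bytes never have to be held in memory at once. The class and method names are hypothetical and SHA-256 is an assumed choice. Note also that Java serialization is not guaranteed to produce identical bytes for objects that are merely `equals()` to each other (hash-based collections, for instance, may serialize their entries in different orders), so this only helps where the serialized form is stable.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SerializedDigestDemo {
    /** Feeds the Java-serialized form of obj into md without buffering it. */
    static void updateWithSerializedForm(MessageDigest md, Serializable obj) {
        // Sink that discards bytes; DigestOutputStream updates md as they pass through
        OutputStream discard = new OutputStream() {
            @Override public void write(int b) { }
            @Override public void write(byte[] b, int off, int len) { }
        };
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new DigestOutputStream(discard, md))) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience wrapper: digest of a single object's serialized form. */
    static byte[] digestOf(Serializable obj) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            updateWithSerializedForm(md, obj);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```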

So in your top-level class you'll create the md object using MessageDigest.getInstance("SHA") or whatever digest you wish to use.

MessageDigest md = null;
try {
    md = MessageDigest.getInstance("SHA");
} catch (NoSuchAlgorithmException e) {
    // Handle properly
}

// Call md.update with class's own data and recurse using
// updateDigest methods of internal objects

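// Compute the digest
byte[] result = md.digest();

// Convert to string to be able to use in a hash map.
// "%040x" pads to 40 hex digits, which matches a 160-bit digest such as SHA-1;
// a longer digest needs a wider format.
BigInteger mediator = new BigInteger(1, result);
String key = String.format("%040x", mediator);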

(You could actually use the BigInteger itself as the key).
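Putting the pieces together, a minimal end-to-end sketch might look like the following. Everything here is illustrative: `Digestable` and `Part` are hypothetical names, SHA-256 (with a 64-digit hex key) is an assumed choice, and the children are held in a `List` so that the iteration order is well defined.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical interface name; the answer above only specifies the method. */
interface Digestable {
    void updateDigest(MessageDigest md);
}

/** Hypothetical stand-in for a class taking part in the big calculation. */
class Part implements Digestable {
    private final int number;
    private final String text;
    private final List<Part> children;

    Part(int number, String text, List<Part> children) {
        this.number = number;
        this.text = text;
        this.children = children;
    }

    @Override
    public void updateDigest(MessageDigest md) {
        // Own "plain" values first, with lengths to keep the byte stream unambiguous
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buff = ByteBuffer.allocate(3 * Integer.BYTES + textBytes.length);
        buff.putInt(number);
        buff.putInt(textBytes.length);
        buff.put(textBytes);
        buff.putInt(children.size()); // child count, for the same reason as the string length
        buff.flip();
        md.update(buff);

        // Then recurse into nested objects
        for (Part child : children) {
            child.updateDigest(md);
        }
    }

    /** Returns the digest of this object graph as a fixed-width hex string. */
    String hash() {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
        updateDigest(md);
        return String.format("%064x", new BigInteger(1, md.digest()));
    }
}
```

Two separately constructed `Part` graphs with the same values yield the same key, which is the property the question asks for.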
