How to efficiently hash (SHA-256) data in Go where only the last few bytes change


Question

Assuming you have 80 bytes of data and only the last 4 bytes constantly change, how would you efficiently hash the total 80 bytes using Go? In essence, the first 76 bytes are the same, while the last 4 bytes keep changing. Ideally, you want to keep a copy of the hash digest for the first 76 bytes and just keep changing the last 4.

Solution

You can try the following examples on the Go Playground. Benchmark results are at the end.

Note: the implementations below are not safe for concurrent use; I intentionally made them this way to keep them simpler and faster.

Fastest when using only the public API (always hashes all input)

The general concept and interface for Go's hash algorithms is the hash.Hash interface. It does not allow you to save the state of the hasher or to return or rewind to a saved state. So using the public hash APIs of the Go standard library, you always have to calculate the hash from the start.
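
For reference, this is the hash.Hash interface as defined in the standard library's hash package; note that it has no method for saving or restoring state:

type Hash interface {
    // Write (via the embedded io.Writer interface) adds more data
    // to the running hash. It never returns an error.
    io.Writer

    // Sum appends the current hash to b and returns the resulting slice.
    // It does not change the underlying hash state.
    Sum(b []byte) []byte

    // Reset resets the Hash to its initial state.
    Reset()

    // Size returns the number of bytes Sum will return.
    Size() int

    // BlockSize returns the hash's underlying block size.
    BlockSize() int
}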

What the public API does offer is reusing an already constructed hasher to calculate the hash of a new input, using the Hash.Reset() method. This is nice because no (memory) allocations are needed to calculate multiple hash values. You can also take advantage of the optional slice that may be passed to Hash.Sum(), to which the current hash is appended. This is nice because no allocations are needed to receive the hash results either.

Here's an example that takes advantage of these:

type Cached1 struct {
    hasher hash.Hash
    result [sha256.Size]byte // pre-allocated array to receive the checksum
}

func NewCached1() *Cached1 {
    return &Cached1{hasher: sha256.New()}
}

func (c *Cached1) Sum(data []byte) []byte {
    c.hasher.Reset()     // reuse the hasher instead of allocating a new one
    c.hasher.Write(data) // hash all 80 bytes from scratch
    return c.hasher.Sum(c.result[:0]) // append the sum into the pre-allocated array
}

Test data

We'll use the following test data:

var fixed = bytes.Repeat([]byte{1}, 76)
var variantA = []byte{1, 1, 1, 1}
var variantB = []byte{2, 2, 2, 2}

var data = append(append([]byte{}, fixed...), variantA...)
var data2 = append(append([]byte{}, fixed...), variantB...)

var c1 = NewCached1()

First let's get authentic results (to verify that our hasher works correctly):

fmt.Printf("%x\n", sha256.Sum256(data))
fmt.Printf("%x\n", sha256.Sum256(data2))

Output:

fb8e69bdfa2ad15be7cc8a346b74e773d059f96cfc92da89e631895422fe966a
10ef52823dad5d1212e8ac83b54c001bfb9a03dc0c7c3c83246fb988aa788c0c

Now let's check our Cached1 hasher:

fmt.Printf("%x\n", c1.Sum(data))
fmt.Printf("%x\n", c1.Sum(data2))

Output is the same:

fb8e69bdfa2ad15be7cc8a346b74e773d059f96cfc92da89e631895422fe966a
10ef52823dad5d1212e8ac83b54c001bfb9a03dc0c7c3c83246fb988aa788c0c

Even faster but may break (in future Go releases): hashes only the last 4 bytes

Now let's see a less flexible solution which truly calculates the hash of the fixed first 76 bytes only once.

The hasher of the crypto/sha256 package is the unexported sha256.digest type (more precisely, a pointer to this type):

// digest represents the partial evaluation of a checksum.
type digest struct {
    h     [8]uint32
    x     [chunk]byte
    nx    int
    len   uint64
    is224 bool // mark if this digest is SHA-224
}

A value of the digest struct type basically holds the current state of the hasher (chunk is an unexported constant in the package, equal to 64, the SHA-256 block size).

What we may do is feed the hasher the fixed first 76 bytes and then save this struct value. When we need to calculate the hash of some 80 bytes of data where the first 76 are the same, we use this saved value as a starting point and then feed the varying last 4 bytes.

Note that it's enough to simply save this struct value, as it contains no pointers and no descriptor types like slices and maps. Otherwise we would also have to make a copy of those, but we're "lucky". So this solution would need adjustment if a future implementation of crypto/sha256 added, for example, a pointer or slice field.
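
To see why a plain assignment suffices, recall Go's copy semantics: assigning a struct copies all of its fields, and an array field (like x [chunk]byte above) is copied in full, whereas a slice field would only copy the slice header and share the backing storage. A quick self-contained illustration (the types withArray and withSlice are made up just for this demo):

package main

import "fmt"

type withArray struct{ buf [4]byte }
type withSlice struct{ buf []byte }

func main() {
    a := withArray{buf: [4]byte{1, 2, 3, 4}}
    b := a       // the array is copied in full: b is independent of a
    b.buf[0] = 9 // does not affect a
    fmt.Println(a.buf, b.buf) // [1 2 3 4] [9 2 3 4]

    s := withSlice{buf: []byte{1, 2, 3, 4}}
    t := s       // only the slice header is copied: backing array is shared
    t.buf[0] = 9 // also visible through s
    fmt.Println(s.buf, t.buf) // [9 2 3 4] [9 2 3 4]
}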

Since sha256.digest is unexported, we can only use reflection (the reflect package) to achieve our goals, which inherently adds some delay to the computation.

Example implementation that does this:

type Cached2 struct {
    origv   reflect.Value // saved digest state, captured after writing the fixed prefix
    hasherv reflect.Value // addressable value of the working digest
    hasher  hash.Hash     // the working digest, as a hash.Hash

    result [sha256.Size]byte // pre-allocated array to receive the checksum
}

func NewCached2(fixed []byte) *Cached2 {
    h := sha256.New()
    h.Write(fixed) // hash the fixed 76-byte prefix once

    c := &Cached2{origv: reflect.ValueOf(h).Elem()}
    // Allocate a new digest value of the same (unexported) type
    // that we can overwrite on every Sum call:
    hasherv := reflect.New(c.origv.Type())
    c.hasher = hasherv.Interface().(hash.Hash)
    c.hasherv = hasherv.Elem()

    return c
}

func (c *Cached2) Sum(data []byte) []byte {
    // Restore the state saved after hashing the fixed prefix:
    c.hasherv.Set(c.origv)

    c.hasher.Write(data) // hash only the varying last 4 bytes
    return c.hasher.Sum(c.result[:0])
}

Testing it:

var c2 = NewCached2(fixed)
fmt.Printf("%x\n", c2.Sum(variantA))
fmt.Printf("%x\n", c2.Sum(variantB))

Output is again the same:

fb8e69bdfa2ad15be7cc8a346b74e773d059f96cfc92da89e631895422fe966a
10ef52823dad5d1212e8ac83b54c001bfb9a03dc0c7c3c83246fb988aa788c0c

So it works.

The "ultimate", fastest solution

Cached2 could be faster if reflection were not involved. If we want an even faster solution, we can simply make a copy of the sha256.digest type and its methods into our package, so we can use it directly without having to resort to reflection.

If we do this, we will have access to the digest struct value, and we can simply make a copy of it like:

var d digest
// init d
saved := d

And restoring it is like:

d = saved

I simply "cloned" the crypto/sha256 package to my workspace, and changed / exported the digest type as Digest just for demonstration purposes. Then using this mysha256.Digest type I implemented Cached3 like this:

type Cached3 struct {
    orig   mysha256.Digest
    result [sha256.Size]byte
}

func NewCached3(fixed []byte) *Cached3 {
    var d mysha256.Digest
    d.Reset()
    d.Write(fixed)

    return &Cached3{orig: d}
}

func (c *Cached3) Sum(data []byte) []byte {
    // Make a copy of the fixed hash:
    d := c.orig
    d.Write(data)
    return d.Sum(c.result[:0])
}

Testing it:

var c3 = NewCached3(fixed)

fmt.Printf("%x\n", c3.Sum(variantA))
fmt.Printf("%x\n", c3.Sum(variantB))

Output again is the same. So this works too.

Benchmarks

We can benchmark performance with this code:

func BenchmarkCached1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        c1.Sum(data)
        c1.Sum(data2)
    }
}

func BenchmarkCached2(b *testing.B) {
    for i := 0; i < b.N; i++ {
        c2.Sum(variantA)
        c2.Sum(variantB)
    }
}

func BenchmarkCached3(b *testing.B) {
    for i := 0; i < b.N; i++ {
        c3.Sum(variantA)
        c3.Sum(variantB)
    }
}

Benchmark results (go test -bench . -benchmem):

BenchmarkCached1-4   1000000     1569 ns/op     0 B/op    0 allocs/op
BenchmarkCached2-4   2000000      926 ns/op     0 B/op    0 allocs/op
BenchmarkCached3-4   2000000      872 ns/op     0 B/op    0 allocs/op

Cached2 is approximately 41% faster than Cached1 ((1569-926)/1569 ≈ 0.41), which is quite noticeable and nice. Cached3 gives only a "little" performance boost compared to Cached2, another 6%. Cached3 is 44% faster than Cached1.

Also note that none of the solutions make any allocations, which is also nice.

Conclusion

For that extra 40% or 44%, I would probably not go for the Cached2 or Cached3 solutions. Of course, it really depends on how important performance is to you. If it is important, I think the Cached2 solution presents a fine compromise between minimal added complexity and a noticeable performance gain. It does pose a risk, as future Go implementations may break it; if that is a concern, Cached3 solves this by copying the current implementation (and also improves its performance a little).
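
As a related aside: since Go 1.10 the hash implementations in the standard library also implement encoding.BinaryMarshaler and encoding.BinaryUnmarshaler, which gives a supported, public way to save and restore hasher state. A minimal sketch of that approach (not benchmarked here; note that MarshalBinary allocates for the marshaled state):

package main

import (
    "bytes"
    "crypto/sha256"
    "encoding"
    "fmt"
)

func main() {
    fixed := bytes.Repeat([]byte{1}, 76)

    h := sha256.New()
    h.Write(fixed)

    // Save the hasher state after the fixed prefix (supported since Go 1.10):
    saved, err := h.(encoding.BinaryMarshaler).MarshalBinary()
    if err != nil {
        panic(err)
    }

    for _, variant := range [][]byte{{1, 1, 1, 1}, {2, 2, 2, 2}} {
        // Restore the saved state, then hash only the varying last 4 bytes:
        if err := h.(encoding.BinaryUnmarshaler).UnmarshalBinary(saved); err != nil {
            panic(err)
        }
        h.Write(variant)
        fmt.Printf("%x\n", h.Sum(nil))
    }
}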
