截断的MD5的ECDF图 [英] ECDF plot from a truncated MD5

查看:96
本文介绍了截断的MD5的ECDF图的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在此l 墨水中,它表示截断的MD5是均匀分布的.我想使用PySpark进行检查,并首先在Python中创建了1,000,000个UUID,如下所示.然后截断MD5中的前三个字符.但是我得到的图与均匀分布的累积分布函数不相似.我尝试使用UUID1和UUID4,结果相似.符合截断的MD5均匀分布的正确方法是什么?

In this link, it says that truncated MD5 is uniformly distributed. I wanted to check it using PySpark and I created 1,000,000 UUIDs in Python first as shown below. Then truncated the first three characters from MD5. But the plot I get is not similar to the cumulative distribution function of a uniform distribution. I tried with UUID1 and UUID4 and the results are similar. What is the right way of conforming the uniform distribution of truncated MD5?

import uuid
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
import pandas as pd
import pyspark.sql.functions as f
%matplotlib inline

### Generate 1,000,000 UUID1 

uuid1 = [str(uuid.uuid1()) for i in range(1000000)]  # make a UUID based on the host ID and current time
uuid1_df = pd.DataFrame({'uuid1':uuid1})
uuid1_spark_df =  spark.createDataFrame(uuid1_df)
uuid1_spark_df = uuid1_spark_df.withColumn('hash', f.md5(f.col('uuid1')))\
               .withColumn('truncated_hash3', f.substring(f.col('hash'), 1, 3))

count_by_truncated_hash3_uuid1 = uuid1_spark_df.groupBy('truncated_hash3').count()

uuid1_count_list = [row[1] for row in count_by_truncated_hash3_uuid1.collect()]
ecdf = ECDF(np.array(uuid1_count_list))
plt.figure(figsize = (14, 8))
plt.plot(ecdf.x,ecdf.y)
plt.show()

我添加了直方图.如下所示,它看起来更像是正态分布.

I added the histogram. As you can see below, it looks more like normal distribution.

  plt.figure(figsize = (14, 8))
  plt.hist(uuid1_count_list)
  plt.title('Histogram of counts in each truncated hash')
  plt.show()

推荐答案

以下是演示此问题的快捷方法:

Here is a quick-and-dirty way to demonstrate this:

import hashlib
import matplotlib.pyplot as plt
import numpy as np
import random

def random_string(n):
    """Returns a uniformly distributed random string of length n."""
    return ''.join(chr(random.randint(0, 255)) for _ in range(n))

# Generate 100K random strings
data = [random_string(10) for _ in range(100000)]
# Compute MD5 hashes
md5s = [hashlib.md5(d.encode()).digest() for d in data]
# Truncate each MD5 to the first three characters and convert to int
truncated_md5s = [md5[0] * 0x10000 + md5[1] * 0x100 + md5[2] for md5 in md5s]

# (Rather crudely) compute and plot the ECDF    
hist = np.histogram(truncated_md5s, bins=1000)
plt.plot(hist[1], np.cumsum([0] + list(hist[0])))

这篇关于截断的MD5的ECDF图的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆