Generating a unique id from a sorted set of strings

Problem description

I have an Array of Array of Strings i.e. Array[Array[String]]

The strings are folder names in a filesystem and about 100k unique folder names are possible.

So the data structure will look like:

Array[ <- Outer array, which is not sorted, 100 million in length
       Array [/ABC/DEF,/XYZ/YTR,.......] <- Each inner array is sorted on folder names
       Array [/CDE/FRT,/TUV/HYT,........] <- I want to generate a shorter unique id for each of these
    ]


For each array of folder names in the outer array, I want to generate a unique id. I know that simple hashing of the strings can lead to collisions and hence isn't safe. But I was wondering whether there is any way to exploit the fact that the inner array is sorted when designing the hashing algorithm. The id can be up to 500 characters long. Is there any Java/Scala library that does this? Assume I can't do a groupBy etc. on this dataset.

What I have tried:

Did some research on the internet.

Recommended answer

The index of every entry in the inner array is unique. If you concatenate it with the index in the outer array then your final id is globally unique. Or am I missing something?
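
The index idea above could be sketched roughly as follows. This is only an illustration: the class name `IndexIds`, the method names, and the `:` separator are all assumptions, not something given in the question. Note that such ids identify positions rather than contents, so two inner arrays holding identical folder names would still receive different ids.

```java
// Hypothetical helper: builds ids purely from positions in the nested array.
final class IndexIds {
    private IndexIds() {}

    // Id for a whole inner array: its position in the outer array.
    static String arrayId(int outerIndex) {
        if (outerIndex < 0) {
            throw new IllegalArgumentException("outerIndex must be non-negative");
        }
        return Integer.toString(outerIndex);
    }

    // Id for one folder name: outer position and inner position joined by ':'.
    // The separator keeps (1, 23) distinct from (12, 3).
    static String entryId(int outerIndex, int innerIndex) {
        if (innerIndex < 0) {
            throw new IllegalArgumentException("innerIndex must be non-negative");
        }
        return arrayId(outerIndex) + ":" + innerIndex;
    }

    public static void main(String[] args) {
        String[][] data = {
            {"/ABC/DEF", "/XYZ/YTR"},
            {"/CDE/FRT", "/TUV/HYT"}
        };
        for (int i = 0; i < data.length; i++) {
            for (int j = 0; j < data[i].length; j++) {
                System.out.println(entryId(i, j) + " -> " + data[i][j]);
            }
        }
    }
}
```

With 100 million outer arrays, the outer index needs at most 9 decimal digits, so these ids stay far below the 500-character limit mentioned in the question.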

By the way, a hash is a hash: collisions are inevitable. You need another technique if uniqueness is a requirement.
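
If the id has to depend on the contents rather than the position, one common compromise (not what the answer recommends, and not strictly unique) is a cryptographic digest of the sorted names. The sketch below uses the JDK's `MessageDigest` with SHA-256. The class name `FolderSetDigest` and the length-prefix encoding are assumptions made for illustration. Collisions remain theoretically possible, as the answer points out. They are merely very unlikely.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: content-derived id for one sorted inner array.
final class FolderSetDigest {
    private FolderSetDigest() {}

    // Returns a 64-character hex string; equal arrays always give equal ids.
    static String digestId(String[] sortedFolders) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String folder : sortedFolders) {
                byte[] bytes = folder.getBytes(StandardCharsets.UTF_8);
                // A length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
                md.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                md.update(bytes);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] folders = {"/ABC/DEF", "/XYZ/YTR"};
        System.out.println(digestId(folders));
    }
}
```

Because the inner arrays are already sorted, the same set of folder names always feeds the digest in the same order, so no extra normalisation step is needed before hashing.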

