检测JSON数组中没有重复项的最快正确方法是什么? [英] What is the fastest correct way to detect that there are no duplicates in a JSON array?
问题描述
我需要检查在 serde_json :: Value
数组中所有项目是否唯一.由于此类型无法实现 Hash
,因此我提出了以下解决方案:
I need to check if all items are unique in an array of serde_json::Value
. Since this type does not implement Hash
I came up with the following solution:
use serde_json::{json, Value};
use std::collections::HashSet;
fn is_unique(items: &[Value]) -> bool {
let mut seen = HashSet::with_capacity(items.len());
for item in items.iter() {
if !seen.insert(item.to_string()) {
return false;
}
}
true
}
fn main() {
let value1 = json!([1, 2]);
assert!(is_unique(&value1.as_array().unwrap()));
let value2 = json!([1, 1]);
assert!(!is_unique(&value2.as_array().unwrap()));
}
我假设只有使用 preserve_order
功能构建了 serde_json
(每次都以相同顺序序列化对象)时,它才应该工作,但是我不是100%对此肯定.
I assume that it should only work if serde_json
is built with preserve_order
feature (to have objects serialized in the same order every time), but I am not 100% sure about it.
主要用法上下文:
JSON模式验证." uniqueItems "关键字实现.
JSON Schema validation. "uniqueItems" keyword implementation.
相关使用案例
重复数据删除JSON数组以优化JSON Schema推断.
Deduplication of JSON arrays to optimize JSON Schema inference on them.
例如,输入数据为 [1、2,{"foo":"bar"}]
.直接推断可能会输出以下内容:
For example, the input data is [1, 2, {"foo": "bar"}]
. A straightforward inference might output this:
{
"type": "array",
"items": {
"anyOf": [
{"type": "integer"},
{"type": "integer"},
{"type": "object", "required": ["foo"]}
]
}
}
items/anyOf
中的
值只能减少为两个值.
values in items/anyOf
can be reduced to only two values.
问题:检查任意JSON数组中是否没有重复项的最省时,最正确的方法是什么?
Question: What would be the most time-efficient and correct way to check that there are no duplicates in an arbitrary JSON array?
我使用了 serde_json ="1.0.48"
铁锈:1.42.0
推荐答案
将每个数组项转换为字符串是相当昂贵的–它要求每个项至少分配一个字符串,而且可能还要更多.确保映射(或JSON语言中的对象")以规范形式表示也很困难.
Converting each array item to a string is rather expensive – it requires at least one string allocation per item, and quite likely more than that. It's also difficult to make sure mappings (or "objects" in JSON language) are represented in a canonical form.
一种更快,更强大的替代方法是自己为 Value
实施 Hash
.您需要定义一个新类型包装器,因为您不能在外来类型上实现外来特征.这是一个简单的示例实现:
A faster and more robust alternative is to implement Hash
for Value
yourself. You need to define a newtype wrapper, since you can't implement a foreign trait on a foreign type. Here's a simple example implementation:
use serde_json::Value;
use std::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;
#[derive(PartialEq)]
struct HashValue<'a>(pub &'a Value);
impl Eq for HashValue<'_> {}
impl Hash for HashValue<'_> {
fn hash<H: Hasher>(&self, state: &mut H) {
use Value::*;
match self.0 {
Null => state.write_u32(3_221_225_473), // chosen randomly
Bool(ref b) => b.hash(state),
Number(ref n) => {
if let Some(x) = n.as_u64() {
x.hash(state);
} else if let Some(x) = n.as_i64() {
x.hash(state);
} else if let Some(x) = n.as_f64() {
// `f64` does not implement `Hash`. However, floats in JSON are guaranteed to be
// finite, so we can use the `Hash` implementation in the `ordered-float` crate.
ordered_float::NotNan::new(x).unwrap().hash(state);
}
}
String(ref s) => s.hash(state),
Array(ref v) => {
for x in v {
HashValue(x).hash(state);
}
}
Object(ref map) => {
let mut hash = 0;
for (k, v) in map {
// We have no way of building a new hasher of type `H`, so we
// hardcode using the default hasher of a hash map.
let mut item_hasher = DefaultHasher::new();
k.hash(&mut item_hasher);
HashValue(v).hash(&mut item_hasher);
hash ^= item_hasher.finish();
}
state.write_u64(hash);
}
}
}
}
None
的值是随机选择的,以使其不太可能与其他条目发生冲突.为了计算浮点数的哈希值,我使用了 ordered-float
条板箱.对于映射,代码为每个键/值对计算一个哈希,然后将这些哈希简单地进行XOR运算,这与顺序无关.不幸的是,我们需要对用于哈希映射条目的哈希器进行硬编码.我们可以通过定义自己的 Hash
特性版本来抽象出来,然后从我们的自定义 Hash
特性,但这会使代码复杂得多,因此除非您需要,否则我不会这样做.
The value for None
is chosen randomly to make it unlikely to collide with other entries. To calculate hashes for floating point numbers, I used the ordered-float
crate. For mappings, the code calculates a hash for each key/value pair and simply XORs these hashes together, which is order-independent. It's a bit unfortunate that we need to hardcode the hasher used for hashing the map entries. We could abstract that out by defining our own version of the Hash
trait, and then derive concrete implementations of std::hash::Hash
from our custom Hash
trait, but this complicates the code quite a bit, so I wouldn't do that unless you need to.
我们无法导出 Eq
,因为 Value
不实现 Eq
.但是,我认为这只是一个疏忽,所以我提出了添加 Eq
实施(PR已被接受,因此它将在将来的版本中发布).
We can't derive Eq
, since Value
does not implement Eq
. However, I believe this is just an oversight, so I filed an issue to add an Eq
implementation (which the PR has been accepted for, so it will land in some future release).
这篇关于检测JSON数组中没有重复项的最快正确方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!