检测JSON数组中没有重复项的最快正确方法是什么? [英] What is the fastest correct way to detect that there are no duplicates in a JSON array?

查看:104
本文介绍了检测JSON数组中没有重复项的最快正确方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要检查在 serde_json :: Value 数组中所有项目是否唯一.由于此类型无法实现 Hash ,因此我提出了以下解决方案:

I need to check if all items are unique in an array of serde_json::Value. Since this type does not implement Hash I came up with the following solution:

use serde_json::{json, Value};
use std::collections::HashSet;

fn is_unique(items: &[Value]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    for item in items.iter() {
        if !seen.insert(item.to_string()) {
            return false;
        }
    }
    true
}

fn main() {
    let value1 = json!([1, 2]);
    assert!(is_unique(&value1.as_array().unwrap()));
    let value2 = json!([1, 1]);
    assert!(!is_unique(&value2.as_array().unwrap()));
}

我假设只有使用 preserve_order 功能构建了 serde_json (每次都以相同顺序序列化对象)时,它才应该工作,但是我不是100%对此肯定.

I assume that it should only work if serde_json is built with preserve_order feature (to have objects serialized in the same order every time), but I am not 100% sure about it.

主要用法上下文:

JSON模式验证." uniqueItems "关键字实现.

JSON Schema validation. "uniqueItems" keyword implementation.

相关使用案例

重复数据删除JSON数组以优化JSON Schema推断.

Deduplication of JSON arrays to optimize JSON Schema inference on them.

例如,输入数据为 [1、2,{"foo":"bar"}] .直接推断可能会输出以下内容:

For example, the input data is [1, 2, {"foo": "bar"}]. A straightforward inference might output this:

{
    "type": "array", 
    "items": {
        "anyOf": [
            {"type": "integer"}, 
            {"type": "integer"},
            {"type": "object", "required": ["foo"]}
        ]
    }
}

items/anyOf 中的

值只能减少为两个值.

values in items/anyOf can be reduced to only two values.

问题:检查任意JSON数组中是否没有重复项的最省时,最正确的方法是什么?

Question: What would be the most time-efficient and correct way to check that there are no duplicates in an arbitrary JSON array?

我使用了 serde_json ="1.0.48"

铁锈:1.42.0

游乐场

推荐答案

将每个数组项转换为字符串是相当昂贵的–它要求每个项至少分配一个字符串,而且可能还要更多.确保映射(或JSON语言中的对象")以规范形式表示也很困难.

Converting each array item to a string is rather expensive – it requires at least one string allocation per item, and quite likely more than that. It's also difficult to make sure mappings (or "objects" in JSON language) are represented in a canonical form.

一种更快,更强大的替代方法是自己为 Value 实施 Hash .您需要定义一个新类型包装器,因为您不能在外来类型上实现外来特征.这是一个简单的示例实现:

A faster and more robust alternative is to implement Hash for Value yourself. You need to define a newtype wrapper, since you can't implement a foreign trait on a foreign type. Here's a simple example implementation:

use serde_json::Value;
use std::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;

#[derive(PartialEq)]
struct HashValue<'a>(pub &'a Value);

impl Eq for HashValue<'_> {}

impl Hash for HashValue<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        use Value::*;
        match self.0 {
            Null => state.write_u32(3_221_225_473), // chosen randomly
            Bool(ref b) => b.hash(state),
            Number(ref n) => {
                if let Some(x) = n.as_u64() {
                    x.hash(state);
                } else if let Some(x) = n.as_i64() {
                    x.hash(state);
                } else if let Some(x) = n.as_f64() {
                    // `f64` does not implement `Hash`. However, floats in JSON are guaranteed to be
                    // finite, so we can use the `Hash` implementation in the `ordered-float` crate.
                    ordered_float::NotNan::new(x).unwrap().hash(state);
                }
            }
            String(ref s) => s.hash(state),
            Array(ref v) => {
                for x in v {
                    HashValue(x).hash(state);
                }
            }
            Object(ref map) => {
                let mut hash = 0;
                for (k, v) in map {
                    // We have no way of building a new hasher of type `H`, so we
                    // hardcode using the default hasher of a hash map.
                    let mut item_hasher = DefaultHasher::new();
                    k.hash(&mut item_hasher);
                    HashValue(v).hash(&mut item_hasher);
                    hash ^= item_hasher.finish();
                }
                state.write_u64(hash);
            }
        }
    }
}

None 的值是随机选择的,以使其不太可能与其他条目发生冲突.为了计算浮点数的哈希值,我使用了 ordered-float 条板箱.对于映射,代码为每个键/值对计算一个哈希,然后将这些哈希简单地进行XOR运算,这与顺序无关.不幸的是,我们需要对用于哈希映射条目的哈希器进行硬编码.我们可以通过定义自己的 Hash 特性版本来抽象出来,然后从我们的自定义 Hash 特性,但这会使代码复杂得多,因此除非您需要,否则我不会这样做.

The value for None is chosen randomly to make it unlikely to collide with other entries. To calculate hashes for floating point numbers, I used the ordered-float crate. For mappings, the code calculates a hash for each key/value pair and simply XORs these hashes together, which is order-independent. It's a bit unfortunate that we need to hardcode the hasher used for hashing the map entries. We could abstract that out by defining our own version of the Hash trait, and then derive concrete implementations of std::hash::Hash from our custom Hash trait, but this complicates the code quite a bit, so I wouldn't do that unless you need to.

我们无法导出 Eq ,因为 Value 不实现 Eq .但是,我认为这只是一个疏忽,所以我提出了添加 Eq 实施(PR已被接受,因此它将在将来的版本中发布).

We can't derive Eq, since Value does not implement Eq. However, I believe this is just an oversight, so I filed an issue to add an Eq implementation (which the PR has been accepted for, so it will land in some future release).

这篇关于检测JSON数组中没有重复项的最快正确方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆