Using a HashSet to canonicalize objects in Rust

Question

As an educational exercise, I'm looking at porting cvs-fast-export to Rust.

Its basic mode of operation is to parse a number of CVS master files into an intermediate form, and then to analyse that intermediate form with the goal of transforming it into a git fast-export stream.

One of the things done during parsing is to convert common parts of the intermediate form into a canonical representation. A motivating example is commit authors. A CVS repository may have hundreds of thousands of individual file commits, but probably fewer than a thousand authors. So an interning table is used when parsing: you pass in the author as you parse it from the file, and the table gives you a pointer to a canonical version, creating a new one if it hasn't seen that author before. (I've heard this called atomizing or interning too.) This pointer then gets stored on the intermediate object.

My first attempt to do something similar in Rust used a HashSet as the interning table. Note that this uses CVS version numbers rather than authors; a version number is just a sequence of numbers such as 1.2.3.4, represented as a Vec<u16>.

use std::collections::HashSet;
use std::hash::Hash;

#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Vec<u16>);

fn intern<T:Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> &T {
    let dupe = item.clone();
    if !set.contains(&item) {
        set.insert(item);
    }
    set.get(&dupe).unwrap()
}

fn main() {
    let mut set: HashSet<CvsNumber> = HashSet::new();
    let c1 = CvsNumber(vec![1, 2]);
    let c2 = intern(&mut set, c1);
    let c3 = CvsNumber(vec![1, 2]);
    let c4 = intern(&mut set, c3);
}

This fails with error[E0499]: cannot borrow 'set' as mutable more than once at a time. That is fair enough: HashSet doesn't guarantee that references to its keys stay valid if you add more items after you have obtained a reference. The C version is careful to guarantee this. To get this guarantee, I think the HashSet should be over Box<T>. However, I can't explain the lifetimes for this to the borrow checker.

The ownership model I am going for here is that the interning table owns the canonical versions of the data and hands out references. The references should be valid as long as the interning table exists. We should be able to add new things to the interning table without invalidating the old references. I think the root of my problem is that I'm confused about how to write the interface for this contract in a way that is consistent with the Rust ownership model.

The solutions I can see with my limited Rust knowledge are:


  1. Do two passes: build a HashSet on the first pass, then freeze it and use references on the second pass. This means additional temporary storage (sometimes substantial).
  2. Unsafe
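
To make option 1 concrete, here is a minimal sketch of the two-pass idea. It assumes the raw parsed values can be kept around (or re-read) between passes; the helper names build_table and resolve are hypothetical and not taken from cvs-fast-export.

```rust
use std::collections::HashSet;

#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Vec<u16>);

/// Pass 1 (hypothetical helper): collect every distinct value.
/// This is the only place where the set is mutated.
fn build_table(raw: &[Vec<u16>]) -> HashSet<CvsNumber> {
    raw.iter().cloned().map(CvsNumber).collect()
}

/// Pass 2 (hypothetical helper): the table is now only borrowed immutably,
/// so any number of references into it can coexist.
fn resolve<'a>(table: &'a HashSet<CvsNumber>, raw: &[Vec<u16>]) -> Vec<&'a CvsNumber> {
    raw.iter()
        .map(|v| table.get(&CvsNumber(v.clone())).expect("value was seen in pass 1"))
        .collect()
}

fn main() {
    // Stand-in for the parsed input; this is the extra temporary storage.
    let raw: Vec<Vec<u16>> = vec![vec![1, 2], vec![1, 3], vec![1, 2]];

    let table = build_table(&raw);
    let refs = resolve(&table, &raw);

    // Two distinct version numbers, and equal inputs share one canonical object.
    assert_eq!(table.len(), 2);
    assert!(std::ptr::eq(refs[0], refs[2]));
    println!("{} distinct values for {} parsed items", table.len(), refs.len());
}
```

The borrow checker accepts this because nothing is inserted once references exist, which is exactly the "freeze" step; the cost is holding the raw input until the second pass.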

Can anyone do better?

Recommended answer

I would be wary of using unsafe here.

While right now it does not cause an issue, should someone decide in the future to change the use of HashSet to include some pruning (for example, to only ever keep a hundred authors in there), then unsafe will bite you sternly.

In the absence of a strong performance reason, I would simply use an Rc<XXX>. You can alias it easily enough: type InternedXXX = Rc<XXX>;.

use std::collections::HashSet;
use std::hash::Hash;
use std::rc::Rc;

#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Rc<Vec<u16>>);

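// Returns an owned handle instead of a reference, so no borrow of `set`
// outlives the call.
fn intern<T: Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> T {
    if !set.contains(&item) {
        // First sighting: store a clone (cheap when T wraps an Rc) and
        // hand the original back.
        let dupe = item.clone();
        set.insert(dupe);
        item
    } else {
        // Already interned: return a clone of the canonical entry.
        set.get(&item).unwrap().clone()
    }
}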

fn main() {
    let mut set: HashSet<CvsNumber> = HashSet::new();
    let c1 = CvsNumber(Rc::new(vec![1, 2]));
    let c2 = intern(&mut set, c1);
    let c3 = CvsNumber(Rc::new(vec![1, 2]));
    let c4 = intern(&mut set, c3);
}
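
As a variation on the code above (a sketch, not the answer's own method), the Rc can wrap the whole value so that the set stores Rc<T> directly. The alias InternedNumber and the function intern_rc are hypothetical names. The sketch relies on Rc<T> implementing Borrow<T>, which lets the set be probed with a plain &T before anything is allocated. It also illustrates the pruning point: handles that were already handed out stay valid after the table is cleared.

```rust
use std::collections::HashSet;
use std::hash::Hash;
use std::rc::Rc;

#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Vec<u16>);

// Hypothetical alias, following the InternedXXX naming suggested above.
type InternedNumber = Rc<CvsNumber>;

/// Returns the canonical shared handle for `item`, inserting it if unseen.
fn intern_rc<T: Eq + Hash>(set: &mut HashSet<Rc<T>>, item: T) -> Rc<T> {
    // `Rc<T>: Borrow<T>`, so the lookup takes `&T` and needs no allocation.
    if let Some(existing) = set.get(&item) {
        return Rc::clone(existing);
    }
    let handle = Rc::new(item);
    set.insert(Rc::clone(&handle));
    handle
}

fn main() {
    let mut set: HashSet<InternedNumber> = HashSet::new();
    let a = intern_rc(&mut set, CvsNumber(vec![1, 2]));
    let b = intern_rc(&mut set, CvsNumber(vec![1, 2]));
    let c = intern_rc(&mut set, CvsNumber(vec![1, 3]));

    // Equal inputs share one allocation; different inputs do not.
    assert!(Rc::ptr_eq(&a, &b));
    assert!(!Rc::ptr_eq(&a, &c));
    assert_eq!(set.len(), 2);

    // Pruning the table does not invalidate handles already handed out.
    set.clear();
    assert_eq!(*a, CvsNumber(vec![1, 2]));
    assert_eq!(Rc::strong_count(&a), 2); // held by `a` and `b` only
}
```

Rc is single-threaded; if the intermediate form has to cross threads, Arc would be the analogous choice, at the cost of atomic reference counting.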
