Is there a Rust library with a UTF-16 string type? (intended for writing a JavaScript interpreter)

Question

For most programs, it's better to use UTF-8 internally and, when necessary, convert to other encodings. But in my case, I want to write a JavaScript interpreter, and it's much simpler to store only UTF-16 strings (or arrays of u16), because:

  1. I need to handle 16-bit code units individually (which is usually a bad idea, but JavaScript requires it). This means I need the type to implement Index<usize>.

  2. I need to store unpaired surrogates, that is, malformed UTF-16 strings (because of this, ECMAScript strings are technically defined as arrays of u16 that usually represent UTF-16 strings). There is an encoding aptly named WTF-8 that stores unpaired surrogates in UTF-8, but I don't want to use something like that.

  3. I want to have the usual owned / borrowed types (like String / str and CString / CStr) with all or most of the usual methods. I don't want to roll my own string type (if I can avoid it).

Also, my strings will always be immutable, behind an Rc, and referenced from a data structure containing weak pointers to all strings (implementing string interning). This might be relevant: perhaps it would be better to have Rc<Utf16Str> as the string type, where Utf16Str is the unsized string type (which can be defined as just struct Utf16Str([u16])). That would avoid following two pointers when accessing the string, but I don't know how to instantiate an Rc with an unsized type.

Given the above requirements, merely using rust-encoding is very inconvenient, because it treats all non-UTF-8 encodings as vectors of u8.

Also, I'm not sure whether the std library can help me here at all. I looked into Utf16Units and it's just an iterator, not a proper string type. (Also, I know OsString doesn't help: I'm not on Windows, and it doesn't even implement Index<usize>.)

Recommended answer

Since there are multiple questions here, I’ll try to respond to them separately:

I think the types you want are [u16] and Vec<u16>.
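As a rough illustration of that suggestion, the sketch below treats Vec<u16> as the owned type and &[u16] as the borrowed one. The helper name char_code_at is hypothetical (chosen to echo JavaScript's charCodeAt); everything else is plain standard-library slice behaviour.

```rust
/// Hypothetical helper: returns the code unit at `index`, similar in spirit
/// to JavaScript's `charCodeAt`, or `None` when the index is out of range.
fn char_code_at(s: &[u16], index: usize) -> Option<u16> {
    s.get(index).copied()
}

fn main() {
    // "hello" with an accented e, written out as UTF-16 code units (owned).
    let owned: Vec<u16> = vec![0x0068, 0x00E9, 0x006C, 0x006C, 0x006F];
    // Borrowed view of the same data, playing the role of `&str`.
    let borrowed: &[u16] = &owned;

    // Index<usize> and range slicing are already implemented for slices.
    assert_eq!(borrowed[1], 0x00E9);
    assert_eq!(&borrowed[2..4], &[0x006Cu16, 0x006Cu16]);

    // Out-of-range access can be handled without panicking.
    assert_eq!(char_code_at(borrowed, 10), None);
    println!("length in code units: {}", borrowed.len());
}
```

Slices also provide comparison, hashing, iteration, and searching helpers such as starts_with, which covers a fair share of what a string type would offer.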

The default string types str and String are wrappers around [u8] and Vec<u8> (not technically true of str, which is primitive, but close enough). The point of having separate types is to maintain the invariant that the underlying bytes are well-formed UTF-8.

Similarly, you could have Utf16Str and Utf16String types wrapping [u16] and Vec<u16> that preserve a well-formedness invariant for UTF-16, namely that there is no unpaired surrogate.
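To make the idea of such an invariant concrete, here is a minimal sketch of what a validating Utf16String could look like. The type and its methods are hypothetical, not an existing library; the check relies on std::char::decode_utf16, which yields an error for every unpaired surrogate.

```rust
use std::char::decode_utf16;

/// Hypothetical owned string type that guarantees well-formed UTF-16.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Utf16String(Vec<u16>);

impl Utf16String {
    /// Accepts the code units only if they contain no unpaired surrogate;
    /// otherwise hands the rejected buffer back to the caller.
    pub fn new(units: Vec<u16>) -> Result<Self, Vec<u16>> {
        let well_formed = decode_utf16(units.iter().copied()).all(|r| r.is_ok());
        if well_formed {
            Ok(Utf16String(units))
        } else {
            Err(units)
        }
    }

    /// Borrowed access to the underlying code units.
    pub fn as_units(&self) -> &[u16] {
        &self.0
    }
}

fn main() {
    let ok = Utf16String::new("hi".encode_utf16().collect());
    assert!(ok.is_ok());

    // 0xD800 is a high surrogate with nothing following it.
    let bad = Utf16String::new(vec![0x0061, 0xD800]);
    assert!(bad.is_err());
}
```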

But as you note in your question, JavaScript strings can contain unpaired surrogates. That’s because JavaScript strings are not strictly UTF-16; they really are arbitrary sequences of u16 with no additional invariant.

With no invariant to maintain, I don’t think wrapper types are all that useful.
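A small sketch of what this means in practice: a plain Vec<u16> happily stores a lone surrogate, while conversion to a Rust String has to either fail or substitute U+FFFD. The function name to_display_string is hypothetical; String::from_utf16 and String::from_utf16_lossy are standard-library functions.

```rust
/// Hypothetical helper for printing or debugging: converts arbitrary code
/// units to a Rust `String`, replacing unpaired surrogates with U+FFFD.
fn to_display_string(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

fn main() {
    // The JavaScript string "a\uD83D": the letter 'a' plus a lone high surrogate.
    let js_string: Vec<u16> = vec![0x0061, 0xD83D];

    // The raw sequence is stored and measured without complaint.
    assert_eq!(js_string.len(), 2);

    // A strict conversion rejects it, because it is not well-formed UTF-16.
    assert!(String::from_utf16(&js_string).is_err());

    // A lossy conversion substitutes the replacement character.
    assert_eq!(to_display_string(&js_string), "a\u{FFFD}");
}
```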

rust-encoding supports UTF-16-LE and UTF-16-BE based on bytes. You probably want UTF-16 based on u16s instead.

std::str::Utf16Units is indeed not a string type. It is an iterator returned by the str::utf16_units() method that converts a Rust string to UTF-16 (not LE or BE). You can use .collect() on that iterator to get a Vec<u16>, for example. (In later Rust releases the method is named str::encode_utf16() and the iterator type is std::str::EncodeUtf16.)
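For illustration, here is that conversion on current stable Rust, where the method is called str::encode_utf16(). The wrapper function to_utf16 is a hypothetical convenience, not part of the standard library.

```rust
/// Hypothetical convenience wrapper: encodes a Rust string as UTF-16 code
/// units in native u16 form (no byte order involved).
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    // U+1F600 lies outside the Basic Multilingual Plane, so it becomes a
    // surrogate pair: 0xD83D followed by 0xDE00.
    let units = to_utf16("a\u{1F600}");
    assert_eq!(units, vec![0x0061u16, 0xD83D, 0xDE00]);
    println!("{} code units", units.len());
}
```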

At the time of writing, the only safe way to obtain Rc<[u16]> is to coerce from Rc<[u16; N]>, whose size is known at compile time, which is obviously impractical. I wouldn’t recommend the unsafe way: allocating memory, writing a header to it that hopefully matches the memory representation of RcBox, and transmuting.
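Newer Rust releases (1.21 and later) add From<Vec<T>> and From<&[T]> implementations for Rc<[T]>, so a dynamically sized Rc<[u16]> can now be built safely from a vector. A minimal sketch, assuming such a toolchain; the function name share is hypothetical:

```rust
use std::rc::Rc;

/// Hypothetical helper: moves the code units into a single reference-counted
/// allocation, so that reading the string follows only one pointer.
fn share(units: Vec<u16>) -> Rc<[u16]> {
    Rc::from(units)
}

fn main() {
    let s: Rc<[u16]> = share("abc".encode_utf16().collect());
    let t = Rc::clone(&s);

    // Indexing goes through Deref to the underlying [u16].
    assert_eq!(s[0], 0x0061);
    // Both handles point at the same allocation.
    assert_eq!(Rc::strong_count(&s), 2);
    assert!(Rc::ptr_eq(&s, &t));
}
```

The conversion copies the elements into a fresh allocation that also holds the reference counts, which is a one-time cost for immutable strings.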

If you’re gonna do it with raw memory allocation, better use your own type so that you can use its private fields. Tendril does this: https://github.com/servo/tendril/blob/master/src/buf32.rs

Or, if you’re willing to take the cost of the extra indirection, Rc<Vec<u16>> is safe and much easier.
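Combined with the interning scheme described in the question, the Rc<Vec<u16>> option might look roughly like the sketch below. The Interner type and its intern method are hypothetical names, and the design is deliberately naive: the map key duplicates the string contents, and entries whose strings have been dropped are only replaced, never swept.

```rust
use std::collections::HashMap;
use std::rc::{Rc, Weak};

/// Hypothetical interner: maps string contents to weak pointers, so the
/// table itself never keeps a string alive.
#[derive(Default)]
struct Interner {
    table: HashMap<Vec<u16>, Weak<Vec<u16>>>,
}

impl Interner {
    /// Returns the shared string for `units`, allocating it on first use.
    fn intern(&mut self, units: &[u16]) -> Rc<Vec<u16>> {
        // Reuse the existing allocation if some Rc to it is still alive.
        if let Some(existing) = self.table.get(units).and_then(Weak::upgrade) {
            return existing;
        }
        // Otherwise allocate a new string and remember a weak pointer to it.
        let fresh = Rc::new(units.to_vec());
        self.table.insert(units.to_vec(), Rc::downgrade(&fresh));
        fresh
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern(&[0x0068, 0x0069]);
    let b = interner.intern(&[0x0068, 0x0069]);

    // Equal contents share one allocation, so comparison is a pointer check.
    assert!(Rc::ptr_eq(&a, &b));
    assert_eq!(a[0], 0x0068);
}
```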
