Is there a Rust library with a UTF-16 string type? (intended for writing a JavaScript interpreter)
Question
For most programs, it's better to use UTF-8 internally and convert to other encodings when necessary. But in my case, I want to write a JavaScript interpreter, and it's much simpler to store only UTF-16 strings (or arrays of `u16`), because:
I need to deal with 16-bit code units individually (this is usually a bad idea, but JavaScript requires it). This means I need the type to implement `Index<usize>`.
I need to store unpaired surrogates, that is, malformed UTF-16 strings (because of this, ECMAScript strings are technically defined as arrays of `u16` that usually represent UTF-16 strings). There is an encoding aptly named WTF-8 that stores unpaired surrogates in UTF-8, but I don't want to use something like that.
I want the usual owned/borrowed pair of types (like `String`/`str` and `CString`/`CStr`) with all or most of the usual methods. I don't want to roll my own string type if I can avoid it.
Also, my strings will always be immutable, behind an `Rc`, and referred to from a data structure containing weak pointers to all strings (implementing string interning). This might be relevant: perhaps it would be better to have `Rc<Utf16Str>` as the string type, where `Utf16Str` is the unsized string type (which can be defined as just `struct Utf16Str([u16])`). That would avoid following two pointers when accessing the string, but I don't know how to instantiate an `Rc` with an unsized type.
Given the above requirements, merely using rust-encoding is very inconvenient, because it treats all non-UTF-8 encodings as vectors of `u8`.
Also, I'm not sure whether the std library can help me here at all. I looked into `Utf16Units`, and it's just an iterator, not a proper string type. (Also, I know `OsString` doesn't help: I'm not on Windows, and it doesn't even implement `Index<usize>`.)
Recommended answer
Since there are multiple questions here, I'll try to respond to each separately:
I think the types you want are `[u16]` and `Vec<u16>`.
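As a minimal sketch of why plain `[u16]` and `Vec<u16>` already cover the stated requirements, the example below stores a lone surrogate and indexes individual code units. The helper name `lone_surrogate_example` is made up for illustration; the conversions used are the standard library's `String::from_utf16` and `String::from_utf16_lossy`.

```rust
/// Hypothetical helper: the code units of the JavaScript string "a\uD83D",
/// an ASCII letter followed by a lone high surrogate.
fn lone_surrogate_example() -> Vec<u16> {
    vec![0x0061, 0xD83D]
}

fn main() {
    let units = lone_surrogate_example();
    let slice: &[u16] = &units;

    // Slices implement Index<usize>, so code units can be read one by one,
    // much like JavaScript's charCodeAt.
    assert_eq!(slice[1], 0xD83D);
    assert_eq!(slice.len(), 2);

    // Strict conversion to a Rust String fails on the unpaired surrogate...
    assert!(String::from_utf16(slice).is_err());
    // ...while the lossy conversion substitutes U+FFFD for it.
    assert_eq!(String::from_utf16_lossy(slice), "a\u{FFFD}");
}
```

The owned/borrowed split comes for free: `Vec<u16>` plays the role of `String`, and `&[u16]` the role of `&str`.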
The default string types `str` and `String` are wrappers around `[u8]` and `Vec<u8>` (not technically true of `str`, which is primitive, but close enough). The point of having separate types is to maintain the invariant that the underlying bytes are well-formed UTF-8.
Similarly, you could have `Utf16Str` and `Utf16String` types wrapping `[u16]` and `Vec<u16>` that preserve a well-formed UTF-16 invariant, namely that there are no unpaired surrogates.
But as you note in your question, JavaScript strings can contain unpaired surrogates. That's because JavaScript strings are not strictly UTF-16: they really are arbitrary sequences of `u16` with no additional invariant.
With no invariant to maintain, I don't think wrapper types are all that useful.
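To make the contrast concrete, here is a rough sketch of what an invariant-preserving wrapper might look like. The `Utf16String` type and its `new` constructor are hypothetical and not taken from any library; the check uses the standard library's `char::decode_utf16`. Such a type would reject exactly the strings a JavaScript interpreter has to accept.

```rust
use std::char::decode_utf16;

/// Hypothetical owned wrapper that only admits well-formed UTF-16.
#[derive(Debug, PartialEq)]
struct Utf16String(Vec<u16>);

impl Utf16String {
    /// Returns `None` if `units` contains an unpaired surrogate.
    fn new(units: Vec<u16>) -> Option<Utf16String> {
        // decode_utf16 yields Err for every unpaired surrogate it meets.
        let well_formed = decode_utf16(units.iter().copied()).all(|r| r.is_ok());
        if well_formed {
            Some(Utf16String(units))
        } else {
            None
        }
    }
}

fn main() {
    // U+1F600 encoded as a proper surrogate pair is accepted.
    assert!(Utf16String::new(vec![0xD83D, 0xDE00]).is_some());
    // A lone high surrogate, which is legal in a JavaScript string, is rejected.
    assert!(Utf16String::new(vec![0x0061, 0xD83D]).is_none());
}
```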
rust-encoding supports UTF-16-LE and UTF-16-BE based on bytes. You probably want UTF-16 based on `u16`s instead.
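If byte-oriented UTF-16-LE data does show up at the boundary (for example when reading a file), it can be turned into `u16` code units with the standard library alone. The sketch below assumes little-endian input; `units_from_le_bytes` is a hypothetical helper name.

```rust
/// Hypothetical helper: decodes UTF-16-LE bytes into code units.
/// Returns `None` if the byte length is odd.
fn units_from_le_bytes(bytes: &[u8]) -> Option<Vec<u16>> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect(),
    )
}

fn main() {
    // "a" followed by a lone high surrogate, as little-endian bytes.
    let bytes = [0x61, 0x00, 0x3D, 0xD8];
    assert_eq!(units_from_le_bytes(&bytes), Some(vec![0x0061, 0xD83D]));
    // An odd number of bytes cannot be a sequence of 16-bit units.
    assert_eq!(units_from_le_bytes(&bytes[..3]), None);
}
```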
`std::str::Utf16Units` is indeed not a string type. It is an iterator returned by the `str::utf16_units()` method, which converts a Rust string to UTF-16 (code units, not LE or BE bytes). For example, you can call `.collect()` on that iterator to get a `Vec<u16>`.
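In later Rust releases the method is named `str::encode_utf16()` (stable since Rust 1.8); the idea is unchanged. A short sketch, with `to_utf16` as a hypothetical helper name:

```rust
/// Hypothetical helper: converts a Rust string slice to UTF-16 code units.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    // U+1F600 lies outside the BMP, so it becomes a surrogate pair.
    let units = to_utf16("a\u{1F600}");
    assert_eq!(units, vec![0x0061, 0xD83D, 0xDE00]);
}
```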
The only safe way to obtain `Rc<[u16]>` is to coerce from `Rc<[u16; N]>`, whose size is known at compile time, which is obviously impractical. I wouldn't recommend the unsafe way: allocating memory, writing a header to it that hopefully matches the memory representation of `RcBox`, and transmuting.
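That restriction describes Rust as it was when this answer was written. Later standard library releases added safe conversions into `Rc<[T]>`, including `From<Vec<T>>`, so on a current toolchain something like the following sketch should work. `shared_units` is a hypothetical helper name, and note that the conversion copies the elements into a new reference-counted allocation.

```rust
use std::rc::Rc;

/// Hypothetical helper: moves code units into a single Rc allocation.
fn shared_units(units: Vec<u16>) -> Rc<[u16]> {
    // Uses the standard `impl From<Vec<T>> for Rc<[T]>`.
    Rc::from(units)
}

fn main() {
    let s: Rc<[u16]> = shared_units(vec![0x0061, 0xD83D]);
    let t = Rc::clone(&s);

    // Indexing goes through one pointer: the Rc points directly at the units.
    assert_eq!(s[1], 0xD83D);
    assert_eq!(Rc::strong_count(&t), 2);
}
```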
If you're going to do it with raw memory allocation, it's better to use your own type so that you can use its private fields. Tendril does this: https://github.com/servo/tendril/blob/master/src/buf32.rs
Or, if you're willing to pay the cost of the extra indirection, `Rc<Vec<u16>>` is safe and much easier.
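For the interning setup described in the question, a rough sketch built on `Rc<Vec<u16>>` could look like the following. The `Interner` type and its `intern` method are hypothetical, the table stores a second copy of each string as its key for simplicity, and dead entries are never cleaned up; treat it as an illustration of the weak-pointer idea rather than a tuned design.

```rust
use std::collections::HashMap;
use std::rc::{Rc, Weak};

/// Hypothetical interner: maps code units to weak pointers to shared strings.
#[derive(Default)]
struct Interner {
    table: HashMap<Vec<u16>, Weak<Vec<u16>>>,
}

impl Interner {
    /// Returns the shared string for `units`, allocating it on first use.
    fn intern(&mut self, units: &[u16]) -> Rc<Vec<u16>> {
        // Reuse the existing allocation if some Rc to it is still alive.
        if let Some(existing) = self.table.get(units).and_then(|w| w.upgrade()) {
            return existing;
        }
        let fresh = Rc::new(units.to_vec());
        self.table.insert(units.to_vec(), Rc::downgrade(&fresh));
        fresh
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern(&[0x0061, 0xD83D]);
    let b = interner.intern(&[0x0061, 0xD83D]);
    // Both handles point at the same allocation.
    assert!(Rc::ptr_eq(&a, &b));
}
```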