Can MongoDB store and manipulate strings of UTF-8 with code points outside the basic multilingual plane?

Question

In MongoDB 2.0.6, when I try to store or query documents that contain string fields whose values include characters outside the BMP, I get a raft of errors such as "Not proper UTF-16: 55357" or "buffer too small".

What settings, changes, or recommendations are there to permit storage and querying of multilingual strings in Mongo, particularly ones that include characters above 0xFFFF?

Thanks.

Recommended answer

There are several issues here:

1) Please be aware that MongoDB stores all documents using the BSON format. Also note that the BSON spec refers to a UTF-8 string encoding, not a UTF-16 encoding.

Reference: http://bsonspec.org/#/specification

2) All of the drivers, including the JavaScript driver in the mongo shell, should properly handle strings that are encoded as UTF-8. (If they don't, then it's a bug!) Many of the drivers happen to handle UTF-16 properly as well, although as far as I know, UTF-16 isn't officially supported.

3) When I tested this with the Python driver, MongoDB could successfully load and return a string value that contained a broken UTF-16 code pair. However, I couldn't load a broken code pair using the mongo shell, nor could I store a string containing a broken code pair in a JavaScript variable in the shell.

4) mapReduce() runs correctly on string data that contains a correct UTF-16 code pair, but it generates an error when run on string data that contains a broken code pair.

It appears that mapReduce() fails when MongoDB tries to convert the BSON to a JavaScript variable for use by the JavaScript engine.

5) I've filed Jira issue SERVER-6747 for this problem. Feel free to follow it and vote it up.
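
To make the terms in points 1 and 3 concrete, here is a minimal Python 3 sketch using only the standard library. It does not touch MongoDB or any driver, so it says nothing about how a particular driver behaves. The code point U+1F600 and the name `has_lone_surrogate` are assumptions chosen for illustration. It shows that a character above 0xFFFF takes four bytes in UTF-8 and two 16-bit units (a surrogate pair) in UTF-16, that 55357 from the error message is the decimal value of 0xD83D, a high surrogate, and that a "broken" pair (a surrogate on its own) cannot be encoded as strict UTF-8.

```python
# Illustration only: U+1F600 is an arbitrary code point above U+FFFF.
ch = "\U0001F600"

# BSON strings are UTF-8; a non-BMP code point takes four bytes there.
utf8 = ch.encode("utf-8")

# In UTF-16 the same code point becomes a surrogate pair of two 16-bit units.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

print(utf8.hex())  # f09f9880
print(units)       # [55357, 56832]; 55357 is 0xD83D, a high surrogate


def has_lone_surrogate(s: str) -> bool:
    """Return True if s contains a surrogate code point (U+D800..U+DFFF).

    A Python 3 str holds code points, not UTF-16 units, so a well-formed
    non-BMP character is a single code point. Any surrogate that is present
    cannot be encoded as strict UTF-8.
    """
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)


broken = "abc\ud83d"  # a high surrogate with no low surrogate after it

# Strict UTF-8 encoding rejects the broken string.
try:
    broken.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(has_lone_surrogate(ch))      # False
print(has_lone_surrogate(broken))  # True
print(strict_ok)                   # False
```

A check like `has_lone_surrogate` could be run on strings before they are written, so that broken pairs are caught in the application rather than later, when a JavaScript-based operation such as mapReduce() reads the data.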
