Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters")

Question

Splitting a JavaScript string into "characters" can be done trivially, but there are problems if you care about Unicode (and you should care about Unicode).

JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16 code units), so a single "character" cannot represent a Unicode character outside the BMP (Basic Multilingual Plane).

To deal with Unicode characters beyond the BMP, code must take into account "surrogate pairs", which JavaScript's index-based string operations do not do natively.
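To make the problem concrete, here is a minimal sketch of how a character outside the BMP behaves under the index-based string operations. The sample string and variable names are illustrative assumptions, not part of the original question.

```javascript
// U+1F4A9 lies outside the BMP, so UTF-16 stores it as two code units.
const text = "a\u{1F4A9}b";

const length = text.length;        // counts code units, not code points
const codeUnits = text.split("");  // splits between the two surrogates
const high = text.charCodeAt(1);   // high (lead) surrogate
const low = text.charCodeAt(2);    // low (trail) surrogate

console.log(length);                               // 4
console.log(codeUnits.length);                     // 4
console.log(high.toString(16), low.toString(16));  // d83d dca9
```

The string holds three code points but four code units, so `split("")` yields two lone surrogates in place of the single astral character.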

I'm looking for a way to split a JS string by codepoint, whether a codepoint requires one or two JavaScript "characters" (code units).

Depending on your needs, splitting by codepoint might not be enough, and you might want to split by "grapheme cluster", where a cluster is a base codepoint followed by all its non-spacing modifier codepoints, such as combining accents and diacritics.
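As a small illustration of the difference, the sketch below splits a base letter followed by a combining accent. The sample string is an assumption chosen for the example; the split uses the string iterator, which walks code points.

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme cluster,
// but two separate code points.
const accented = "e\u0301";

// Spreading a string iterates by code point, not by grapheme cluster.
const parts = [...accented];

console.log(parts.length);                          // 2
console.log(parts[1].codePointAt(0).toString(16));  // 301
```

A codepoint split therefore separates the accent from its base letter, which is the behaviour accepted for this question.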

For the purposes of this question, I don't need to split by grapheme cluster.

Recommended answer

@bobince's answer has (luckily) become a bit dated; you can now simply use

var chars = Array.from( text )

to obtain a list of single-codepoint strings that respects astral (supplementary-plane) Unicode characters, which UTF-16 encodes as surrogate pairs.
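A short sketch of this in use follows. The sample string and the helper `toCodePoints` are hypothetical names introduced for illustration; `Array.from`, spread syntax, `for...of` and `codePointAt` are standard ES2015 features that all iterate a string by code point.

```javascript
const text = "a\u{1F4A9}b";

// Both forms use the string iterator, which yields one code point at a time.
const viaArrayFrom = Array.from(text);
const viaSpread = [...text];

// Hypothetical helper: return numeric code points instead of strings.
function toCodePoints(str) {
  const out = [];
  for (const ch of str) {
    out.push(ch.codePointAt(0));
  }
  return out;
}

const codePoints = toCodePoints(text);

console.log(viaArrayFrom.length);                      // 3
console.log(codePoints.map((cp) => cp.toString(16)));  // [ '61', '1f4a9', '62' ]
```

Each array element is a string of one or two code units, so `viaArrayFrom.join("")` reproduces the original text.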
