C++ UTF-8 lightweight & permissive code?


Question


    Does anyone know of a more permissively licensed (MIT / public domain) version of this:

    http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html

    (a 'drop-in' replacement for std::string that's UTF-8 aware)

    It is lightweight and does everything I need and more (I doubt I'll even use the UTF-XX conversions).

    I really don't want to be carrying ICU around with me.

    Answer

    1. std::string is fine for UTF-8 storage.
    2. If you need to analyze the text itself, UTF-8 awareness alone will not help you much, because too many things in Unicode do not work on a per-code-point basis.

    Take a look at the Boost.Locale library (it uses ICU under the hood).

    It is not lightweight, but it allows you to handle Unicode correctly and it uses std::string as storage.

    If you expect to find a lightweight Unicode-aware library to deal with strings, you will not find one, because Unicode is not lightweight. Even relatively "simple" operations such as upper-case and lower-case conversion or Unicode normalization require complex algorithms and access to the Unicode database.

    If you need the ability to iterate over code points (which, BTW, are not characters), take a look at http://utfcpp.sourceforge.net/
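To make "iterating over code points" concrete, here is a minimal hand-rolled sketch. It is not utfcpp's API; the function name `decode_utf8` is made up for illustration, and the code assumes the input is already well-formed UTF-8 (a library such as utfcpp would be the safer choice for untrusted input).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 std::string into code points.
// Assumption: the input is well-formed UTF-8. Nothing is validated here,
// so overlong forms, surrogates and stray continuation bytes are not rejected.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The storage stays a plain std::string; only the traversal knows about UTF-8. The result is still a sequence of code points, not of user-perceived characters.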

    Answer to comment:

    1) Find file formats for files included by me

    std::string::find is perfectly fine for this.
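As a rough illustration of why byte-wise search is enough here: in well-formed UTF-8, a lead byte never looks like a continuation byte, so a match found by std::string::find always starts on a code point boundary. The helper names below (`contains`, `extension_of`) are hypothetical, and the sketch does not account for Unicode normalization.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte-wise substring test on UTF-8 data.
// Assumption: both strings are well-formed UTF-8 in the same normalization form.
bool contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// Hypothetical helper: return the text after the last '.' in a file name,
// or an empty string if the name contains no '.'.
// '.' is ASCII, so it can never occur inside a multi-byte UTF-8 sequence.
std::string extension_of(const std::string& filename) {
    std::size_t dot = filename.rfind('.');
    if (dot == std::string::npos) return std::string();
    return filename.substr(dot + 1);
}
```

Non-ASCII file names pass through untouched, since the search only ever looks for ASCII bytes or whole encoded sequences.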

    2) Line break detection

    This is not a simple issue. Have you ever tried to find a line break in Chinese or Japanese text? Probably not, because spaces do not separate words there. So line-break detection is a hard job. (I don't think even glib does this correctly; I think only pango has something like that.)

    And of course Boost.Locale does this, and does it correctly.

    If you need to do this for European languages only, just search for spaces or punctuation marks, so std::string::find is more than fine.
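For the European-languages-only case, a sketch along these lines may be enough. The function name `last_break_before` and the particular set of break characters are assumptions for illustration; this is not a Unicode line-breaking algorithm and will not work for Chinese or Japanese text.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: find the last ASCII space or punctuation mark at or
// before byte offset `limit`. Returns std::string::npos if there is none.
// Note: `limit` is a byte offset, not a column or character count.
// ASCII bytes never occur inside a multi-byte UTF-8 sequence, so the
// returned position never falls inside an encoded code point.
std::size_t last_break_before(const std::string& text, std::size_t limit) {
    return text.find_last_of(" \t,.;:!?-", limit);
}
```

A caller would typically cut the line just after the returned index and continue from there.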

    3) Character (or now, code point) counting. Looking at utfcpp, thanks.

    Characters are not code points. For example, the Hebrew word Shalom, "שָלוֹם", consists of 4 characters and 6 Unicode code points, where two of the code points are vowel marks. The same applies to European languages, where a single character can be represented with two code points; for example, "ü" can be represented as "u" followed by a combining "¨", which is two code points.

    So if you are aware of these issues, utfcpp will be fine; otherwise you will not find anything simpler.
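The gap between bytes, code points and characters can be seen with a small counting sketch. The function name `count_code_points` is made up here, and the code assumes well-formed UTF-8; it counts every byte that is not a continuation byte (those have the form 10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string.
// Assumption: the input is well-formed UTF-8 (no validation is done).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        // Continuation bytes look like 10xxxxxx; everything else starts a code point.
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

With this, the precomposed "ü" (bytes C3 BC) counts as 1 code point, while "u" followed by the combining diaeresis (bytes 75 CC 88) counts as 2, although both display as one character. Counting user-perceived characters needs grapheme cluster rules, which is where a library such as Boost.Locale or ICU comes in.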
