UTF-8 in C++: quick & dirty tricks

Question

I am aware that there have been various questions about UTF-8, mainly about libraries to manipulate UTF-8 'string'-like objects.

However, I am working on an 'internationalized' project (a website, for which I code a C++ backend... don't ask) where, even though we deal with UTF-8, we don't actually need such libraries. Most of the time the plain std::string methods or STL algorithms are quite sufficient for our needs, and indeed this is the goal of using UTF-8 in the first place.

So, what I am looking for here is a compilation of the "Quick & Dirty" tricks that you know of related to UTF-8 stored as std::string (no const char*, I don't care about C-style code really, I've got better things to do than constantly worrying about my buffer size).

For example, here is a "Quick & Dirty" trick to obtain the number of characters (which is useful to know if it will fit in your display box):

#include <string>
#include <algorithm>

// Let's remember that in UTF-8 encoding, a character may be
// 1 byte: '0.......'
// 2 bytes: '110.....' '10......'
// 3 bytes: '1110....' '10......' '10......'
// 4 bytes: '11110...' '10......' '10......' '10......'
// Therefore '10......' is not the beginning of a character ;)

const unsigned char mask = 0xC0;
const unsigned char notUtf8Begin = 0x80;

struct Utf8Begin
{
  // True for every byte that is not a continuation byte ('10......')
  bool operator()(char c) const { return (c & mask) != notUtf8Begin; }
};

// Let's count
size_t countUtf8Characters(const std::string& s)
{
  return std::count_if(s.begin(), s.end(), Utf8Begin());
}

In fact, I have yet to encounter a use case where I need anything other than the number of characters that std::string or the STL algorithms don't already offer for free, since:

  • sorting works as expected
  • no part of a word can be confused as a word or part of another word
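
Both properties can be tried out with a small sketch. It assumes well-formed UTF-8 input and takes "as expected" to mean code point order, which is not the same thing as locale-aware collation. The helper names `sortUtf8` and `containsUtf8` are hypothetical and only serve the illustration. Since C++11, std::string compares its elements as unsigned char, so byte order and code point order coincide.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sorts UTF-8 strings with the plain std::string
// comparison. Bytes are compared as unsigned char, so the result is
// ordered by code point (assuming well-formed UTF-8).
std::vector<std::string> sortUtf8(std::vector<std::string> words)
{
  std::sort(words.begin(), words.end());
  return words;
}

// Hypothetical helper: byte-wise substring search. A well-formed encoded
// character can only match at a character boundary, because lead bytes
// and continuation bytes ('10......') use disjoint bit patterns.
bool containsUtf8(const std::string& haystack, const std::string& needle)
{
  return haystack.find(needle) != std::string::npos;
}
```

For instance, "z", "\xC3\xA9" (U+00E9) and "\xE2\x82\xAC" (U+20AC) sort in exactly that order, and searching for "\xC2\x82" (U+0082) inside "\xE2\x82\xAC" finds nothing even though the byte 0x82 occurs in both.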

I would like to know if you have other comparable tricks, both for counting and for other simple tasks.
I repeat, I know about ICU and Utf8-CPP, but I am not interested in them since I don't need a full-fledged treatment (and in fact I have never needed more than the count of characters).
I also repeat that I am not interested in treating char*'s, they are old-fashioned.

Answer

Well, this dirty trick will not work. First, what is the value of mask after this:

   const unsigned char mask = 0x11000000;
   const unsigned char notUtf8Begin = 0x10000000;

Perhaps you are mixing up hexadecimal and binary notation.
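
To make the difference concrete, here is a minimal sketch that puts the two quoted declarations next to the hexadecimal values 0xC0 and 0x80, which are the bit patterns 11000000 and 10000000. The names `truncatedMask`, `truncatedMarker`, `isUtf8Begin` and `brokenIsUtf8Begin` are made up for this illustration, and 8-bit bytes are assumed.

```cpp
// 0x11000000 is a hexadecimal literal (285212672 in decimal). Converting it
// to unsigned char keeps only the low 8 bits (assuming 8-bit bytes), and
// those bits are all zero. The same happens to 0x10000000.
const unsigned char truncatedMask = static_cast<unsigned char>(0x11000000);
const unsigned char truncatedMarker = static_cast<unsigned char>(0x10000000);

// The intended bit patterns, written in hexadecimal.
const unsigned char mask = 0xC0;          // binary 1100 0000
const unsigned char notUtf8Begin = 0x80;  // binary 1000 0000

// With the hexadecimal values: true unless c is a continuation byte.
bool isUtf8Begin(char c)
{
  return (static_cast<unsigned char>(c) & mask) != notUtf8Begin;
}

// With the truncated values: (c & 0) != 0 is false for every byte,
// so a count based on this predicate would always be zero.
bool brokenIsUtf8Begin(char c)
{
  return (c & truncatedMask) != truncatedMarker;
}
```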

Second, as you correctly say, in UTF-8 encoding a character may be several bytes long. std::count_if will iterate through all the bytes in a UTF-8 sequence. But what you actually need is to look at the leading byte of every character and skip the rest until the next character begins.

It will not be hard to implement a single loop that does the calculation and jumps forward, using a simple mask table for the leading bytes.

In the end you get the same O(n) for checking the characters, and it will work with every UTF-8 string.
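
One possible reading of that loop is sketched below. The 16-entry table indexed by the high nibble of the leading byte, and the names `kUtf8Length`, `utf8SequenceLength` and `countUtf8BySkipping`, are assumptions made for this example, not something the answer specifies. Stray continuation bytes are treated as one-byte characters so that the loop always advances.

```cpp
#include <cstddef>
#include <string>

// Hypothetical table: sequence length implied by the high nibble of a byte.
//   0x0-0x7  '0.......'  1 byte
//   0x8-0xB  '10......'  continuation byte, treated as 1 so the loop advances
//   0xC-0xD  '110.....'  2 bytes
//   0xE      '1110....'  3 bytes
//   0xF      '11110...'  4 bytes (0xF8-0xFF are not valid UTF-8 lead bytes
//                        but also land here)
static const unsigned char kUtf8Length[16] =
  { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };

std::size_t utf8SequenceLength(unsigned char lead)
{
  return kUtf8Length[lead >> 4];
}

// Counts characters by reading each leading byte and jumping over the
// bytes that follow it. Only s[i] with i < s.size() is ever read, so a
// truncated final sequence cannot cause an out-of-bounds access.
std::size_t countUtf8BySkipping(const std::string& s)
{
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < s.size())
  {
    i += utf8SequenceLength(static_cast<unsigned char>(s[i]));
    ++count;
  }
  return count;
}
```

On well-formed UTF-8 this returns the same number as counting every byte that is not a continuation byte; the two approaches can only disagree on malformed input.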
