How to detect "​" (combination of Unicode) in a C++ string


Problem description

I am trying to detect certain combinations of Unicode characters (such as ​) in order to clean up a string. A single Unicode character is detected, but a combination of them is not.

I use these strings to build an HTML page from another HTML page that needs cleaning up. I only want to clean strings that consist of this kind of Unicode character, which is not even visible when the HTML page is shown in a browser.

Below is the sample code:

void detect_Unicode(string& str) {
    if (!str.empty() && str.find_first_not_of(" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") == string::npos)
        str.assign(" ");
}

Input strings:

1. " ​    ​ " ;
2. "are   there is something    ​ combination    ​"  
3. " Â Â "   
4. "​    ​" 
5. "Â Â â â" 

Expected output:

1. " "  
2. "are   there is something    ​ combination    ​"   
3. " "  
4. " "  
5. " "

Please let me know other ways too.

Recommended answer

OK, following on from the comments above, I think it's highly likely that the input string is in UTF-8 (after all, in an HTML context, what else would it be?).

On that basis, I humbly submit this:

#include <string>
#include <codecvt>
#include <locale>

// Convert a wide string to a UTF-8 encoded narrow string.
std::string narrow (const std::wstring& ws)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (ws);
}

// Convert a UTF-8 encoded narrow string to a wide string.
std::wstring widen (const std::string& s)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (s);
}

// Returns " " when s is empty or contains any character outside the set
// below; otherwise returns s unchanged.
std::string detect_Unicode (const std::string& s)
{ 
    std::wstring ws = widen (s);
    if (ws.empty() || ws.find_first_not_of (L" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") != std::wstring::npos)
        return " ";
    return s;
}

#include <iostream>

int main ()
{
    std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n\n");
    std::cout << "0.\t\"" << detect_Unicode (u8"abcde") << "\"\n";
    std::cout << "1.\t\"" << detect_Unicode (u8" ​    ​ ") << "\"\n";
    std::cout << "2.\t\"" << detect_Unicode (u8"are   there is something    ​ combination    ​") << "\"\n";
    std::cout << "3.\t\"" << detect_Unicode (u8" Â Â ") << "\"\n";
    std::cout << "4.\t\"" << detect_Unicode (u8"​    ​") << "\"\n";
    std::cout << "5.\t\"" << detect_Unicode (u8"Â Â â â") << "\"\n";
}

Output:

  Â â € ‹

0.  " "
1.  " ​    ​ "
2.  " "
3.  " Â Â "
4.  "​    ​"
5.  "Â Â â â"

Now this is not the output the OP expects, but I think that's simply because the logic (as opposed to the implementation) of detect_Unicode() looks flawed. The point here is that converting the input string to a wide string means that you can use standard basic_string operations on it reliably, because there are no multibyte issues now.

An alternative, slightly radical, implementation of detect_Unicode() might be:

for (auto wide_char : ws)
{
    if (wide_char > 0xff)
        return " ";
}
return s;

But really, now you have a wide string to hand in detect_Unicode, anything is possible, so go wild OP.

Other notes:


  • std::wstring_convert and std::codecvt_utf8 (the <codecvt> header) are deprecated in C++17, but since there is no other obvious choice in the standard library you might as well run with them. You can always change the implementations of narrow and widen if it comes to it.
  • Depending on the platform, std::wstring might not be the best choice (wchar_t is 16 bits wide on Windows, for example), but it's probably fine. You could also look at std::u16string and std::u32string.
