Ignore byte-order marks in C++, reading from a stream


Problem description



I have a function to read the value of one variable (integer, double, or boolean) on a single line in an ifstream:

template <typename Type>
void readFromFile (ifstream &in, Type &val)
{
  string str;
  getline (in, str);
  stringstream ss(str);
  ss >> val;
}

However, it fails on text files created with editors inserting a BOM (byte order mark) at the beginning of the first line, which unfortunately includes {Note,Word}pad. How can I modify this function to ignore the byte-order mark if present at the beginning of str?
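To make the failure concrete, here is a minimal sketch of what happens to the first line of such a file. The helper name `parseInt` is hypothetical; it only mirrors the `stringstream` step of `readFromFile` for an `int`, and it assumes the file is UTF-8, so the BOM appears as the three bytes EF BB BF.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper mirroring the stringstream step of readFromFile.
// Returns true only if the extraction into val succeeded.
inline bool parseInt(const std::string& line, int& val) {
    std::stringstream ss(line);
    ss >> val;
    return !ss.fail();
}

// A clean line parses, but the same line prefixed with the UTF-8
// encoding of U+FEFF does not: the first byte (0xEF) is neither
// whitespace nor part of a number, so the extraction sets failbit.
//
//   int v = 0;
//   parseInt("42", v);                   // succeeds, v == 42
//   parseInt("\xEF\xBB\xBF" "42", v);    // fails
```

The string literal is split in two so that the `\xBF` escape does not swallow the following digits.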

Solution

(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should simply be avoided elsewhere)

You could open the file as a UTF-8 file and then check whether the first character is U+FEFF. You can do this by opening a normal char-based fstream and then using wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t, so the following uses UTF-16 in wchar_t.

// Needs <fstream>, <istream>, <locale> and <codecvt>.
std::fstream fs(filename);
std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> wb(fs.rdbuf());
std::wistream is(&wb);
// If you don't do this on the stack, remember to destroy the objects in reverse order of creation: is, then wb, then fs.
std::wistream::int_type ch = is.get();
const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
if (ZERO_WIDTH_NO_BREAK_SPACE != ch)
    is.putback(ch);

// Now the stream can be passed around and used without worrying about the extra character in the stream.

// readFromStream is a variant of readFromFile that accepts a std::wistream (not shown).
int i;
readFromStream<int>(is, i);

Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because ignoring U+FEFF should only be done if it's the very first character in the whole file, if at all. It shouldn't be done anywhere else.

On the other hand, if you're happy using a char-based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:

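// Needs <fstream> and <iostream>.
std::fstream fs(filename);
char a, b, c;
a = fs.get();
b = fs.get();
c = fs.get();
if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
    // No signature: rewind. clear() resets eofbit/failbit if the file has fewer than three bytes.
    fs.clear();
    fs.seekg(0);
} else {
    std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
}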
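The same idea can be packaged so that the signature is skipped once, right after the file is opened, and every later line goes through the unchanged line-by-line reader. The following is only a sketch under assumptions: `skipUtf8Bom` is a hypothetical helper, `readFromStream` is taken here to be `readFromFile` generalized to any `std::istream` (and made to report success), and the stream is assumed to be seekable.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: consumes a leading UTF-8 signature (EF BB BF) if
// present and returns true; otherwise rewinds to where it started and
// returns false. Assumes the stream supports tellg/seekg.
inline bool skipUtf8Bom(std::istream& in) {
    const std::istream::pos_type start = in.tellg();
    char buf[3] = {0, 0, 0};
    in.read(buf, 3);
    const bool hasBom = in.gcount() == 3 &&
                        buf[0] == '\xEF' && buf[1] == '\xBB' && buf[2] == '\xBF';
    if (!hasBom) {
        in.clear();       // a stream shorter than 3 bytes sets eofbit/failbit
        in.seekg(start);  // put the bytes back by rewinding
    }
    return hasBom;
}

// Assumed generalization of readFromFile: reads one line and extracts one
// value from it. Returns false if the line or the value could not be read.
template <typename Type>
bool readFromStream(std::istream& in, Type& val) {
    std::string str;
    if (!std::getline(in, str)) return false;
    std::istringstream ss(str);
    return static_cast<bool>(ss >> val);
}
```

Typical use would be to open the `std::ifstream`, call `skipUtf8Bom(in)` once, and then call `readFromStream` for each line. Because both functions take `std::istream&`, they can also be exercised with a `std::istringstream` in tests.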


Additionally, if you want to use wchar_t internally, the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume 'BOMs' for you. The only problem is that wchar_t is widely recognized to be worthless these days* and so you probably shouldn't do this.

// Needs <fstream>, <locale> and <codecvt>.
std::wifstream fin(filename);
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));

* wchar_t is worthless because it is specified to do just one thing: provide a fixed-size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be different characters in different locales, so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions).

The fixed-size representation itself is worthless for two reasons. First, many code points have semantic meanings, and so understanding text means you have to process multiple code points anyway. Second, some platforms such as Windows use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way is even conformant to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; if no locale supports any character outside the BMP, then UTF-16 could be seen as conformant.)
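The UTF-16 point can be illustrated with `char16_t`, which is always a UTF-16 code unit, so the example does not depend on the platform's wchar_t width. This is a small sketch; `countCodePoints` is a hypothetical helper and assumes well-formed UTF-16 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string by counting
// every code unit that is not a low (trailing) surrogate.
// Assumes the input is well-formed UTF-16.
inline std::size_t countCodePoints(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) ++n;  // skip low surrogates
    }
    return n;
}

// U+1F600 lies outside the BMP, so UTF-16 stores it as the surrogate pair
// D83D DE00. The string u"A\U0001F600" therefore has three code units but
// only two code points.
```

On a platform where wchar_t holds UTF-16 code units, indexing a wide string has the same property: one element is not necessarily one code point.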
