Ignore byte-order marks in C++, reading from a stream
Problem description
I have a function to read the value of one variable (integer, double, or boolean) on a single line in an ifstream:

    template <typename Type>
    void readFromFile(ifstream &in, Type &val)
    {
        string str;
        getline(in, str);
        stringstream ss(str);
        ss >> val;
    }

However, it fails on text files created with editors that insert a BOM (byte order mark) at the beginning of the first line, which unfortunately includes {Note,Word}pad. How can I modify this function to ignore the byte-order mark if it is present at the beginning of str?

Solution

(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should simply be avoided elsewhere.)
You could open the file as a UTF-8 file and then check whether the first character is U+FEFF. You can do this by opening a normal char-based fstream and then using wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t, so the following uses UTF-16 in wchar_t.

    // Requires <fstream>, <istream>, <locale>, and <codecvt>.
    std::fstream fs(filename);
    std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> wb(fs.rdbuf());
    std::wistream is(&wb);
    // If you don't do this on the stack, remember to destroy the objects in
    // reverse order of creation: is, then wb, then fs.
    std::wistream::int_type ch = is.get();
    const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
    if (ZERO_WIDTH_NO_BREAK_SPACE != ch)
        is.putback(ch);
    // Now the stream can be passed around and used without worrying about the
    // extra character in the stream.
    int i;
    readFromStream<int>(is, i);

Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because ignoring U+FEFF should only be done if it's the very first character in the whole file, if at all. It shouldn't be done anywhere else.
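As a minimal sketch of the "only the very first character" rule, the check can be wrapped in a helper that is called exactly once, right after the stream is set up. The name skipLeadingFeff is hypothetical, and the sketch uses peek() rather than the get()/putback() pair so that nothing has to be pushed back. It should work on any std::wistream, including one built over wbuffer_convert.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original answer): consumes one leading
// U+FEFF from a wide stream if it is the next character.
// Returns true when a character was skipped.
// Note: peek() on an empty stream sets eofbit as a side effect.
bool skipLeadingFeff(std::wistream& is) {
    typedef std::wistream::traits_type Traits;
    if (is.peek() == Traits::to_int_type(wchar_t(0xFEFF))) {
        is.get();  // discard the signature character
        return true;
    }
    return false;  // stream is left exactly where it was
}
```

Calling it a second time on the same stream returns false, because the next character is ordinary data. Later occurrences of U+FEFF in the file are never touched.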
On the other hand, if you're happy using a char-based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:

    std::fstream fs(filename);
    char a, b, c;
    a = fs.get();
    b = fs.get();
    c = fs.get();
    if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
        fs.clear();   // a file shorter than three bytes sets failbit; reset it so seekg works
        fs.seekg(0);  // no signature: rewind to the start
    } else {
        std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
    }

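The same idea can be packaged as a reusable function and combined with the reader from the question. This is only a sketch under a few assumptions: skipUtf8Bom is a hypothetical name, and readFromStream here is the question's readFromFile generalized to std::istream so it can be exercised without a real file. The BOM check runs once on the whole stream, and the per-line reader stays unchanged.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original answer): consumes a leading
// UTF-8 signature (EF BB BF) if present, otherwise leaves the stream where
// it was. Returns true when a signature was skipped.
bool skipUtf8Bom(std::istream& in) {
    const std::istream::pos_type start = in.tellg();
    char bom[3] = {0, 0, 0};
    in.read(bom, 3);
    if (in.gcount() == 3 &&
        bom[0] == '\xEF' && bom[1] == '\xBB' && bom[2] == '\xBF') {
        return true;
    }
    in.clear();       // a short read sets eofbit/failbit; reset before seeking
    in.seekg(start);  // no signature: rewind to where we began
    return false;
}

// The question's reader, generalized to any std::istream for this sketch.
template <typename Type>
void readFromStream(std::istream& in, Type& val) {
    std::string str;
    std::getline(in, str);
    std::stringstream ss(str);
    ss >> val;
}
```

In use, one would open the std::ifstream, call skipUtf8Bom(fs) once, and then call the reader for each line as before.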
Additionally, if you want to use wchar_t internally, the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume 'BOMs' for you. The only problem is that wchar_t is widely recognized to be worthless these days* and so you probably shouldn't do this.

    std::wifstream fin(filename);
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));

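To try the consume_header mode without touching the filesystem, the same facet can be plugged into wbuffer_convert over an in-memory byte buffer. This is a sketch, not the answer's code: parseFirstInt is a hypothetical helper, and the behavior shown is what one would expect from a conforming standard library. Note that the <codecvt> facets and wbuffer_convert are deprecated as of C++17, so newer compilers may emit deprecation warnings.

```cpp
#include <codecvt>
#include <istream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper for illustration: parses the first integer from a
// UTF-8 byte string, letting the facet swallow a leading BOM if present.
int parseFirstInt(const std::string& utf8Bytes) {
    typedef std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header> Facet;
    std::istringstream bytes(utf8Bytes);
    // wbuffer_convert default-constructs and owns its Facet instance.
    std::wbuffer_convert<Facet, wchar_t> wb(bytes.rdbuf());
    std::wistream is(&wb);
    int value = 0;
    is >> value;
    return value;
    // Locals are destroyed in reverse order: is, then wb, then bytes.
}
```

With this mode there is no explicit U+FEFF check in user code; the conversion facet drops the signature while decoding.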
* wchar_t is worthless because it is specified to do just one thing: provide a fixed-size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be different characters in different locales, so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions).

The fixed-size representation itself is worthless for two reasons. First, many code points have semantic meanings and so understanding text means you have to process multiple code points anyway. Second, some platforms such as Windows use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way is even conformant to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; if no locale supports any character outside the BMP, then UTF-16 could be seen as conformant.)
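The point about UTF-16 code units can be illustrated with the C++11 char16_t and char32_t string types, which have fixed encodings regardless of platform. The function names below are hypothetical, and U+1F600 is just an arbitrary code point outside the BMP chosen for the sketch.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-16 code units needed for U+1F600, a code point outside the
// BMP. It is encoded as a surrogate pair.
std::size_t utf16Units() {
    return std::u16string(u"\U0001F600").size();
}

// Number of UTF-32 code units for the same code point.
std::size_t utf32Units() {
    return std::u32string(U"\U0001F600").size();
}
```

On a platform where wchar_t holds UTF-16, a wide string containing this character has the same two-unit shape as the u16string, so indexing a single wchar_t yields half of a surrogate pair rather than a code point.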