How to replace/ignore invalid Unicode/UTF8 characters � from C stdio.h getline()?
Question
In Python, the open function accepts the option errors='ignore':
open( '/filepath.txt', 'r', encoding='UTF-8', errors='ignore' )
With this option, invalid UTF-8 characters in the file are replaced with nothing when it is read, i.e., they are ignored. For example, a file with the characters Føö»BÃ¥r is read as FøöBår.
If a line such as Føö»BÃ¥r is read with getline() from stdio.h, it is read as Føö�Bår:
FILE* cfilestream = fopen( "/filepath.txt", "r" );
size_t linebuffersize = 131072;  // getline() expects a size_t*, not an int*
char* readline = (char*) malloc( linebuffersize );
while( true )
{
    if( getline( &readline, &linebuffersize, cfilestream ) != -1 ) {
        std::cerr << "readline=" << readline << std::endl;
    }
    else {
        break;
    }
}
How can I make stdio.h getline() read it as FøöBår instead of Føö�Bår, i.e., ignoring invalid UTF-8 characters?
One heavy-handed solution I can think of is to iterate over all the characters of each line read and build a new readline without any of these characters. For example:
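FILE* cfilestream = fopen( "/filepath.txt", "r" );
size_t linebuffersize = 131072;  // getline() expects a size_t*
char* readline = (char*) malloc( linebuffersize );
char* fixedreadline = (char*) malloc( linebuffersize );  // too small if getline() grows readline
ssize_t index;
ssize_t charsread;  // getline() returns ssize_t
ssize_t invalidcharsoffset;
while( true )
{
    if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
    {
        invalidcharsoffset = 0;
        for( index = 0; index < charsread; ++index )
        {
            // '�' is three bytes in a UTF-8 source file, so this multi-character
            // constant is not expected to compare equal to any single char.
            if( readline[index] != '�' ) {
                fixedreadline[index-invalidcharsoffset] = readline[index];
            }
            else {
                ++invalidcharsoffset;
            }
        }
        fixedreadline[charsread-invalidcharsoffset] = '\0';  // terminate the copy
        std::cerr << "fixedreadline=" << fixedreadline << std::endl;
    }
    else {
        break;
    }
}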
Related questions:
- Fixing invalid UTF8 characters
- Replacing non UTF8 characters
- python replace unicode characters
- Python unicode: how to replace character that cannot be decoded using utf8 with whitespace?
Recommended answer
You are confusing what you see with what is really going on. The getline function does not do any replacement of characters. [Note 1]
You are seeing a replacement character (U+FFFD) because your console outputs that character when it is asked to render an invalid UTF-8 sequence. Most consoles do that when they are in UTF-8 mode, that is, when the current locale is UTF-8.
Also, saying that a file contains the "characters Føö»BÃ¥r" is at best imprecise. A file does not really contain characters. It contains byte sequences which may be interpreted as characters according to some encoding (for example, by a console or other user presentation software which renders them into glyphs). Different encodings produce different results; in this particular case, you have a file which was created by software using the Windows-1252 encoding (or, roughly equivalently, ISO 8859-15), and you are rendering it on a console using UTF-8.
What that means is that the data read by getline contains an invalid UTF-8 sequence, but it (probably) does not contain the replacement character code. Based on the character string you present, it contains the byte \xbb, which is a guillemet (») in Windows code page 1252.
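One way to check this for yourself is to look at the raw bytes instead of trusting what the console renders. The sketch below is only an illustration: the helper names hex_dump and contains_replacement_char are made up for this example and are not part of any library. The first formats bytes as hex, and the second looks for the UTF-8 encoding of U+FFFD (the bytes EF BF BD).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats the first n bytes of buf as space-separated
 * two-digit hex values into out. Returns the number of characters written
 * (not counting the NUL), or 0 if out is too small. */
size_t hex_dump(const char *buf, size_t n, char *out, size_t outsize) {
    size_t pos = 0;
    if (outsize == 0) return 0;
    out[0] = '\0';
    for (size_t i = 0; i < n; ++i) {
        /* Worst case per byte: separator + two hex digits + NUL. */
        if (pos + 4 > outsize) { out[0] = '\0'; return 0; }
        if (i > 0) out[pos++] = ' ';
        snprintf(out + pos, outsize - pos, "%02x", (unsigned)(unsigned char)buf[i]);
        pos += 2;
    }
    return pos;
}

/* Hypothetical helper: returns 1 if the n bytes at buf contain the UTF-8
 * encoding of U+FFFD (EF BF BD), and 0 otherwise. */
int contains_replacement_char(const char *buf, size_t n) {
    static const unsigned char fffd[] = { 0xEF, 0xBF, 0xBD };
    for (size_t i = 0; i + sizeof fffd <= n; ++i) {
        if (memcmp(buf + i, fffd, sizeof fffd) == 0) return 1;
    }
    return 0;
}
```

Calling hex_dump on the buffer filled by getline should show a lone bb byte where the console printed �, and contains_replacement_char should report 0 for that line.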
Finding all the invalid UTF-8 sequences in a string read by getline (or any other C library function which reads files) requires scanning the string, but not for a particular code sequence. Rather, you need to decode UTF-8 sequences one at a time, looking for the ones which are not valid. That's not a simple task, but the mbtowc function can help (if you have enabled a UTF-8 locale). As you'll see in its manpage, mbtowc returns the number of bytes contained in a valid "multibyte sequence" (which is UTF-8 in a UTF-8 locale), or -1 to indicate an invalid or incomplete sequence. In the scan, you should pass through the bytes in a valid sequence, or remove/ignore the single byte starting an invalid sequence, and then continue the scan until you reach the end of the string.
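Since mbtowc only decodes UTF-8 when a UTF-8 locale is active, the locale has to be set first. The following is a minimal sketch of one way to do that on a POSIX system. The function name enable_utf8_locale is made up for this example, and the locale names "C.UTF-8" and "en_US.UTF-8" are assumptions about what happens to be installed; they are not guaranteed to exist on every system.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: tries to activate a UTF-8 locale for character
 * handling (LC_CTYPE). Returns 1 if the active codeset is then reported
 * as "UTF-8", and 0 otherwise. */
int enable_utf8_locale(void) {
    /* "" means "use the environment"; the other names are assumptions. */
    static const char *const candidates[] = { "", "C.UTF-8", "en_US.UTF-8" };
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL &&
            strcmp(nl_langinfo(CODESET), "UTF-8") == 0) {
            return 1;
        }
    }
    return 0;
}
```

If this returns 0, mbtowc will not be decoding UTF-8, so a scan based on it should not be trusted.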
Here's some lightly-tested example code (in C):
#include <stdlib.h>
#include <string.h>

/* Removes in place any invalid UTF-8 sequences from at most 'len' characters of the
 * string pointed to by 's'. (If a NUL byte is encountered, conversion stops.)
 * If the length of the converted string is less than 'len', a NUL byte is
 * inserted.
 * Returns the length of the possibly modified string (with a maximum of 'len'),
 * not including the NUL terminator (if any).
 * Requires that a UTF-8 locale be active; this function makes no attempt to
 * verify that. If the current locale is not UTF-8, behaviour is undefined.
 */
size_t remove_bad_utf8(char* s, size_t len) {
    char* in = s;
    /* Skip over the initial correct sequence. Avoid relying on mbtowc returning
     * zero if n is 0, since Posix is not clear whether mbtowc returns 0 or -1.
     */
    int seqlen;
    while (len && (seqlen = mbtowc(NULL, in, len)) > 0) { len -= seqlen; in += seqlen; }
    char* out = in;
    if (len && seqlen < 0) {
        /* Drop the single byte which starts the invalid sequence. */
        ++in;
        --len;
        /* If we find an invalid sequence, we need to start shifting correct sequences. */
        for (; len; in += seqlen, len -= seqlen) {
            seqlen = mbtowc(NULL, in, len);
            if (seqlen > 0) {
                /* Shift the valid sequence (if one was found) */
                memmove(out, in, seqlen);
                out += seqlen;
            }
            else if (seqlen < 0) seqlen = 1;  /* skip one invalid byte */
            else /* (seqlen == 0) */ break;   /* NUL byte: stop */
        }
        /* Terminate without advancing 'out', so the NUL is not counted. */
        *out = 0;
    }
    return out - s;
}
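If depending on the current locale is inconvenient, the same scan can be done with a hand-written UTF-8 check instead of mbtowc. To be clear, this is a different technique from the function above, and only a sketch: the names utf8_seq_len and remove_bad_utf8_nolocale are made up for this example. It also differs in one detail, which is that a NUL byte is treated as an ordinary valid character rather than stopping the scan.

```c
#include <stddef.h>

/* Hypothetical helper: returns the length (1 to 4) of the valid UTF-8
 * sequence starting at p, given n available bytes, or 0 if the sequence
 * is invalid or truncated. */
static size_t utf8_seq_len(const unsigned char *p, size_t n) {
    size_t need;
    unsigned long cp, min;
    if (n == 0) return 0;
    if (p[0] < 0x80) return 1;                        /* ASCII */
    if (p[0] >= 0xC2 && p[0] <= 0xDF)      { need = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0)        { need = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if (p[0] >= 0xF0 && p[0] <= 0xF4) { need = 4; cp = p[0] & 0x07; min = 0x10000; }
    else return 0;               /* stray continuation byte or invalid lead byte */
    if (n < need) return 0;      /* truncated sequence */
    for (size_t i = 1; i < need; ++i) {
        if ((p[i] & 0xC0) != 0x80) return 0;          /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min) return 0;                           /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;       /* UTF-16 surrogate */
    if (cp > 0x10FFFF) return 0;                      /* beyond Unicode range */
    return need;
}

/* Removes in place every byte of s[0..len) which is not part of a valid
 * UTF-8 sequence. Returns the new length; if that is less than len, a NUL
 * byte is written after the last kept byte. */
size_t remove_bad_utf8_nolocale(char *s, size_t len) {
    unsigned char *in = (unsigned char *)s;
    unsigned char *out = in;
    size_t i = 0;
    while (i < len) {
        size_t n = utf8_seq_len(in + i, len - i);
        if (n == 0) { ++i; continue; }                /* drop one bad byte */
        for (size_t k = 0; k < n; ++k) *out++ = in[i + k];
        i += n;
    }
    if ((size_t)(out - in) < len) *out = '\0';
    return (size_t)(out - in);
}
```

With getline, either function would be called on each line as it is read, passing the buffer and the byte count that getline returned, before the line is printed or processed further.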
Notes
- Aside from the possible line-end transformation performed by the underlying I/O library, which replaces CR-LF with a single \n on systems like Windows, where the two-character CR-LF sequence is used as a line-end indication.