How to replace/ignore invalid Unicode/UTF8 characters � from C stdio.h getline()?


Question

In Python, the built-in open function has an errors='ignore' option:

open( '/filepath.txt', 'r', encoding='UTF-8', errors='ignore' )

With this option, invalid UTF8 characters in the file are replaced with nothing as it is read, i.e., they are ignored. For example, a file with the characters Føö»BÃ¥r is read as FøöBår.

If a line such as Føö»BÃ¥r is read with getline() from stdio.h, it is read as Føö�Bår:

FILE* cfilestream = fopen( "/filepath.txt", "r" );
size_t linebuffersize = 131072;  // getline() takes a size_t*, not an int*
char* readline = (char*) malloc( linebuffersize );

while( true )
{
    if( getline( &readline, &linebuffersize, cfilestream ) != -1 ) {
        std::cerr << "readline=" << readline << std::endl;
    }
    else {
        break;
    }
}

How can I make stdio.h getline() read it as FøöBår instead of Føö�Bår, i.e., ignore invalid UTF8 characters?

One heavy-handed solution I can think of is to iterate over all the characters of each line read and build a new readline without any of these characters. For example:

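FILE* cfilestream = fopen( "/filepath.txt", "r" );
size_t linebuffersize = 131072;  // getline() takes a size_t*, not an int*
char* readline = (char*) malloc( linebuffersize );
char* fixedreadline = (char*) malloc( linebuffersize );

ssize_t index;
ssize_t charsread;  // getline() returns ssize_t
ssize_t invalidcharsoffset;

while( true )
{
    if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
    {
        invalidcharsoffset = 0;
        for( index = 0; index < charsread; ++index )
        {
            // Note: '�' is a multi-byte sequence (EF BF BD in UTF-8), so this
            // multi-character constant cannot match a single char.
            if( readline[index] != '�' ) {
                fixedreadline[index-invalidcharsoffset] = readline[index];
            }
            else {
                ++invalidcharsoffset;
            }
        }
        // Terminate the copy so it can be printed as a C string.
        fixedreadline[charsread-invalidcharsoffset] = '\0';
        std::cerr << "fixedreadline=" << fixedreadline << std::endl;
    }
    else {
        break;
    }
}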

Related questions:

  1. Fixing invalid UTF8 characters
  2. Replacing non UTF8 characters
  3. python replace unicode characters
  4. Python unicode: how to replace character that cannot be decoded using utf8 with whitespace?

Recommended answer

You are confusing what you see with what is really going on. The getline function does not replace any characters. [Note 1]

You are seeing a replacement character (U+FFFD) because your console outputs that character when it is asked to render an invalid UTF-8 sequence. Most consoles will do that if they are in UTF-8 mode; that is, if the current locale is UTF-8.

Also, saying that a file contains the "characters Føö»BÃ¥r" is at best imprecise. A file does not really contain characters. It contains byte sequences which may be interpreted as characters according to some encoding (for example, by a console or other user presentation software which renders them into glyphs). Different encodings produce different results; in this particular case, you have a file which was created by software using the Windows-1252 encoding (or, roughly equivalently, ISO 8859-15), and you are rendering it on a console using UTF-8.

What that means is that the data read by getline contains an invalid UTF-8 sequence, but it (probably) does not contain the replacement character code. Based on the character string you present, it contains the byte \xbb, which is a right-pointing guillemet (») in Windows code page 1252.
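One way to check this is to print the bytes that getline actually stored, rather than letting the console render them. The following is a minimal sketch; hex_bytes is a hypothetical helper name, not a library function, and it assumes the caller supplies an output buffer of at least three bytes per input byte.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes the first 'len' bytes of 's' into 'out' as
 * space-separated two-digit hex values, stopping early if 'out' is full. */
static void hex_bytes(const char *s, size_t len, char *out, size_t outsize) {
  assert(s != NULL && out != NULL && outsize > 0);
  size_t pos = 0;
  out[0] = '\0';
  /* Each step writes at most three characters plus the terminating NUL. */
  for (size_t i = 0; i < len && pos + 3 < outsize; ++i) {
    pos += (size_t)snprintf(out + pos, outsize - pos,
                            i ? " %02x" : "%02x", (unsigned char)s[i]);
  }
}
```

Applied to a line read by getline, a lone bb in the output (not preceded by a lead byte such as c2) would confirm that the buffer holds the raw \xbb byte and not the three bytes ef bf bd that encode U+FFFD.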

Finding all the invalid UTF-8 sequences in a string read by getline (or any other C library function which reads files) requires scanning the string, but not for a particular code sequence. Rather, you need to decode UTF-8 sequences one at a time, looking for the ones which are not valid. That's not a simple task, but the mbtowc function can help (if you have enabled a UTF-8 locale). As you'll see in its manpage, mbtowc returns the number of bytes contained in a valid "multibyte sequence" (which is UTF-8 in a UTF-8 locale), or -1 to indicate an invalid or incomplete sequence. In the scan, you should pass through the bytes of each valid sequence, or remove/ignore the single byte starting an invalid sequence, and then continue the scan until you reach the end of the string.

Here's some lightly-tested example code (in C):

#include <stdlib.h>
#include <string.h>

/* Removes in place any invalid UTF-8 sequences from at most 'len' characters of the
 * string pointed to by 's'. (If a NUL byte is encountered, conversion stops.)
 * If the length of the converted string is less than 'len', a NUL byte is
 * inserted.
 * Returns the length of the possibly modified string (with a maximum of 'len'),
 * not including the NUL terminator (if any).
 * Requires that a UTF-8 locale be active; since there is no way to test for
 * this condition, no attempt is made to do so. If the current locale is not UTF-8,
 * behaviour is undefined.
 */
size_t remove_bad_utf8(char* s, size_t len) {
  char* in = s;
  /* Skip over the initial correct sequence. Avoid relying on mbtowc returning
   * zero if n is 0, since Posix is not clear whether mbtowc returns 0 or -1.
   */
  int seqlen;
  while (len && (seqlen = mbtowc(NULL, in, len)) > 0) { len -= seqlen; in += seqlen; }
  char* out = in;

  if (len && seqlen < 0) {
    ++in;
    --len;
    /* If we find an invalid sequence, we need to start shifting correct sequences.  */
    for (; len; in += seqlen, len -= seqlen) {
      seqlen = mbtowc(NULL, in, len);
      if (seqlen > 0) {
        /* Shift the valid sequence (if one was found) */
        memmove(out, in, seqlen);
        out += seqlen;
      }
      else if (seqlen < 0) seqlen = 1;
      else /* (seqlen == 0) */ break;
    }
    /* Terminate without advancing 'out', so the returned length excludes the NUL. */
    *out = 0;
  }
  return out - s;
}
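
If enabling a UTF-8 locale is not convenient, the same scan can be written without mbtowc by checking the byte patterns directly. The following is a sketch of that alternative, not the code above; the names utf8_seq_len and strip_invalid_utf8 are made up for illustration. It assumes the usual well-formedness rules for UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF) and, unlike the function above, it treats a NUL byte as an ordinary valid byte instead of stopping there.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length in bytes (1 to 4) of the well-formed UTF-8 sequence at
 * the start of s[0..len), or 0 if the bytes there do not form one. */
static size_t utf8_seq_len(const unsigned char *s, size_t len) {
  if (len == 0) return 0;
  unsigned char b0 = s[0];
  unsigned char lo = 0x80, hi = 0xBF; /* allowed range of the second byte */
  size_t need;
  if (b0 < 0x80) return 1;
  else if (b0 >= 0xC2 && b0 <= 0xDF) need = 2;
  else if (b0 >= 0xE0 && b0 <= 0xEF) {
    need = 3;
    if (b0 == 0xE0) lo = 0xA0;      /* rejects overlong encodings */
    else if (b0 == 0xED) hi = 0x9F; /* rejects surrogates */
  } else if (b0 >= 0xF0 && b0 <= 0xF4) {
    need = 4;
    if (b0 == 0xF0) lo = 0x90;      /* rejects overlong encodings */
    else if (b0 == 0xF4) hi = 0x8F; /* rejects values above U+10FFFF */
  } else return 0; /* stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF */
  if (len < need) return 0;         /* truncated sequence */
  if (s[1] < lo || s[1] > hi) return 0;
  for (size_t i = 2; i < need; ++i)
    if (s[i] < 0x80 || s[i] > 0xBF) return 0;
  return need;
}

/* Removes, in place, every byte of s[0..len) that does not belong to a
 * well-formed UTF-8 sequence. Returns the new length; if the result is
 * shorter than 'len', a NUL byte is written after it. */
size_t strip_invalid_utf8(char *s, size_t len) {
  assert(s != NULL);
  unsigned char *buf = (unsigned char *)s;
  size_t in = 0, out = 0;
  while (in < len) {
    size_t n = utf8_seq_len(buf + in, len - in);
    if (n == 0) { ++in; continue; } /* drop one offending byte and rescan */
    for (size_t i = 0; i < n; ++i) buf[out++] = buf[in++];
  }
  if (out < len) buf[out] = '\0';
  return out;
}
```

With getline, this could be called as strip_invalid_utf8(readline, charsread) after each successful read. Since the output index never overtakes the input index, the copy can safely be done in the same buffer.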


Notes

  1. Aside from the possible line-end transformation performed by the underlying I/O library, which replaces CR-LF with a single \n on systems like Windows, where the two-character CR-LF sequence is used as a line-end indication.
