Getting error while reading unicode file in C

Question

I want to read a unicode file in C (Cygwin/GCC) using the following code:

#include <stdio.h>
#include <stdlib.h>
#include <glib.h>


void split_parse(char* text){
    char** res = g_strsplit(text, "=", 2);
    printf("Key = %s : ", res[0]);
    printf("Value = %s", res[1]);
    printf("\n");
}

int main(int argc, char **argv)
{
    setenv ("CYGWIN", "nodosfilewarning", 1);

    GIOChannel *channel;
    GError *err = NULL;
    int reading = 0;
    const gchar* enc;
    guchar magic[2] = { 0 };
    gsize bytes_read = 0;

    const char* filename = "C:\\CONFIG";


    channel = g_io_channel_new_file (filename, "r", &err);

    if (!channel) {
        g_print("%s", err->message);
        return 1;
    }

    if (g_io_channel_set_encoding(channel, NULL, &err) != G_IO_STATUS_NORMAL) {
        g_print("g_io_channel_set_encoding: %s\n", err->message);
        return 1;
    }

    if (g_io_channel_read_chars(channel, (gchar*) magic, 2, &bytes_read, &err) != G_IO_STATUS_NORMAL) {
        g_print("g_io_channel_read_chars: %s\n", err->message);
        return 1;
    }

    if (magic[0] == 0xFF && magic[1] == 0xFE)
    {
        enc = "UTF-16LE";
    }
    else if (magic[0] == 0xFE && magic[1] == 0xFF)
    {
        enc = "UTF-16BE";
    }
    else
    {
        enc = "UTF-8";
        if (g_io_channel_seek_position(channel, 0, G_SEEK_CUR, &err) == G_IO_STATUS_ERROR)
        {
            g_print("g_io_channel_seek: failed\n");
            return 1;
        }
    }

    if (g_io_channel_set_encoding (channel, enc, &err) != G_IO_STATUS_NORMAL) {
        g_print("%s", err->message);
        return 1;
    }

    reading = 1;
    GIOStatus status;
    char* str = NULL;
    size_t len;

    while(reading){

        status = g_io_channel_read_line(channel, &str, &len, NULL, &err);
        switch(status){
            case G_IO_STATUS_EOF:
                reading = 0;
                break;
            case G_IO_STATUS_NORMAL:
                if(len == 0) continue;
                split_parse(str);
                break;
            case G_IO_STATUS_AGAIN: continue;
            case G_IO_STATUS_ERROR:
            default:
                //throw error;
                reading = 0;
                break;
        }
    }

    g_free(str);
    g_io_channel_unref(channel);

    return(EXIT_SUCCESS);
}

The file (C:\CONFIG) content is as follows:

h-debug="1"
name=ME
ÃÆÿЮ©=2¾1¼

While reading it, I always get the following error message from g_io_channel_read_line inside the while loop:

0x800474f8 "Invalid byte sequence in conversion input"

What am I doing wrong? How to read a file like this in C using glib?

Edit: hexdump of the file

Recommended answer

Your file begins with the 3-byte UTF-8 byte-order mark (BOM): EF BB BF.

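Your code defaults to UTF-8 but does not consume the BOM. After the 2-byte magic read, the third BOM byte (BF) is still next in the stream, and a lone BF is not a valid start of a UTF-8 sequence, hence the conversion error.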

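In the seek call, the arguments

channel, 0, G_SEEK_CUR, &err

should be

channel, 3, G_SEEK_SET, &err

The seek has to be absolute: two bytes have already been read, so a relative seek of 3 would land on byte 5 and drop the first two characters of the file.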

Further, I would recommend extending your magic code to read 4 bytes and affirmatively discern the BOM.

If you do not find a BOM, you could assume encoding NULL, which I think means binary. Or throw an error, or fix the wayward text file, or, if you are pedantic, try each known encoding type in turn.

UTF32BE "\x00\x00\xFE\xFF"
UTF32LE "\xFF\xFE\x00\x00"
UTF8 "\xEF\xBB\xBF"
UTF16BE "\xFE\xFF"
UTF16LE "\xFF\xFE"
NULL for binary
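
The table above can be turned into a small lookup. What follows is only a sketch, not GLib API: `detect_bom` is a hypothetical helper name, and the encoding strings assume iconv-style names such as "UTF-32LE" are accepted by the conversion backend. It expects the caller to have read up to four raw bytes with the channel encoding set to NULL. The 4-byte signatures are tested first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of GLib): inspects up to the first four
 * bytes of a file. Returns the encoding name implied by the BOM, or NULL
 * when no BOM is recognised. *bom_len receives the number of bytes the
 * BOM occupies (0 when there is none). */
const char *detect_bom(const unsigned char *buf, size_t n, size_t *bom_len)
{
    /* Longest signatures first: FF FE 00 00 (UTF-32LE) would otherwise
     * be mistaken for FF FE (UTF-16LE). */
    static const struct {
        const char *name;
        unsigned char sig[4];
        size_t len;
    } boms[] = {
        { "UTF-32BE", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
        { "UTF-32LE", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
        { "UTF-8",    { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
        { "UTF-16BE", { 0xFE, 0xFF, 0x00, 0x00 }, 2 },
        { "UTF-16LE", { 0xFF, 0xFE, 0x00, 0x00 }, 2 },
    };

    for (size_t i = 0; i < sizeof boms / sizeof boms[0]; i++) {
        /* Compare only the significant bytes of each signature. */
        if (n >= boms[i].len && memcmp(buf, boms[i].sig, boms[i].len) == 0) {
            *bom_len = boms[i].len;
            return boms[i].name;
        }
    }

    *bom_len = 0;
    return NULL;
}
```

The returned length can serve as the absolute offset for g_io_channel_seek_position with G_SEEK_SET, and the returned name as the argument to g_io_channel_set_encoding. What to do when the result is NULL (assume UTF-8, treat as binary, or report an error) is left to the caller. One caveat of this ordering: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE.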
