How to process CSV lines with a NUL char in some elements?

Problem description

When reading and parsing a line of a CSV file, I need to process the NUL character that appears as the value of some row fields. This is complicated by the fact that the CSV file is sometimes in windows-1250 encoding, sometimes in UTF-8, and sometimes in UTF-16. Because of this, I had already settled on an approach before I ran into the NUL character problem (see below).

Details: I need to clean CSV files from a third party into the form common to our data extractor (that is, the utility works as a filter, converting one CSV form to another CSV form).

My initial approach was to open the CSV file in binary mode and check whether the first bytes form a BOM. I know that all the given Unicode files start with a BOM. If there is no BOM, I know that the file is in windows-1250 encoding. The converted CSV file should use the windows-1250 encoding. So, after checking the input file, I open it using the related mode, like this:

// Open the file in binary mode first to see whether BOM is there or not.
FILE * fh{ nullptr };
errno_t err = fopen_s(&fh, fnameIn.string().c_str(), "rb"); // const fs::path & fnameIn
assert(err == 0);
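// Use unsigned char: plain char is signed by default on MSVC, so a
// comparison such as buf[0] == 0xEF would never be true.
vector<unsigned char> buf(4, 0);
fread(&buf[0], 1, 3, fh);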
::fclose(fh);

// Set the isUnicode flag and open the file according to that.
string mode{ "r" };     // init 
bool isUnicode = false; // pessimistic init

if (buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) // UTF-8 BOM
{
    mode += ", ccs=UTF-8";
    isUnicode = true;
}
else if ((buf[0] == 0xFE && buf[1] == 0xFF)     // UTF-16 BE BOM
      || (buf[0] == 0xFF && buf[1] == 0xFE))    // UTF-16 LE BOM
{
    mode += ", ccs=UNICODE";
    isUnicode = true;
}

// Open in the suitable mode.
err = fopen_s(&fh, fnameIn.string().c_str(), mode.c_str());
assert(err == 0);

After the file is opened successfully, each input line is read via either fgets or fgetws, depending on whether Unicode was detected. The idea was then to convert the buffer content from Unicode to 1250 if Unicode was detected earlier, or to leave the buffer as it is if it is already in 1250. The s variable should contain the string in the windows-1250 encoding. ATL::CW2A(buf, 1250) is used when conversion is needed:

    const int bufsize = 4096;
    wchar_t buf[bufsize];

    // Read the line from the input according to the isUnicode flag.
    while (isUnicode ? (fgetws(buf, bufsize, fh) != NULL)
        : (fgets(reinterpret_cast<char*>(buf), bufsize, fh) != NULL))
    {
        // If the input is in Unicode, convert the buffer content
        // to the string in cp1250. Otherwise, do not touch it.
        string s;
        if (isUnicode)  s = ATL::CW2A(buf, 1250);
        else            s = reinterpret_cast<char*>(buf);
        ...
        // Now processing the characters of the `s` to form the output file
    }

It worked fine... until a file appeared in which a NUL character was used as a value in a row. The problem is that when the s variable is assigned, the NUL cuts off the rest of the line. In the observed case, it happened with a file that used the 1250 encoding, but it can probably happen in the UTF-encoded files as well.
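
The truncation described above comes from building a std::string out of a bare char pointer, which stops at the first NUL. The following is a minimal, portable sketch of that effect and of one way around it for the narrow (1250) case; the helper names readLineKeepingNuls and demoNulTruncation are hypothetical and not part of the original code, and an in-memory stream stands in for the real file.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: read one line without losing embedded NUL bytes.
// std::getline stops only at the delimiter and stores the real length
// in the std::string, so a NUL inside the line is kept as ordinary data.
inline std::string readLineKeepingNuls(std::istream& in)
{
    std::string line;
    std::getline(in, line);
    return line;
}

// Self-check contrasting the truncating and the length-aware construction.
inline void demoNulTruncation()
{
    // One CSV line whose second field is a single NUL character.
    const char raw[] = { 'a', ';', '\0', ';', 'b', '\n' };
    const std::string data(raw, sizeof raw);
    std::istringstream in(data);

    const std::string line = readLineKeepingNuls(in);
    assert(line.size() == 5);                // a ; NUL ; b

    // Same effect as `s = reinterpret_cast<char*>(buf)` in the question:
    // the const char* constructor measures the text with strlen.
    const std::string cut = line.c_str();
    assert(cut.size() == 2);                 // only "a;" survives

    // Passing the length explicitly keeps every byte.
    const std::string whole(line.data(), line.size());
    assert(whole.size() == 5);
}
```

With fgets the real line length is not reported, so a reader like this (or any API that returns the number of bytes read) is one possible way to obtain the length that the explicit-length construction needs.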

How can this be solved?

Recommended answer

The NUL character problem can be solved by using either C++ or Windows functions. In this case, the easiest solution is MultiByteToWideChar, which accepts an explicit string length precisely so that it does not stop at a NUL.
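
The answer names MultiByteToWideChar. For the Unicode-to-1250 direction used in the question, its counterpart WideCharToMultiByte also takes an explicit length. The following is a minimal sketch of that idea, not the answerer's code: the helper name wideToCp1250 is hypothetical, the line is assumed to be already held in a std::wstring with its real length, and the non-Windows branch is only a placeholder so that the sketch compiles elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: convert a UTF-16 line to windows-1250 bytes.
// The length is passed explicitly, so embedded NULs are converted
// like any other character instead of ending the string.
inline std::string wideToCp1250(const std::wstring& w)
{
    if (w.empty()) return std::string();   // a zero length makes the API fail
#ifdef _WIN32
    const int inLen = static_cast<int>(w.size());

    // First call: ask how many bytes the converted text needs.
    const int outLen = ::WideCharToMultiByte(1250, 0, w.data(), inLen,
                                             nullptr, 0, nullptr, nullptr);
    if (outLen <= 0) return std::string();

    // Second call: convert into a buffer of exactly that size.
    std::string out(static_cast<std::size_t>(outLen), '\0');
    const int written = ::WideCharToMultiByte(1250, 0, w.data(), inLen,
                                              &out[0], outLen, nullptr, nullptr);
    assert(written == outLen);
    (void)written;
    return out;
#else
    // Placeholder for non-Windows builds only (an assumption of this sketch):
    // ASCII passes through unchanged, everything else becomes '?'.
    std::string out;
    out.reserve(w.size());
    for (wchar_t c : w)
        out.push_back(static_cast<unsigned long>(c) < 0x80 ? static_cast<char>(c) : '?');
    return out;
#endif
}
```

In the question's loop this would take the place of `s = ATL::CW2A(buf, 1250)`, provided the real length of the line is known. fgetws does not report it, so the line would have to be read in a way that does.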
