Parsing Unicode files in C++


Problem description


Hi, I am having real trouble adding Unicode support to my code. I have perfectly working ASCII code, but it breaks as soon as some Chinese and Korean characters are entered. I searched the web for sample code or a guide but could not find enough reliable information, so I am looking for someone who has already done this and can help me fix it. My task is simple:
- read a Unicode (UTF-16) text file
- scan it line by line and parse each line into tokens
- use delimiters to keep the tokens I need and ignore the rest
- store these tokens in an array-like structure and do some string comparisons on them

I am using Windows with the Code::Blocks editor and MinGW.
I am pasting part of my code below. Any help is greatly appreciated, and sample code would be great.
----------------------------------------

#include <iostream>
#include <windows.h>
#include <string>     // std::wstring
#include <algorithm>
#include <cstring>
#include <cwchar>     // wcstok
#include <cwctype>    // towlower
#include <fstream>
const int MAX_CHARS_PER_LINE = 4072;  
const int MAX_TOKENS_PER_LINE = 1;      
const wchar_t* const DELIMITER = L"\"";

class IntegrityCheck
{
    public:
        std::wstring Profile_Container[5000][4];
        void Profile_PRD_Parser();
};

 void IntegrityCheck::Profile_PRD_Parser()
{

std::wstring skip (L".exe");
std::wstring databoxtemp[1][1];
int a=-1;

// create a file-reading object
std::wifstream fin("profiles.prd");  // open the input file
std::wofstream fout("out.txt");      // this receives the parsing output

// read each line of the file; the loop ends when a read fails (including at end of file)
while (true)
{
    // read an entire line into memory
    wchar_t buf[MAX_CHARS_PER_LINE];

    if (!fin.getline(buf, MAX_CHARS_PER_LINE)) break;

    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index

    // array to store memory addresses of the tokens in buf
    const wchar_t* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0

    // parse the line
    token[0] = wcstok(buf, DELIMITER); // first token

    if (token[0]) // zero if line is blank
    {

        for (n = 0; n < MAX_TOKENS_PER_LINE; n++)   // setting n=0 as we want to ignore the first token
        {
            token[n] = wcstok(0, DELIMITER); // subsequent tokens

            if (!token[n]) break; // no more tokens

            std::wstring str2 =token[n];

            std::size_t found = str2.find(skip);  // substring comparison against ".exe"

            if (found != std::wstring::npos)   // if it is an exe, write it to the output file on a new line
            {
                a++;
                Profile_Container[a][0]=token[n];
                std::transform(Profile_Container[a][2].begin(), Profile_Container[a][2].end(), Profile_Container[a][2].begin(), ::towlower);  // convert the data to lower case (towlower is the wide-character counterpart of tolower)

                fout<<Profile_Container[a][0]<<"\t"<<Profile_Container[a][1]<<"\t"<<Profile_Container[a][2]<<"\n"; //write to file
            }

        }
    }

}

fout.close();
fin.close();
}

int main()
{
IntegrityCheck p1;
p1.Profile_PRD_Parser();
}     

Recommended answer

You should step through your code in the debugger with a sample input file.

I would change your code in these places:

// read an entire line into memory
    wchar_t buf[MAX_CHARS_PER_LINE] = {0};

// token[0] was already read above, so start at 1
// (MAX_TOKENS_PER_LINE must be greater than 1 for this loop to run)
for (n = 1; n < MAX_TOKENS_PER_LINE; n++)

Switching to UTF-16 is often not the way to go. UTF-8 is often a better choice for many reasons. It may consume a bit more memory for some East Asian languages, but it is usually much more desirable when you consider porting to other platforms (like Unix) and integrating with legacy code that uses normal char pointers (like your original ASCII code). If you use UTF-8, your original code should keep working largely unchanged as long as your delimiters are ASCII characters, because every byte of a multi-byte UTF-8 sequence has its high bit set and so can never be mistaken for an ASCII delimiter. Original ASCII parsers and text processors often work with UTF-8 data without any problems.
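As a rough illustration of why byte-oriented parsing tends to survive UTF-8 input, the sketch below splits a UTF-8 line on the quote delimiter used in the question and then runs the same kind of ".exe" substring check. The function name `split_on` is hypothetical and not part of the original code; it assumes the delimiter is a single ASCII character.

```cpp
#include <string>
#include <vector>

// Split a UTF-8 encoded line on a single ASCII delimiter.
// Empty tokens are skipped, similar to how strtok/wcstok treat
// runs of delimiters. Multi-byte characters pass through untouched
// because none of their bytes can equal an ASCII delimiter.
std::vector<std::string> split_on(const std::string& line, char delimiter)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    while (start <= line.size()) {
        std::string::size_type end = line.find(delimiter, start);
        if (end == std::string::npos) end = line.size();
        if (end > start) tokens.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return tokens;
}
```

A caller could then keep only the tokens for which `token.find(".exe") != std::string::npos`, exactly as in the ASCII version. Note that case conversion with `tolower` would still only affect ASCII letters in such tokens.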

Instead of porting all of your text-processing logic, decide that all of your code works with UTF-8, and write importers (converters from a foreign encoding to UTF-8) and exporters (converters from UTF-8 to a foreign encoding) for every other encoding. In your case that means a UTF-16 to UTF-8 converter for loading and a UTF-8 to UTF-16 converter for saving. Better still would be to get the file in UTF-8 in the first place, so that no conversion is needed. Note that UTF files *MAY* start with a BOM (byte order mark), but this is not required. Some smart text editors (like Notepad++) can often detect the encoding even without a BOM. If you read the file in binary mode, your loader code may need to handle the BOM itself: detect or check it, then skip it.
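A minimal sketch of such an importer follows, assuming the whole file has already been read into a `std::string` in binary mode. The names `append_utf8` and `utf16_bytes_to_utf8` are made up for illustration. When no BOM is present the sketch assumes little-endian data (the common case on Windows), which is a guess rather than real encoding detection. Unpaired surrogates are replaced with U+FFFD and a trailing odd byte is ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to a UTF-8 string.
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert raw UTF-16 file contents to UTF-8.
// A leading BOM (FF FE or FE FF) selects the byte order and is skipped.
std::string utf16_bytes_to_utf8(const std::string& bytes)
{
    std::size_t pos = 0;
    bool big_endian = false;
    if (bytes.size() >= 2) {
        unsigned char b0 = bytes[0], b1 = bytes[1];
        if (b0 == 0xFF && b1 == 0xFE) { pos = 2; }
        else if (b0 == 0xFE && b1 == 0xFF) { big_endian = true; pos = 2; }
    }
    // Read the 16-bit code unit that starts at byte offset i.
    auto unit_at = [&](std::size_t i) -> std::uint32_t {
        unsigned char lo = bytes[big_endian ? i + 1 : i];
        unsigned char hi = bytes[big_endian ? i : i + 1];
        return (static_cast<std::uint32_t>(hi) << 8) | lo;
    };
    std::string out;
    while (pos + 1 < bytes.size()) {
        std::uint32_t cp = unit_at(pos);
        pos += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            std::uint32_t low = (pos + 1 < bytes.size()) ? unit_at(pos) : 0;
            if (low >= 0xDC00 && low <= 0xDFFF) {      // valid pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                pos += 2;
            } else {
                cp = 0xFFFD;                           // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                               // stray low surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The resulting string can be split into lines and tokens with ordinary `char`-based code. An exporter for saving would perform the reverse mapping and write a BOM first if the consumer expects one.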

