How should I use g++'s -finput-charset compiler option correctly in order to compile a non-UTF-8 source file?

Problem description

    I'm trying to compile a UTF-16BE C++ source file in g++ with -finput-charset compiler option but I'm always getting a bunch of errors. More details follow.

    My environment (in CentOS Linux):

    • g++: 4.1.2
    • iconv: 2.5
    • Linux language (in Terminal): LANG="en_US.UTF-8"

    My sample source file (stored in UTF-16BE encoding):

    // main.cpp:
    
    #include <iostream>
    
    int main()
    {
        std::cout << "Hello, UTF-16" << std::endl;
        return 0;
    }
    

    My steps:

    • I read the manual of g++ about the -finput-charset option. The g++ manual says:

    -finput-charset=charset Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there’s a conflict. charset can be any encoding supported by the system’s "iconv" library routine.

    • Thus I entered the command as follows:

    g++ -finput-charset=UTF-16BE main.cpp

    and I got these errors:

    In file included from main.cpp:1:

    /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../../include/c++/4.1.2/iostream:1: error: stray ‘\342’ in program

    /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../../include/c++/4.1.2/iostream:1: error: stray ‘\274’ in program

    ...(repeatedly, A LOT, around 4000+)...

    /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../../include/c++/4.1.2/iostream:1: error: stray ‘\257’ in program

    main.cpp: In function ‘int main()’:

    main.cpp:5: error: ‘cout’ is not a member of ‘std’

    main.cpp:5: error: ‘endl’ is not a member of ‘std’

    • The manual text suggests that the charset can be any encoding supported by the 'iconv' routine, so I guessed the compilation errors might be caused by my iconv library. I then tested iconv:

    iconv --from-code=UTF-16BE --to-code=UTF-8 --output=main_utf8.cpp main.cpp

    A "main_utf8.cpp" file is generated as expected. I then tried to compile it:

    g++ -finput-charset=UTF-8 main_utf8.cpp

    Note that I specified the input charset explicitly to see if I had done anything wrong, but this time an "a.out" was generated without any errors. When I ran it, it produced the correct output.

    Finally...

    I couldn't figure out what I did wrong. I searched the web trying to find some examples of this compiler option, but I couldn't find any.

    Please advise! Thanks!

    Further edits:

    Thanks, guys! Your replies are quick! Some updates:

    1. When I said "UTF-16" I meant "UTF-16 + BOM". In fact I used UTF-16BE. I have updated the text above.
    2. Some answers say the errors are caused by the non-UTF-16 header files. Here are my thoughts if this is the case: We'll always include some standard header files when writing a C/C++ project, right? Such as stdio.h or iostream. If the G++ compiler only deals with the encoding of the source files created by us but never with the source files in the standard library, then what does this -finput-charset option exist for??

    Final edit:

    At last, my solution is like this:

    1. At the beginning, I changed the encoding of my source files to GB2312, as "Mr Lister" said below. This worked fine for a while, but later I found it unsuitable for my situation because most of the other parts of the system still use UTF-8 for communication and interfaces, so I had to convert the encoding in many places... Not only was this extra work, it could also cause some performance decrease in my program.
    2. Later I tried to convert all my source files to UTF-8 + BOM. This way, Visual Studio on Windows could compile them happily, but GCC on Linux would complain. So I wrote a shell script to remove the BOM, and I run this script before compiling my code with GCC (a rough sketch of this idea appears after this list).
    3. Luckily, I don't have to build the code in Linux manually, because TeamCity, the continuous integration tool used in my project, generates the build automatically. I could change the build steps in TeamCity so that this script runs before the daily build starts.
    4. With this UTF-8 + BOM + script method, I decided not to edit my source code in Linux, because if I wanted to do so, I would have to make sure my code built successfully before committing it, which means I would have to run the BOM-removal script before building, which means SVN would report that EVERY file was modified (BOM removed), making it very easy to mistakenly commit a wrong file. To solve this problem, I wrote another shell script to add the BOM back to the source files. I still don't edit my code very often in Linux, but when I really need to, I don't have to face the terribly long change list in the commit dialog.
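
    The BOM-handling scripts mentioned in steps 2 and 4 are not shown in the question. As a rough, hypothetical sketch of the same idea (the asker actually used shell scripts; the program name and behaviour here are illustrative only, not the asker's code):

    // strip_bom.cpp -- illustrative only: rewrites a file in place without a
    // leading UTF-8 BOM (the bytes EF BB BF), mirroring what the BOM-removal
    // script does; adding the BOM back would be the reverse operation.
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>

    int main(int argc, char* argv[])
    {
        if (argc != 2) {
            std::cerr << "usage: strip_bom <file>" << std::endl;
            return 1;
        }

        // Read the whole file as raw bytes.
        std::ifstream in(argv[1], std::ios::binary);
        std::string data((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
        in.close();

        // Drop the three-byte UTF-8 BOM if it is present.
        if (data.size() >= 3 &&
            static_cast<unsigned char>(data[0]) == 0xEF &&
            static_cast<unsigned char>(data[1]) == 0xBB &&
            static_cast<unsigned char>(data[2]) == 0xBF)
        {
            std::ofstream out(argv[1], std::ios::binary | std::ios::trunc);
            out.write(data.data() + 3,
                      static_cast<std::streamsize>(data.size() - 3));
        }
        return 0;
    }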

    Solution

    Encoding Blues

    You cannot use UTF-16 for source code files, because the header you are including, <iostream>, is not UTF-16-encoded. As #include includes the files verbatim, this means that you suddenly have a UTF-16-encoded file with a large chunk (approximately 4k, apparently) of invalid data.

    There is almost no good reason to ever use UTF-16 for anything, so this is just as well.

    Edit: Regarding problems with encoding support: the OSes themselves are not responsible for providing encoding support; this comes down to the compilers used.

    g++ on Windows supports absolutely all of the same encodings as g++ on Linux, because it's the same program, unless whatever version of g++ you are using on Windows relies on a deeply broken iconv library.

    Inspect your toolchain and ensure that all your tools are in working order.

    As an alternative: don't use Chinese in the source files, but write them in English, using English-language literals or simple TOKEN_STYLE_PLACEHOLDERs, and use l10n and i18n to replace these in the running executable.

    Threedit: -finput-charset is almost certainly a holdover from the days of codepages and other nonsense of the kind; however, an ISO-8859-n file will almost always be compatible with UTF-8 standard headers (but see the reedit below).
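
    As a concrete (hypothetical) illustration of that point, not part of the original answer: with a single-byte encoding such as ISO-8859-1, the standard headers are plain ASCII as far as the preprocessor is concerned, so only the literals in your own file need re-mapping, and -finput-charset does what you expect. For example:

    // latin1.cpp -- assume this file is saved in ISO-8859-1, so the "é" below
    // is the single byte 0xE9 on disk.
    //
    // A possible invocation (both flags are real GCC options):
    //   g++ -finput-charset=ISO-8859-1 -fexec-charset=UTF-8 latin1.cpp
    #include <iostream>

    int main()
    {
        std::cout << "café" << std::endl;  // re-encoded to UTF-8 in the binary
        return 0;
    }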

    Reedit: For next time; remember a simple mantra: "N'DUUH!"; "Never Don't Use UTF-8!"


    I18N

    A common solution to this kind of problem is to remove the problem entirely, by way of, for instance, gettext.

    When using gettext, you usually end up with a function loc(char *) that abstracts away most of the translation tool specific code. So, instead of

    #include <iostream>
    
    int main () {
      std::cout << "瓜田李下" << std::endl;
    }
    

    you would have

    #include <iostream>
    
    #include "translation.h"
    
    int main () {
      std::cout << loc("DEEPER_MEANING") << std::endl;
    }
    

    and, in zh.po:

    msgid "DEEPER_MEANING"
    msgstr "瓜田李下"
    

    Of course, you could also then have a en.po:

    msgid "DEEPER_MEANING"
    msgstr "Still waters run deep"
    

    This can be expanded upon, and the gettext package has tools for expansion of strings with variables and such, or you could use printf, to account for different grammars.
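
    The translation.h / loc() pair used above is not spelled out in the answer; a minimal sketch, assuming GNU gettext (libintl) as the backing tool, might look like this (init_translation() and its arguments are illustrative assumptions, not part of the original answer):

    // translation.h -- hypothetical sketch of the loc() helper, implemented as
    // a thin wrapper around GNU gettext.
    #ifndef TRANSLATION_H
    #define TRANSLATION_H

    #include <libintl.h>
    #include <clocale>

    // Call once at startup: pick up the user's locale and tell gettext which
    // text domain to use and where the compiled .mo catalogues live.
    inline void init_translation(const char* domain, const char* locale_dir)
    {
        std::setlocale(LC_ALL, "");
        bindtextdomain(domain, locale_dir);
        textdomain(domain);
    }

    // loc() simply forwards to gettext(); msgids such as "DEEPER_MEANING" are
    // looked up in the catalogue compiled from zh.po / en.po.
    inline const char* loc(const char* msgid)
    {
        return gettext(msgid);
    }

    #endif // TRANSLATION_H

    A call like init_translation("myapp", "/usr/share/locale") near the top of main() (the domain name and path are placeholders) would then select the catalogue matching the current locale, and loc("DEEPER_MEANING") behaves as in the example above.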


    The Third Option

    Instead of having to deal with multiple compilers with different requirements for file encodings, file endings, byte order marks, and other problems of the kind; it is possible to cross-compile using MinGW or similar tools.

    This option requires some setup, but may very well reduce future overhead and headaches.

