Coding a path in Unicode C++

Problem description

I have a problem opening files whose paths contain non-ASCII characters (such as Cyrillic or accented Latin letters). I found a way to solve it with _wfopen, but it only worked when I encoded the characters by hand as Unicode escapes (\Uxxxx).

Is there a function, macro, or anything else that takes the string (path) and returns the Unicode form?

Something like this: https://www.branah.com/unicode-converter

I tried MultiByteToWideChar, but it returns some hex numbers that are not relevant.

Tried:

std::wstring s2ws(const std::string& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
    wchar_t* buf = new wchar_t[len];
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
    std::wstring r(buf);
    delete[] buf;
    return r;
}
std::wstring stemp = s2ws(x);
LPCWSTR result = stemp.c_str();

The result I get: 0055F7E8

Thank you in advance

Update:

I installed Boost, and now I am trying to do it with Boost. Can someone help me out with Boost?

So I have a path: wchar_t path[100] = _T("čaćšžđ\\test.txt");

I need it converted to:

wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");

Solution

Here's a way to convert between UTF-8 and UTF-16 on Windows, as well as showing the real values of the stored code units for both input and output:

#include <codecvt>
#include <iostream>
#include <iomanip>
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;

    std::string s = "test";

    std::cout << std::hex << std::setfill('0');
    std::cout << "Input `char` data: ";
    for (char c : s) {
      std::cout << std::setw(2) << static_cast<unsigned>(static_cast<unsigned char>(c)) << ' ';
    }
    std::cout << '\n';

    std::wstring ws = convert.from_bytes(s);

    std::cout << "Output `wchar_t` data: ";
    for (wchar_t wc : ws) {
      std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';
}

Understanding the real values of the input and output is important because otherwise you may not correctly understand the transformation that you really need. For example it looks to me like there may be some confusion as to how VC++ deals with encodings, and what \Uxxxxxxxx and \uxxxx actually do in C++ source code (e.g., they don't necessarily produce UTF-8 data).

Try using code like that shown above to see what your input data really is.
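If you end up doing this kind of inspection in several places, the dump loop can be pulled out into a helper. The following is only a sketch: the name hex_units is made up for illustration and is not part of any library. It prints each stored code unit in hex, padded to the width of the character type, the same way the program above does.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical helper (name chosen for illustration): returns the stored
// code units of any std::basic_string as space-separated, zero-padded hex.
template <typename CharT>
std::string hex_units(const std::basic_string<CharT>& s) {
    // Go through the unsigned counterpart so a negative `char` value
    // is not sign-extended when widened.
    using Unsigned = typename std::make_unsigned<CharT>::type;
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    bool first = true;
    for (CharT c : s) {
        if (!first) out << ' ';
        first = false;
        // Two hex digits per byte of the character type.
        out << std::setw(static_cast<int>(2 * sizeof(CharT)))
            << static_cast<unsigned long>(static_cast<Unsigned>(c));
    }
    return out.str();
}
```

Calling hex_units(std::string("\xC4\x87")) gives "c4 87", and calling it on a std::u16string or std::wstring shows the wide code units instead.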


To emphasize what I've written above: there are strong indications that you may not correctly understand the processing that's being done on your input, and you need to check it thoroughly.

The above program does correctly transform the UTF-8 representation of ć (U+0107) into the single 16-bit code unit 0x0107, if you replace the test string with the following:

std::string s = "\xC4\x87"; // UTF-8 representation of U+0107

The output of the program, on Windows using Visual Studio, is then:

Input char data: c4 87
Output wchar_t data: 0107

This is in contrast to if you use test strings such as:

std::string s = "ć";

Or

std::string s = "\u0107";

Which may result in the following output:

Input char data: 3f
Output wchar_t data: 003f

The problem here is that Visual Studio does not use UTF-8 as the encoding for strings without some trickery, so your request to convert from UTF-8 probably isn't what you actually need; or you do need conversion from UTF-8, but you're testing potential conversion routines using input that differs from your real input.
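For the case where your real input does turn out to be UTF-8 bytes, here is a minimal, portable sketch of what the conversion does under the hood. The function name utf8_to_utf16 is my own and purely illustrative; it uses char16_t rather than wchar_t so that it behaves the same on any platform, and it skips some validation (for example, it does not reject overlong encodings). On Windows the platform route would be MultiByteToWideChar with CP_UTF8; the attempt in the question passes CP_ACP, which interprets the bytes in the system ANSI code page rather than as UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from any library): decode UTF-8 bytes into
// UTF-16 code units. Simplified: overlong encodings are not rejected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);  // append the 6 payload bits
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("code point outside valid range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // one code unit
        } else {
            cp -= 0x10000;                             // surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With this sketch, utf8_to_utf16("\xC4\x87") yields the single code unit 0x0107, matching the program output shown above, while a string that already contains 3f (a literal question mark) stays 003f because the damage happened before the conversion ran.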


So I have a path: wchar_t path[100] = _T("čaćšžđ\\test.txt");

I need it converted to:

wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");

Okay, so if I understand correctly, your actual problem is that the following fails:

wchar_t path[100] = _T("čaćšžđ\\test.txt");
FILE *f = _wfopen(path, L"w");

But if you instead write the string like:

wchar_t path[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");

Then the _wfopen call succeeds and opens the file you want.

First of all, this has absolutely nothing to do with UTF-8. I assume you found some workaround using a char string and converting that to wchar_t and you somehow interpreted this as involving UTF-8, or something.

What encoding are you saving the source code with? Is the string L"čaćšžđ\\test.txt" actually being saved properly? Try closing the source file and reopening it. If some characters show up replaced by ?, then part of your problem is the source file encoding. In particular this is true of the default encoding used by Windows in most of North America and Western Europe: "Western European (Windows) - Codepage 1252".

You can also check the output of the following program:

#include <iomanip>
#include <iostream>

int main() {
    wchar_t path[16] = L"čaćšžđ\\test.txt";

    std::cout << std::hex << std::setfill('0');
    for (wchar_t wc : path) {
        std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';
    wchar_t s[16] = L"\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt";

    for (wchar_t wc : s) {
        std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';
}

Another thing you need to understand is that the \uxxxx form of writing characters, called Universal Character Names or UCNs, is not a form that you can convert strings to and from in C++. By the time you've compiled the program and it's running, i.e. by the time any code you write could be attempting to produce strings containing \uxxxx, the time when UCNs are interpreted by the compiler as different characters is long past. The only UCNs that will work are ones that are written directly in the source file.
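A small sketch of that difference, using char16_t literals so the result does not depend on the platform's wchar_t size. The two function names are invented for this illustration.

```cpp
#include <cassert>
#include <string>

// A UCN written directly in the source: the compiler replaces it with
// U+0107, so the resulting string holds exactly one UTF-16 code unit.
std::u16string from_literal() {
    return u"\u0107";
}

// The same six characters assembled while the program runs stay six
// separate characters: a backslash, 'u', '0', '1', '0', '7'.
// Nothing at run time goes back and interprets them as a UCN.
std::u16string built_at_runtime() {
    std::u16string s = u"\\";  // a single backslash character
    s += u"u0107";
    return s;
}
```

Here from_literal() has size 1 and contains 0x0107, whereas built_at_runtime() has size 6 and would name a file literally containing a backslash followed by u0107.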


Also, you're using _T() incorrectly. IMO you shouldn't be using TCHAR and the related macros at all, but if you do use them then you ought to use them consistently: don't mix TCHAR APIs with explicit use of the *W APIs or wchar_t. The whole point of TCHAR is to let the same code switch between the wchar_t APIs and Microsoft's "ANSI" APIs, so using TCHAR and then hard-coding an assumption that TCHAR is wchar_t defeats the entire purpose.

You should just write:

wchar_t path[100] = L"čaćšžđ\\test.txt";
