MarshalAs(UnmanagedType.LPStr) - how does this convert utf-8 strings to char*


Problem description

The question title is basically what I'd like to ask:

[MarshalAs(UnmanagedType.LPStr)] - how does this convert utf-8 strings to char* ?

I use the above line when I attempt to communicate between c# and c++ dlls; more specifically, between:

somefunction(char *string) [c++ dll]

somefunction([MarshalAs(UnmanagedType.LPStr)] string text) [c#]

When I send my utf-8 text (scintilla.Text) through c# and into my c++ dll, I'm shown in my VS 10 debugger that:

  1. the c# string was successfully converted to char*

  2. the resulting char * properly reflects the corresponding utf-8 chars (including the bit in Korean) in the watch window.

Here's a screenshot (with more details):

As you can see, initialScriptText[0] returns the single byte(char): 'B' and the contents of char * initialScriptText are displayed properly (including Korean) in the VS watch window.

Going through the char pointer, it seems that English is saved as one byte per char, while Korean seems to be saved as two bytes per char. (the Korean word in the screenshot is 3 letters, hence saved in 6 bytes)

This seems to show that each 'letter' isn't saved in equal size containers, but differs depending on language. (possible hint on type?)
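
One way to check the "two bytes per Korean character" observation against utf-8 itself is to encode a Hangul syllable by hand. The sketch below is illustrative only: encode_utf8 is a hypothetical helper, not something from the original code. It follows the standard utf-8 bit layout, under which every Hangul syllable (U+AC00 to U+D7A3) takes three bytes, while 'B' takes one.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper for illustration: encode one Unicode code point
// as utf-8, which uses 1 to 4 bytes depending on the code point's value.
std::string encode_utf8(std::uint32_t cp) {
    // Precondition: a valid scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx (ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Three Korean characters would therefore occupy 9 bytes in utf-8, not 6. A 6-byte result is what a double-byte code page such as 949 would produce.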

I'm trying to achieve the same result in pure c++: reading in utf-8 files and saving the result as char *.

Here's an example of my attempt to read a utf-8 file and convert to char * in c++:

observations:

  1. a visual loss (the text no longer displays correctly) when converting from wchar_t* to char*
  2. since result,s8 displays the string properly, I know I've successfully converted the utf-8 file content from wchar_t* to char *
  3. 'result' retains the bytes I've taken directly from the file, yet it differs from what I got through c# (using the same file), so I've concluded that the c# marshal puts the text through some other procedure that further mutates it on the way to char *.
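
For the pure c++ side, a utf-8 file is already a sequence of char-sized bytes, so reading it in binary mode yields a utf-8 char buffer with no wchar_t step in between. The following is a minimal sketch, assuming the file really is utf-8 and may start with a byte order mark; read_utf8_file is a hypothetical name, not a function from the utf8 header mentioned in this post.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: return the file's bytes unchanged, minus an
// optional leading utf-8 byte order mark (EF BB BF).
std::string read_utf8_file(const std::string& path) {
    assert(!path.empty());  // precondition: caller supplies a path
    std::ifstream in(path, std::ios::binary);  // binary: no newline translation
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    // Some editors write a BOM at the start; it is not part of the text.
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        bytes.erase(0, 3);
    }
    return bytes;
}
```

Calling c_str() on the result gives a const char * to the utf-8 bytes. Those bytes will match what the c# side sends only if that side also produces utf-8.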

(the screenshot also shows my terrible failure in using wcstombs)

note: I'm using the utf8 header from (http://utfcpp.sourceforge.net/)

Please correct me on any mistakes in my code/observations.

I'd like to be able to mimic the result I'm getting through the c# marshal and I've realised after going through all this that I'm completely stuck. Any ideas?

Solution

[MarshalAs(UnmanagedType.LPStr)] - how does this convert utf-8 strings to char* ?

It doesn't. There is no such thing as a "utf-8 string" in managed code; strings are always encoded in utf-16. Marshaling to and from an LPStr uses the default system code page, which makes it fairly remarkable that you see Korean glyphs in the debugger, unless your system code page is 949 (Korean). That would also explain the two bytes per Korean character you observed: code page 949 encodes Hangul in two bytes, whereas utf-8 needs three.

If interop with utf-8 is a hard requirement, then you need to use a byte[] in the pinvoke declaration and convert back and forth yourself with System.Text.Encoding.UTF8: use its GetBytes() method to convert a string to byte[], and its GetString() method to convert the byte[] back to a string. Note that GetBytes() does not append a terminating zero, so add one yourself if the native code expects a C string. Avoid all this if possible by using wchar_t[] in the native code.
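
On the native side, a byte[] filled by Encoding.UTF8.GetBytes() arrives as plain utf-8 bytes. If the c++ code would rather work in utf-16 (what wchar_t holds on Windows), it can decode them itself. The following is a minimal sketch, assuming well-formed input; utf8_to_utf16 is a hypothetical helper, and it does not reject overlong or otherwise invalid sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode utf-8 bytes into utf-16 code units.
// Assumes well-formed input; only truncation is reported as an error.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        if (i + extra >= in.size()) {
            throw std::runtime_error("truncated utf-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            assert((c & 0xC0) == 0x80);  // continuation bytes look like 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Above U+FFFF: split into a high and a low surrogate.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

A native function taking the buffer plus its length could construct a std::string from the two and pass it to this helper. For production use, a tested library such as the utf8 header mentioned in the question is the safer choice.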
