Accessing MATLAB's unicode strings from C

Posted: 2017/8/16 22:03:26 · Tags: matlab, unicode, encoding, mex, matlab-engine


Problem description

 
How can I access the underlying unicode data of MATLAB strings through the MATLAB Engine or MEX C interfaces?

Here's an example. Let's put unicode characters in a UTF-8 encoded file test.txt, then read it as
fid=fopen('test.txt','r','l','UTF-8');
s=fscanf(fid, '%s')
in MATLAB.

Now if I first call feature('DefaultCharacterSet', 'UTF-8') and then run engEvalString(ep, "s") from C, the output I get back is the text from the file as UTF-8. This shows that MATLAB stores it as unicode internally. However, if I call mxArrayToString(engGetVariable(ep, "s")), I get what unicode2native(s, 'Latin-1') would give me in MATLAB: all non-Latin-1 characters are replaced by character code 26. What I need is access to the underlying unicode data as a C string in any unicode format (UTF-8, UTF-16, etc.), with the non-Latin-1 characters preserved. Is this possible?

My platform is OS X, MATLAB R2012b.

Addendum:  The documentation explicitly states that "[mxArrayToString()] supports multibyte encoded characters", yet it still gives me only a Latin-1 approximation to the original data.
Solution
First, let me share a few references I found online:


According to mxChar description,

  MATLAB stores characters as 2-byte Unicode characters on machines with
  multi-byte character sets
The term MBCS is still somewhat ambiguous to me. I think they meant UTF-16 in this context (although I'm not sure about surrogate pairs; if those are not handled, it is effectively UCS-2 instead).

UPDATE: MathWorks changed the wording to:

  MATLAB uses 16-bit unsigned integer character encoding for Unicode characters.

The mxArrayToString page states that it does handle multibyte encoded characters (unlike mxGetString, which only handles single-byte encoding schemes). Unfortunately, there is no example of how to do this.
Finally, here is a thread on the MATLAB newsgroup which mentions a couple of undocumented functions that are related to this (you can find those yourself by loading the libmx.dll library into a tool like Dependency Walker on Windows).
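
Since mxChar is described above as a 16-bit code unit, one route that avoids undocumented functions is to read the code units yourself (for example with the documented mxGetChars and mxGetNumberOfElements) and convert them to UTF-8 by hand. The following is only a sketch of that conversion step in plain C: the function name utf16_to_utf8 is hypothetical, it assumes the data really is UTF-16 (surrogate pairs are combined, unpaired surrogates become U+FFFD), and it leaves the MEX/Engine plumbing out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n UTF-16 code units to NUL-terminated UTF-8.
 * dst must hold at least 3*n + 1 bytes (worst case is 3 bytes per code unit).
 * Unpaired surrogates are replaced by U+FFFD.
 * Returns the number of bytes written, not counting the terminator. */
size_t utf16_to_utf8(const uint16_t *src, size_t n, unsigned char *dst)
{
    size_t i = 0, o = 0;
    assert(src != NULL && dst != NULL);
    while (i < n) {
        uint32_t cp = src[i++];
        if (cp >= 0xD800 && cp <= 0xDBFF && i < n &&
            src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
            /* high surrogate followed by low surrogate: combine them */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(src[i++] - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  /* lone surrogate */
        }
        if (cp < 0x80) {
            dst[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            dst[o++] = (unsigned char)(0xC0 | (cp >> 6));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            dst[o++] = (unsigned char)(0xE0 | (cp >> 12));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            dst[o++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    dst[o] = 0;
    return o;
}
```

In a MEX file the src pointer would come from the char array's data and n from its element count; whether MATLAB ever stores surrogate pairs is exactly the open question raised above, so treat that branch as a precaution.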




Here's a small experiment I did in MEX:

my_func.c

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    char str_ascii[] = {0x41, 0x6D, 0x72, 0x6F, 0x00};   // {'A','m','r','o',0}
    char str_utf8[] = {
        0x41,                   // U+0041
        0xC3, 0x80,             // U+00C0
        0xE6, 0xB0, 0xB4,       // U+6C34
        0x00
    };
    char str_utf16_le[] = {
        0x41, 0x00,             // U+0041
        0xC0, 0x00,             // U+00C0
        0x34, 0x6C,             // U+6C34
        0x00, 0x00
    };

    plhs[0] = mxCreateString(str_ascii);
    plhs[1] = mxCreateString_UTF8(str_utf8);        // undocumented!
    plhs[2] = mxCreateString_UTF16(str_utf16_le);   // undocumented!
}
I create three strings in C code encoded with ASCII, UTF-8, and UTF-16LE respectively. I then pass them to MATLAB using the mxCreateString MEX function (and other undocumented versions of it).

I got the byte sequences by consulting Fileformat.info website:
A (U+0041), À (U+00C0), and 水 (U+6C34).

Let's test the above function inside MATLAB:
%# call the MEX function
[str_ascii, str_utf8, str_utf16_le] = my_func()

%# MATLAB exposes the two strings in a decoded form (Unicode code points)
double(str_utf8)       %# decimal form: [65, 192, 27700]
assert(isequal(str_utf8, str_utf16_le))

%# convert them to bytes (in HEX)
b1 = unicode2native(str_utf8, 'UTF-8')
b2 = unicode2native(str_utf16_le, 'UTF-16')
cellstr(dec2hex(b1))'  %# {'41','C3','80','E6','B0','B4'}
cellstr(dec2hex(b2))'  %# {'FF','FE','41','00','C0','00','34','6C'}
                       %# (note that first two bytes are BOM markers)

%# show string
view_unicode_string(str_utf8)
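
The BOM in the UTF-16 output above matters if those bytes are ever handed to C code: the first two bytes say which byte order follows. As a rough sketch (the function name utf16_bytes_to_units is made up for illustration, and falling back to little-endian when no BOM is present is an assumption, not something MATLAB guarantees), the bytes could be turned back into 16-bit code units like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read UTF-16 code units from a byte buffer such as the
 * one produced by unicode2native(str, 'UTF-16').
 * A leading BOM (FF FE = little-endian, FE FF = big-endian) is consumed and
 * decides the byte order. units must hold at least nbytes/2 entries.
 * Returns the number of code units written; a trailing odd byte is ignored. */
size_t utf16_bytes_to_units(const unsigned char *bytes, size_t nbytes,
                            uint16_t *units)
{
    size_t i = 0, count = 0;
    int big_endian = 0;  /* assumption: little-endian when no BOM is present */
    assert(bytes != NULL && units != NULL);
    if (nbytes >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            i = 2;
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            big_endian = 1;
            i = 2;
        }
    }
    for (; i + 1 < nbytes; i += 2) {
        units[count++] = big_endian
            ? (uint16_t)((bytes[i] << 8) | bytes[i + 1])
            : (uint16_t)((bytes[i + 1] << 8) | bytes[i]);
    }
    return count;
}
```

Fed the eight bytes shown for b2, this yields the three code units 0x0041, 0x00C0 and 0x6C34, matching the decoded form MATLAB displays.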


I am making use of the embedded Java capability to view the strings:
function view_unicode_string(str)
    %# create Swing JLabel
    jlabel = javaObjectEDT('javax.swing.JLabel', str);
    font = java.awt.Font('Arial Unicode MS', java.awt.Font.PLAIN, 72);
    jlabel.setFont(font);
    jlabel.setHorizontalAlignment(javax.swing.SwingConstants.CENTER);

    %# place Java component inside a MATLAB figure
    hfig = figure('Menubar','none');
    [~,jlabelHG] = javacomponent(jlabel, [], hfig);
    set(jlabelHG, 'Units','normalized', 'Position',[0 0 1 1])
end




Now let's work in the reverse direction (accepting a string from MATLAB into C):

my_func_reverse.c

#include <string.h>   /* strlen */
#include "mex.h"

void print_hex(const unsigned char* s, size_t len)
{
    size_t i;
    for(i=0; i<len; ++i) {
        mexPrintf("0x%02X ", s[i] & 0xFF);
    }
    mexPrintf("0x00\n");
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    char *str;
    if (nrhs<1 || !mxIsChar(prhs[0])) {
        mexErrMsgIdAndTxt("mex:error", "Expecting a string");
    }
    str = mxArrayToString_UTF8(prhs[0]); // get UTF-8 encoded string from Unicode
    print_hex((const unsigned char*)str, strlen(str)); // print bytes
    plhs[0] = mxCreateString_UTF8(str);  // create Unicode string from UTF-8
    mxFree(str);
}
And we test this from inside MATLAB:
>> s = char(hex2dec(['0041';'00C0';'6C34'])');   %# "\u0041\u00C0\u6C34"
>> ss = my_func_reverse(s);
0x41 0xC3 0x80 0xE6 0xB0 0xB4 0x00               %# UTF-8 encoding
>> assert(isequal(s,ss))




Finally, I should say that if for some reason you are still having problems,
the easiest thing would be to convert the non-ASCII strings to the uint8 datatype
before passing them from MATLAB to your engine program.

So inside the MATLAB process do:
%# read contents of a UTF-8 file
fid = fopen('test.txt', 'rb', 'native', 'UTF-8');
str = fread(fid, '*char')';
fclose(fid);
str_bytes = unicode2native(str,'UTF-8');  %# convert to bytes

%# or simply read the file contents as bytes to begin with
%fid = fopen('test.txt', 'rb');
%str_bytes = fread(fid, '*uint8')';
%fclose(fid);
and access the variable using the Engine API as:
mxArray *arr = engGetVariable(ep, "str_bytes");
uint8_T *bytes = (uint8_T*) mxGetData(arr);
// now you decode this utf-8 string on your end ...
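
For the "decode on your end" step, something along the following lines might do. It is a minimal sketch rather than a full validator: the name utf8_decode is hypothetical, the byte count n is assumed to come from the array's element count, malformed or truncated sequences are replaced by U+FFFD, and overlong encodings are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode n UTF-8 bytes into Unicode code points.
 * out must hold at least n entries (one per byte is the worst case).
 * A malformed or truncated sequence yields U+FFFD.
 * Returns the number of code points written. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t i = 0, count = 0;
    assert(s != NULL && out != NULL);
    while (i < n) {
        unsigned char b = s[i++];
        uint32_t cp;
        int extra;  /* number of continuation bytes expected */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }
        while (extra > 0) {
            if (i >= n || (s[i] & 0xC0) != 0x80) {
                cp = 0xFFFD;  /* missing or invalid continuation byte */
                break;
            }
            cp = (cp << 6) | (uint32_t)(s[i++] & 0x3F);
            extra--;
        }
        out[count++] = cp;
    }
    return count;
}
```

With the bytes pointer from the snippet above and the element count of arr as n, the test string from earlier (41 C3 80 E6 B0 B4) decodes to the code points U+0041, U+00C0 and U+6C34.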




All tests were done on WinXP running R2012b with the default charset:
>> feature('DefaultCharacterSet')
ans =
windows-1252
Hope this helps..



EDIT:

In MATLAB R2014a, many undocumented C functions (including the ones used above) were removed from the libmx library and replaced with equivalent C++ functions exposed under the namespace matrix::detail::noninlined::mx_array_api.

It should be easy to adjust the examples above (as explained here) to run on the latest R2014a version.
