Accessing MATLAB's Unicode strings from C

Question

How can I access the underlying unicode data of MATLAB strings through the MATLAB Engine or MEX C interfaces?

Here's an example. Let's put unicode characters in a UTF-8 encoded file test.txt, then read it as

fid=fopen('test.txt','r','l','UTF-8');
s=fscanf(fid, '%s')

in MATLAB.

Now if I first do feature('DefaultCharacterSet', 'UTF-8'), then from C engEvalString(ep, "s"), then as output I get back the text from the file as UTF-8. This proves that MATLAB stores it as unicode internally. However if I do mxArrayToString(engGetVariable(ep, "s")), I get what unicode2native(s, 'Latin-1') would give me in MATLAB: all non-Latin-1 characters replaced by character code 26. What I need is getting access to the underlying unicode data as a C string in any unicode format (UTF-8, UTF-16, etc.), and preserving the non-Latin-1 characters. Is this possible?

My platform is OS X, MATLAB R2012b.

Addendum: The documentation explicitly states that "[mxArrayToString()] supports multibyte encoded characters", yet it still gives me only a Latin-1 approximation to the original data.

Recommended answer

First, let me share a few references I found online:

MATLAB stores characters as 2-byte Unicode characters on machines with multi-byte character sets

Still the term MBCS is somewhat ambiguous to me, I think they meant UTF-16 in this context (although I'm not sure about surrogate pairs, which probably makes it UCS-2 instead).

UPDATE: MathWorks changed the wording to:

MATLAB uses 16-bit unsigned integer character encoding for Unicode characters.
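
If the character data really is an array of 16-bit code units, as the quoted documentation says, then one documented route is to read those units directly and convert them yourself. The following is a minimal sketch in plain C of a UTF-16 to UTF-8 conversion. The function names `encode_utf8` and `utf16_to_utf8` are hypothetical helpers, not part of any MATLAB API. In a MEX file or engine program the input would presumably come from the documented `mxGetChars` and `mxGetNumberOfElements` calls, on the assumption that `mxChar` is a 16-bit unsigned type. Surrogate pairs are combined when present, so the sketch works whether the data is UTF-16 or plain UCS-2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Returns the number of bytes written (1 to 4). */
static size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Convert n UTF-16 code units to a NUL-terminated UTF-8 string.
   out must have room for at least 3*n + 1 bytes.
   Returns the number of bytes written, not counting the NUL. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t i = 0, len = 0;
    assert(in != NULL && out != NULL);
    while (i < n) {
        uint32_t cp = in[i++];
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            i < n && in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
            /* high surrogate followed by low surrogate: combine them */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i++] - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  /* unpaired surrogate: substitute U+FFFD */
        }
        len += encode_utf8(cp, out + len);
    }
    out[len] = 0;
    return len;
}
```

Inside a MEX function the call might look like `utf16_to_utf8((const uint16_t *)mxGetChars(prhs[0]), mxGetNumberOfElements(prhs[0]), buf)`, with `buf` allocated by the caller.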

  • The mxArrayToString page states that it does handle multibyte encoded characters (unlike mxGetString, which only handles single-byte encoding schemes). Unfortunately, it gives no example of how to do this.

    Finally, here is a thread on the MATLAB newsgroup which mentions a couple of undocumented functions related to this (you can find them yourself by loading the libmx.dll library into a tool like Dependency Walker on Windows).

    Here's a small experiment I did in MEX:

    #include "mex.h"
    
    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        char str_ascii[] = {0x41, 0x6D, 0x72, 0x6F, 0x00};   // {'A','m','r','o',0}
        char str_utf8[] = {
            0x41,                   // U+0041
            0xC3, 0x80,             // U+00C0
            0xE6, 0xB0, 0xB4,       // U+6C34
            0x00
        };
        char str_utf16_le[] = {
            0x41, 0x00,             // U+0041
            0xC0, 0x00,             // U+00C0
            0x34, 0x6C,             // U+6C34
            0x00, 0x00
        };
    
        plhs[0] = mxCreateString(str_ascii);
        plhs[1] = mxCreateString_UTF8(str_utf8);        // undocumented!
        plhs[2] = mxCreateString_UTF16(str_utf16_le);   // undocumented!
    }
    

    I create three strings in C code encoded with ASCII, UTF-8, and UTF-16LE respectively. I then pass them to MATLAB using the mxCreateString MEX function (and other undocumented versions of it).

    I got the byte sequences by consulting the Fileformat.info website: A (U+0041), À (U+00C0), and 水 (U+6C34).

    Let's test the above function inside MATLAB:

    %# call the MEX function
    [str_ascii, str_utf8, str_utf16_le] = my_func()
    
    %# MATLAB exposes the two strings in a decoded form (Unicode code points)
    double(str_utf8)       %# decimal form: [65, 192, 27700]
    assert(isequal(str_utf8, str_utf16_le))
    
    %# convert them to bytes (in HEX)
    b1 = unicode2native(str_utf8, 'UTF-8')
    b2 = unicode2native(str_utf16_le, 'UTF-16')
    cellstr(dec2hex(b1))'  %# {'41','C3','80','E6','B0','B4'}
    cellstr(dec2hex(b2))'  %# {'FF','FE','41','00','C0','00','34','6C'}
                           %# (note that first two bytes are BOM markers)
    
    %# show string
    view_unicode_string(str_utf8)
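
The byte order mark seen in the UTF-16 output above matters once those bytes reach C, because it tells the receiver which byte order to use. Below is a minimal sketch, in plain C, of turning such a byte sequence back into 16-bit code units. The function name `utf16_bytes_to_units` is a hypothetical helper, and the choice to treat input without a BOM as little-endian is an assumption made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert raw UTF-16 bytes, optionally starting with a BOM, into code units.
   FF FE selects little-endian, FE FF selects big-endian.
   Without a BOM, little-endian is assumed (an assumption of this sketch).
   out must have room for nbytes / 2 entries.
   Returns the number of code units written; the BOM itself is not copied. */
size_t utf16_bytes_to_units(const unsigned char *bytes, size_t nbytes,
                            uint16_t *out)
{
    int big_endian = 0;
    size_t i = 0, n = 0;
    assert(bytes != NULL && out != NULL);
    if (nbytes >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            i = 2;                      /* little-endian BOM */
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            big_endian = 1;             /* big-endian BOM */
            i = 2;
        }
    }
    /* a trailing odd byte, if any, is ignored */
    for (; i + 1 < nbytes; i += 2) {
        if (big_endian) {
            out[n++] = (uint16_t)((bytes[i] << 8) | bytes[i + 1]);
        } else {
            out[n++] = (uint16_t)(bytes[i] | (bytes[i + 1] << 8));
        }
    }
    return n;
}
```

Fed the eight bytes shown above (FF FE 41 00 C0 00 34 6C), this yields the three code units 0x0041, 0x00C0, and 0x6C34.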
    

    I am making use of the embedded Java capability to view the strings:

    function view_unicode_string(str)
        %# create Swing JLabel
        jlabel = javaObjectEDT('javax.swing.JLabel', str);
        font = java.awt.Font('Arial Unicode MS', java.awt.Font.PLAIN, 72);
        jlabel.setFont(font);
        jlabel.setHorizontalAlignment(javax.swing.SwingConstants.CENTER);
    
        %# place Java component inside a MATLAB figure
        hfig = figure('Menubar','none');
        [~,jlabelHG] = javacomponent(jlabel, [], hfig);
        set(jlabelHG, 'Units','normalized', 'Position',[0 0 1 1])
    end
    


    Now let's work in the reverse direction (accepting a string from MATLAB into C):

    #include <string.h>   // strlen
    #include "mex.h"
    
    // print each byte of s in hex, followed by the terminating NUL
    void print_hex(const unsigned char* s, size_t len)
    {
        size_t i;
        for(i=0; i<len; ++i) {
            mexPrintf("0x%02X ", s[i] & 0xFF);
        }
        mexPrintf("0x00\n");
    }
    
    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        char *str;
        if (nrhs<1 || !mxIsChar(prhs[0])) {
            mexErrMsgIdAndTxt("mex:error", "Expecting a string");
        }
        str = mxArrayToString_UTF8(prhs[0]); // get UTF-8 encoded string from Unicode (undocumented!)
        if (str == NULL) {
            mexErrMsgIdAndTxt("mex:error", "Conversion to UTF-8 failed");
        }
        print_hex((const unsigned char*)str, strlen(str));  // print bytes
        plhs[0] = mxCreateString_UTF8(str);  // create Unicode string from UTF-8 (undocumented!)
        mxFree(str);
    }
    

    And we test this from inside MATLAB:

    >> s = char(hex2dec(['0041';'00C0';'6C34'])');   %# "\u0041\u00C0\u6C34"
    >> ss = my_func_reverse(s);
    0x41 0xC3 0x80 0xE6 0xB0 0xB4 0x00               %# UTF-8 encoding
    >> assert(isequal(s,ss))
    


    Finally I should say that if for some reason you are still having problems, the easiest thing would be to convert the non-ASCII strings to uint8 datatype before passing this from MATLAB to your engine program.

    So inside the MATLAB process do:

    %# read contents of a UTF-8 file
    fid = fopen('test.txt', 'rb', 'native', 'UTF-8');
    str = fread(fid, '*char')';
    fclose(fid);
    str_bytes = unicode2native(str,'UTF-8');  %# convert to bytes
    
    %# or simply read the file contents as bytes to begin with
    %fid = fopen('test.txt', 'rb');
    %str_bytes = fread(fid, '*uint8')';
    %fclose(fid);
    

    and access the variable using the Engine API as:

    mxArray *arr = engGetVariable(ep, "str_bytes");
    uint8_T *bytes = (uint8_T*) mxGetData(arr);
    // now you decode this utf-8 string on your end ...
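
The decoding step left open above can be sketched in plain C as follows. Because the uint8 array is not NUL-terminated, the length has to be passed explicitly; in an engine program it would presumably come from the documented `mxGetNumberOfElements(arr)`. The function name `utf8_decode` is a hypothetical helper. It is a minimal decoder: malformed input becomes U+FFFD, but overlong encodings and encoded surrogates are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode n bytes of UTF-8 into Unicode code points.
   out must have room for n entries (each code point uses at least one byte).
   Invalid lead bytes and truncated sequences are replaced by U+FFFD.
   Returns the number of code points written. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t i = 0, count = 0;
    assert(s != NULL && out != NULL);
    while (i < n) {
        unsigned char b = s[i++];
        uint32_t cp;
        size_t extra;  /* number of continuation bytes still expected */
        if (b < 0x80) {
            cp = b;
            extra = 0;
        } else if ((b & 0xE0) == 0xC0) {
            cp = b & 0x1F;
            extra = 1;
        } else if ((b & 0xF0) == 0xE0) {
            cp = b & 0x0F;
            extra = 2;
        } else if ((b & 0xF8) == 0xF0) {
            cp = b & 0x07;
            extra = 3;
        } else {
            /* stray continuation byte or invalid lead byte */
            out[count++] = 0xFFFD;
            continue;
        }
        while (extra > 0 && i < n && (s[i] & 0xC0) == 0x80) {
            cp = (cp << 6) | (uint32_t)(s[i++] & 0x3F);
            extra--;
        }
        out[count++] = (extra == 0) ? cp : 0xFFFD;
    }
    return count;
}
```

With the `bytes` pointer from the snippet above, a call might look like `utf8_decode(bytes, mxGetNumberOfElements(arr), codepoints)`, where `codepoints` is a caller-allocated `uint32_t` buffer.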
    


    All tests were done on WinXP running R2012b with the default charset:

    >> feature('DefaultCharacterSet')
    ans =
    windows-1252
    

    Hope this helps.

    In MATLAB R2014a, many undocumented C functions were removed from libmx library (including the ones used above), and replaced with equivalent C++ functions exposed under the namespace matrix::detail::noninlined::mx_array_api.

    It should be easy to adjust the examples above (as explained here) to run on the latest R2014a version.
