Trying to get libmecab.dll (MeCab) to work with C#

Views: 167 | Posted: 2016/11/19 15:23:22 | Tags: c# unicode character-encoding mecab
Problem description

I'm trying to use the Japanese morphological analyzer MeCab in a C# program (Visual Studio 2010 Express, Windows 7), and something's going wrong with the encoding. If my input (pasted into a textbox) is this:
一方、広義の「ネコ」は、ネコ類（ネコ科動物）の一部、あるいはその全ての獣を指す包括的名称を指す。
Then my output (in another textbox) looks like this:
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
(   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
)   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
?????????????????????????   åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
EOS
I would guess that that's text in some other encoding being mistaken for UTF-8-encoded text. But assuming that it's EUC-JP and using Encoding.Convert to turn it into UTF-8 doesn't change the output; assuming that it's Shift-JIS and doing the same gives different gibberish. Also, while it's definitely processing the text - that's how MeCab output is supposed to be formatted - it doesn't appear to be interpreting the input as UTF-8, either. If it were doing so, there wouldn't be all those identical lines in the output starting with one-character "compounds," which it's clearly unable to identify.
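
For what it's worth, the garbled fields match one specific mix-up: the UTF-8 bytes of サ変接続 (part of an ordinary MeCab part-of-speech string), read one byte per character as Windows-1252, give exactly the `ã‚µå¤‰æŽ¥ç¶š` shown in the output above. So MeCab does appear to be emitting UTF-8, and the damage happens when the result is turned into a .NET string. Here is a minimal sketch that reproduces the effect, assuming code page 1252 stands in for the system ANSI code page; `MojibakeDemo`, `Utf8ReadAs1252` and `Repair` are names made up for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    private static readonly Encoding Cp1252;

    static MojibakeDemo()
    {
#if NETCOREAPP
        // Code page 1252 is not built in on .NET Core / .NET 5+; register the provider first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
#endif
        Cp1252 = Encoding.GetEncoding(1252);
    }

    // Decode UTF-8 bytes as if they were Windows-1252 text: every byte becomes one character.
    public static string Utf8ReadAs1252(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Cp1252.GetString(utf8Bytes);
    }

    // Reverse the mix-up: recover the raw bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        byte[] rawBytes = Cp1252.GetBytes(garbled);
        return Encoding.UTF8.GetString(rawBytes);
    }

    public static void Main()
    {
        string original = "\u30B5\u5909\u63A5\u7D9A"; // the four characters of サ変接続
        string garbled = Utf8ReadAs1252(original);

        // 4 characters occupy 12 bytes in UTF-8, so the garbled text has 12 characters.
        Console.WriteLine(garbled.Length);
        // Check that the round trip restores the original text.
        Console.WriteLine(Repair(garbled) == original);
    }
}
```

Repairing the string after the fact is only a diagnostic. The more robust fix is to keep the marshaller from decoding the bytes as ANSI in the first place.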

I get yet another different-looking set of gibberish when I run the sentence through MeCab's command line. But, again, it's just a row of single question marks and parentheses going down the left, so it's not just the problem that the Windows command line doesn't support fonts with Japanese characters; again, it's just not reading the input in as UTF-8. (I did install MeCab in UTF-8 mode.)

The relevant parts of the code look like this:
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static IntPtr mecab_new2(string arg);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
[return: MarshalAs(UnmanagedType.AnsiBStr)]
private extern static string mecab_sparse_tostr(IntPtr m, string str);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static void mecab_destroy(IntPtr m);

private string meCabParse(string jpnText)
{
    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);

    mecab_destroy(mecab);
    return parsedText;
}
(In terms of fiddling with plausible-looking things to see if they make a difference, I've tried switching "UnmanagedType.AnsiBStr" to "UnmanagedType.BStr," which gives the error "AccessViolationException was unhandled," and adding "CharSet=CharSet.Unicode" to the DllImport parameters, which turned the output into just "EOS".)

This is how I've been doing the conversion:
// 65001 = UTF-8 codepage, 20932 = EUC-JP codepage
private string convertEncoding(string sourceString, int sourceCodepage, int targetCodepage)
{
    Encoding sourceEncoding = Encoding.GetEncoding(sourceCodepage); 
    Encoding targetEncoding = Encoding.GetEncoding(targetCodepage);

    // convert source string into byte array
    byte[] sourceBytes = sourceEncoding.GetBytes(sourceString);

    // convert those bytes into target encoding
    byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);

    // byte array to char array
    char[] targetChars = new char[targetEncoding.GetCharCount(targetBytes, 0, targetBytes.Length)];

    // char array to target-encoded string
    targetEncoding.GetChars(targetBytes, 0, targetBytes.Length, targetChars, 0);
    string targetString = new string(targetChars);

    return targetString;
}

private string meCabParse(string jpnText)
{
    // convert the text from the string from UTF-8 to EUC-JP
    jpnText = convertEncoding(jpnText, 65001, 20932);

    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);

    // annnd convert back to UTF-8
    parsedText = convertEncoding(parsedText, 20932, 65001);

    mecab_destroy(mecab);
    return parsedText;
}
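
A note on why this kind of conversion cannot change anything: a .NET `string` is always UTF-16 in memory, so a function shaped like `convertEncoding` encodes the text to bytes and then decodes those same bytes with the matching encoding, which hands back the string it started with. The sketch below shows that round trip with UTF-8 and UTF-16, both built into every .NET runtime. The same should hold for EUC-JP as long as every character is representable in it. `RoundTripDemo` and `RoundTrip` are hypothetical names for a condensed version of the function above.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Same steps as convertEncoding: encode, convert the bytes, decode with the target encoding.
    public static string RoundTrip(string s, Encoding source, Encoding target)
    {
        byte[] sourceBytes = source.GetBytes(s);
        byte[] targetBytes = Encoding.Convert(source, target, sourceBytes);
        return target.GetString(targetBytes);
    }

    public static void Main()
    {
        string text = "\u30CD\u30B3"; // the two characters of ネコ

        // The string that comes back equals the string that went in.
        Console.WriteLine(RoundTrip(text, Encoding.UTF8, Encoding.Unicode) == text);

        // What differs between encodings is the byte sequence, not the string.
        Console.WriteLine(Encoding.UTF8.GetByteCount(text));    // 6 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(text)); // 4 bytes
    }
}
```

The encoding only matters at the point where bytes cross into native code, so that is where it has to be chosen: by passing a `byte[]` or an `IntPtr`, not by reshaping the `string`.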
Suggestions/taunts?
Solution
I came across this thread looking for a way to do the same. I used your code as a starting point and this blog post for figuring out how to marshal UTF8 strings.

The following code gives me properly encoded output:
public class Mecab
{
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl, CharSet=CharSet.Unicode)]
    private extern static IntPtr mecab_new2(string arg);
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
    private extern static IntPtr mecab_sparse_tostr(IntPtr m, byte[] str);
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
    private extern static void mecab_destroy(IntPtr m);

    public static String Parse(String input)
    {
        IntPtr mecab = mecab_new2("");
        IntPtr nativeStr = mecab_sparse_tostr(mecab, Encoding.UTF8.GetBytes(input));
        int size = nativeArraySize(nativeStr) - 1;
        byte[] data = new byte[size];
        Marshal.Copy(nativeStr, data, 0, size);

        mecab_destroy(mecab);

        return Encoding.UTF8.GetString(data);
    }

    private static int nativeArraySize(IntPtr ptr)
    {
        int size = 0;
        while (Marshal.ReadByte(ptr, size) > 0)
            size++;

        return size;
    }
}
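
One detail the code above leaves to chance: `Encoding.UTF8.GetBytes` does not append a terminating zero byte, while `mecab_sparse_tostr` takes a C string and reads until it finds one. A hedged sketch of two small helpers follows, one that builds a NUL-terminated UTF-8 buffer and one that reads a NUL-terminated UTF-8 string back from a native pointer. `Utf8Marshal`, `ToNullTerminatedUtf8` and `FromUtf8Ptr` are hypothetical names, and the demo uses unmanaged memory from `Marshal.AllocHGlobal` in place of a real MeCab result so that it runs without the DLL.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8Marshal
{
    // Encode a managed string as UTF-8 and leave one extra zero byte at the end.
    public static byte[] ToNullTerminatedUtf8(string s)
    {
        int byteCount = Encoding.UTF8.GetByteCount(s);
        byte[] buffer = new byte[byteCount + 1]; // new arrays are zero-filled
        Encoding.UTF8.GetBytes(s, 0, s.Length, buffer, 0);
        return buffer;
    }

    // Copy a NUL-terminated UTF-8 string out of native memory into a managed string.
    public static string FromUtf8Ptr(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero)
            return null;

        int length = 0;
        while (Marshal.ReadByte(ptr, length) != 0)
            length++;

        byte[] data = new byte[length];
        Marshal.Copy(ptr, data, 0, length);
        return Encoding.UTF8.GetString(data);
    }

    public static void Main()
    {
        string text = "\u30CD\u30B3"; // the two characters of ネコ
        byte[] input = ToNullTerminatedUtf8(text);

        // Stand-in for a char* returned by native code.
        IntPtr native = Marshal.AllocHGlobal(input.Length);
        try
        {
            Marshal.Copy(input, 0, native, input.Length);
            Console.WriteLine(FromUtf8Ptr(native) == text);
        }
        finally
        {
            Marshal.FreeHGlobal(native);
        }
    }
}
```

In `Parse`, this would mean passing `ToNullTerminatedUtf8(input)` instead of `Encoding.UTF8.GetBytes(input)`. Note that `FromUtf8Ptr` keeps every byte up to the terminator, whereas the original drops the final one.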


                        