Trying to get libmecab.dll (MeCab) to work with C#


Problem description

I'm trying to use the Japanese morphological analyzer MeCab in a C# program (Visual Studio 2010 Express, Windows 7), and something's going wrong with the encoding. If my input (pasted into a textbox) is this:

一方、広義の「ネコ」は、ネコ類(ネコ科動物)の一部、あるいはその全ての獣を指す包括的名称を指す。

Then my output (in another textbox) looks like this:

?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
(   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
)   åè©ž,サ変接続,*,*,*,*,*
?   åè©ž,サ変接続,*,*,*,*,*
?????????????????????????   åè©ž,サ変接続,*,*,*,*,*
EOS

I would guess that that's text in some other encoding being mistaken for UTF-8-encoded text. But assuming that it's EUC-JP and using Encoding.Convert to turn it into UTF-8 doesn't change the output; assuming that it's Shift-JIS and doing the same gives different gibberish. Also, while it's definitely processing the text - that's how MeCab output is supposed to be formatted - it doesn't appear to be interpreting the input as UTF-8, either. If it were doing so, there wouldn't be all those identical lines in the output starting with one-character "compounds," which it's clearly unable to identify.
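
One way to check the guess above: the åè©ž that starts every feature column is what the UTF-8 bytes of 名詞 ("noun") look like when each byte is read as a separate character in a single-byte Western code page. That would suggest MeCab is emitting UTF-8 and the bytes are being misread on the way back into .NET, rather than the other way round. Below is a minimal sketch of that effect. `MojibakeDemo`, `MisreadUtf8`, and `Repair` are made-up names, and ISO-8859-1 (code page 28591) is an assumed stand-in for whatever ANSI code page the marshaler actually used.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encode the text as UTF-8, then decode those bytes with a different
    // (wrong) encoding, one character per byte for a single-byte code page.
    public static string MisreadUtf8(string text, Encoding wrongEncoding)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return wrongEncoding.GetString(utf8Bytes);
    }

    // Reverse the misreading: turn the garbled characters back into bytes
    // with the wrong encoding, then decode those bytes as UTF-8.
    public static string Repair(string garbled, Encoding wrongEncoding)
    {
        byte[] originalBytes = wrongEncoding.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 approximates the ANSI code page in play.
        Encoding latin1 = Encoding.GetEncoding(28591);
        string garbled = MisreadUtf8("名詞", latin1);
        Console.WriteLine(garbled.Length);          // six chars, one per UTF-8 byte
        Console.WriteLine(Repair(garbled, latin1)); // the original text again
    }
}
```

Under Windows-1252 the byte 0x9E displays as ž, which matches the output above. Under ISO-8859-1 it is an invisible control character, so the garbled string from this sketch prints as åè© plus control characters. `Repair` only works when the wrong encoding maps every byte to a distinct character, as ISO-8859-1 does.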

I get yet another different-looking set of gibberish when I run the sentence through MeCab's command line. But, again, it's just a row of single question marks and parentheses going down the left, so it's not just the problem that the Windows command line doesn't support fonts with Japanese characters; again, it's just not reading the input in as UTF-8. (I did install MeCab in UTF-8 mode.)

The relevant parts of the code look like this:

[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static IntPtr mecab_new2(string arg);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
[return: MarshalAs(UnmanagedType.AnsiBStr)]
private extern static string mecab_sparse_tostr(IntPtr m, string str);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static void mecab_destroy(IntPtr m);

private string meCabParse(string jpnText)
{
    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);

    mecab_destroy(mecab);
    return parsedText;
}

(In terms of fiddling with plausible-looking things to see if they make a difference, I've tried switching "UnmanagedType.AnsiBStr" to "UnmanagedType.BStr," which gives the error "AccessViolationException was unhandled," and adding "CharSet=CharSet.Unicode" to the DllImport parameters, which turned the output into just "EOS".)

This is how I've been doing the conversion:

// 65001 = UTF-8 codepage, 20932 = EUC-JP codepage
private string convertEncoding(string sourceString, int sourceCodepage, int targetCodepage)
{
    Encoding sourceEncoding = Encoding.GetEncoding(sourceCodepage); 
    Encoding targetEncoding = Encoding.GetEncoding(targetCodepage);

    // convert source string into byte array
    byte[] sourceBytes = sourceEncoding.GetBytes(sourceString);

    // convert those bytes into target encoding
    byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);

    // byte array to char array
    char[] targetChars = new char[targetEncoding.GetCharCount(targetBytes, 0, targetBytes.Length)];

    // char array to target-encoded string
    targetEncoding.GetChars(targetBytes, 0, targetBytes.Length, targetChars, 0);
    string targetString = new string(targetChars);

    return targetString;
}

private string meCabParse(string jpnText)
{
    // convert the text from the string from UTF-8 to EUC-JP
    jpnText = convertEncoding(jpnText, 65001, 20932);

    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);

    // annnd convert back to UTF-8
    parsedText = convertEncoding(parsedText, 20932, 65001);

    mecab_destroy(mecab);
    return parsedText;
}
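
A note on why the EUC-JP attempt changes nothing: a .NET string is always UTF-16 internally, so encoding it to bytes, converting those bytes, and decoding them with the target encoding gives back the original string whenever both encodings can represent every character. The byte-level encoding only matters at the point where bytes cross into native code. Here is a small sketch of the round trip, using UTF-8 and big-endian UTF-16 as the two encodings because they are available on every runtime. `RoundTripDemo` and `ConvertEncoding` are illustrative names that mirror the method above.

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    // Same steps as convertEncoding above: string -> source bytes ->
    // target bytes -> string decoded with the target encoding.
    public static string ConvertEncoding(string source, Encoding sourceEncoding, Encoding targetEncoding)
    {
        byte[] sourceBytes = sourceEncoding.GetBytes(source);
        byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);
        return targetEncoding.GetString(targetBytes);
    }

    public static void Main()
    {
        string input = "広義の「ネコ」";
        string output = ConvertEncoding(input, Encoding.UTF8, Encoding.BigEndianUnicode);
        // The string that comes out compares equal to the one that went in.
        Console.WriteLine(input == output);
    }
}
```

Characters that the target code page cannot represent are replaced during encoding (typically with ?), which may be why the Shift-JIS attempt produced different gibberish instead of identical output.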

Suggestions/taunts?

Solution

I came across this thread while looking for a way to do the same thing. I used your code as a starting point, and this blog post to figure out how to marshal UTF-8 strings.

The following code gives me properly encoded output:

public class Mecab
{
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
    private extern static IntPtr mecab_new2(string arg);
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
    private extern static IntPtr mecab_sparse_tostr(IntPtr m, byte[] str);
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
    private extern static void mecab_destroy(IntPtr m);

    public static String Parse(String input)
    {
        IntPtr mecab = mecab_new2("");
        IntPtr nativeStr = mecab_sparse_tostr(mecab, Encoding.UTF8.GetBytes(input));
        int size = nativeArraySize(nativeStr) - 1;
        byte[] data = new byte[size];
        Marshal.Copy(nativeStr, data, 0, size);

        mecab_destroy(mecab);

        return Encoding.UTF8.GetString(data);
    }

    private static int nativeArraySize(IntPtr ptr)
    {
        int size = 0;
        while (Marshal.ReadByte(ptr, size) > 0)
            size++;

        return size;
    }
}
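
Two details of the code above may be worth tightening. `Encoding.UTF8.GetBytes` does not append a terminating zero byte, while a native function that takes a `const char*` normally expects one, and the `- 1` drops the final byte of the result (presumably the trailing newline after EOS). The sketch below shows the two marshaling helpers on their own, so they can be tried without libmecab.dll. `Utf8Interop`, `ToNullTerminatedUtf8`, and `PtrToStringUtf8` are hypothetical names, not part of MeCab or .NET.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class Utf8Interop
{
    // Encode a managed string as UTF-8 with an explicit trailing 0 byte,
    // suitable for passing as byte[] to a native const char* parameter.
    public static byte[] ToNullTerminatedUtf8(string text)
    {
        int byteCount = Encoding.UTF8.GetByteCount(text);
        byte[] buffer = new byte[byteCount + 1]; // last element stays 0
        Encoding.UTF8.GetBytes(text, 0, text.Length, buffer, 0);
        return buffer;
    }

    // Copy a zero-terminated UTF-8 string out of native memory.
    // Returns null for a null pointer instead of reading from address 0.
    public static string PtrToStringUtf8(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero)
            return null;

        int length = 0;
        while (Marshal.ReadByte(ptr, length) != 0)
            length++;

        byte[] data = new byte[length];
        Marshal.Copy(ptr, data, 0, length);
        return Encoding.UTF8.GetString(data);
    }
}
```

With these helpers, `Parse` would call `mecab_sparse_tostr(mecab, Utf8Interop.ToNullTerminatedUtf8(input))` and pass the returned pointer to `Utf8Interop.PtrToStringUtf8` before calling `mecab_destroy`, as the original does. Trimming the trailing newline is then an ordinary `TrimEnd('\n')` on the managed string.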
