regexp_tokenize and Arabic text

Views: 144 Published: 2020/5/18 1:20:13 Tags: python regex nltk

Problem description

I'm using regexp_tokenize() to split an Arabic text into tokens with the punctuation marks removed:

import re,string,sys
from nltk.tokenize import  regexp_tokenize

def PreProcess_text(Input):
  tokens=regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
  return tokens

H = raw_input('H:')
Cleand= PreProcess_text(H)
print  '\n'.join(Cleand)

It runs without errors, but the problem appears when I try to print the tokens.

Output for the text ايمان،سعد:

    ?يم
    ?ن
    ?
    ?
    ?

But if the text is in English, even with Arabic punctuation marks, it prints the right result.

Output for the text hi،eman:

     hi
     eman

Recommended answer

When you use raw_input, you get a byte string back, so each non-ASCII symbol arrives as a sequence of encoded bytes rather than as a single character.
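To see why byte input produces broken tokens, here is a minimal sketch. It is written for Python 3 (an assumption; the question uses Python 2) and uses the standard `re` module as a stand-in for `regexp_tokenize`. The raw input is simulated by encoding the text as UTF-8, which assumes a UTF-8 terminal. A character class built from bytes matches individual bytes, so the split lands in the middle of multi-byte Arabic characters.

```python
import re

text = 'ايمان،سعد'
punct = '،؟!.؛'  # same punctuation set as the question's regex

# Simulate what a byte-oriented input call hands over on a UTF-8 terminal.
raw = text.encode('utf-8')

# A character class built from bytes matches single bytes, not characters.
byte_pattern = re.compile(b'[' + punct.encode('utf-8') + b']\\s*')
byte_tokens = [tok for tok in byte_pattern.split(raw) if tok]

# The same class over Unicode strings matches whole characters.
char_pattern = re.compile('[' + punct + r']\s*')
char_tokens = [tok for tok in char_pattern.split(text) if tok]

print(len(byte_tokens))  # 5
print(byte_tokens)       # fragments that begin with orphaned continuation bytes
print(len(char_tokens))  # 2
```

The byte-level split yields five fragments, which is consistent with the five garbled lines shown in the question. ASCII text such as hi،eman survives because none of its letters share a byte with the Arabic punctuation marks.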

You need to convert it into a Unicode string with:

H.decode('utf8')

And you can keep your regex:

tokens=regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
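The decode-then-tokenize fix can be sketched as follows. This is an illustration under stated assumptions, not the answerer's code: it targets Python 3, where string literals are already Unicode, and it uses `re.split` in place of `regexp_tokenize`. As far as I understand NLTK's `gaps=True` mode, it splits on the pattern and drops empty tokens, which is what the list comprehension imitates. The names `PUNCT_GAPS` and `preprocess_text` are hypothetical.

```python
import re

# Arabic comma, question mark and semicolon plus ASCII '!' and '.',
# each followed by optional whitespace, as in the question's regex.
PUNCT_GAPS = re.compile(r'[،؟!.؛]\s*')

def preprocess_text(raw):
    """Decode UTF-8 bytes if needed, then split on punctuation gaps."""
    # Assumption: byte input is UTF-8 encoded, as in the answer's decode call.
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    # Split on the separators and discard empty tokens.
    return [tok for tok in PUNCT_GAPS.split(text) if tok]

tokens = preprocess_text('ايمان،سعد'.encode('utf-8'))
print(len(tokens))  # 2
```

With NLTK installed, the body of the function could call `regexp_tokenize(text, r'[،؟!.؛]\s*', gaps=True)` on the decoded text instead. The key point is that both the pattern and the input are Unicode strings before any matching happens.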