How can we separate UTF-8 characters into words if some of the characters are Chinese?


Problem description

I am writing a program that takes a UTF-8 string and splits it into words. For Latin characters this is simple: split on spaces. For Chinese characters it is also simple: every character is a word.

But what should I do when the two are mixed in the same string?

I suppose I could detect whether a character is Chinese or not, or more generally whether it belongs to a script whose words are separated by spaces or one whose words are separated by nothing.

What is the standard way to do this?

For example, I want to split 我喜欢马 into

I
Like
Horse

Or perhaps I should split on anything that is not alphanumeric, where "alphanumeric" also covers accented letters and the letters and digits of non-Latin scripts? If so, how should I proceed? Is there a regex that matches anything that is not alphanumeric in that broad sense, so that accented words, the Hebrew alephbet, the Arabic abjad, and so on are all kept intact?

I want to split 北小金駅南口第1自転車駐車場 into

北
小
金
駅
南
...

because each character in Chinese is a word.

What makes this problem tricky is that word splitting works differently for Chinese and Western text: Western words are separated by spaces, while Chinese characters are separated by nothing.

I suppose we could detect whether each character is Chinese before splitting. That would be fine, but I don't know how to do that either.

Recommended answer

Use regular expressions. A metacharacter such as \b matches at word boundaries, that is, at every transition between a word character and a non-word character, and it does so regardless of which script the word characters belong to.

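// Requires: using System.Text.RegularExpressions; the verbatim string (@) keeps \b as a regex word boundary instead of a backspace character.
Regex.Split(myString, @"\b", RegexOptions.None)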
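
One caveat worth checking before relying on \b alone: Han characters are themselves word characters, so a regex engine normally reports no boundary between two adjacent Chinese characters, and a run such as 我喜欢马 would come back as a single piece. The following is a minimal sketch of one possible way to get the behaviour the question describes (one token per Chinese character, one token per run of other letters and digits). It is written in Python rather than .NET purely for illustration. The names `tokenize`, `_HAN` and `_TOKEN` are hypothetical, and the list of code point ranges is an assumption that covers the main CJK ideograph blocks only; it does not cover kana, Hangul or other scripts written without spaces.

```python
import re

# Assumed "one character = one word" ranges (extend as needed):
#   U+3400-U+4DBF    CJK Unified Ideographs Extension A
#   U+4E00-U+9FFF    CJK Unified Ideographs
#   U+F900-U+FAFF    CJK Compatibility Ideographs
#   U+20000-U+2FA1F  supplementary-plane ideograph blocks
_HAN = "\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\U00020000-\U0002fa1f"

# A token is either exactly one Han character, or a run of word
# characters (Unicode letters, digits, underscore) that are not Han.
# [^\W...] reads as "not a non-word character and not in these ranges".
_TOKEN = re.compile(rf"[{_HAN}]|[^\W{_HAN}]+")


def tokenize(text):
    """Return the tokens of text; spaces and punctuation are dropped."""
    return _TOKEN.findall(text)


if __name__ == "__main__":
    for sample in ("我喜欢马", "I like 马 very much", "北小金駅南口第1自転車駐車場"):
        print(tokenize(sample))
```

Because the second alternative is built on the Unicode-aware `\w` class, accented Latin letters, Hebrew and Arabic letters are kept together as ordinary space-separated words, which is the "anything not alphanumeric is a separator" rule raised in the question. Note that this only implements the per-character rule from the question; it makes no attempt at dictionary-based segmentation of multi-character Chinese words.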
