Parse text into words?


Problem description


I need a very efficient way to parse large amounts of text (GBs) on
word boundaries. Each word will then be added to an array as long as it
hasn't already been added. Splitting on a space is a bit too basic
since punctuation will remain. Maybe regex?

Thanks for any insights.

Jim

Recommended answer


You've got a few choices. Regex.Split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().
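
The Matches() variant might look roughly like the sketch below. The module name RegexWordScan, the function UniqueWords, the \w+ pattern, and the HashSet used for the "not already added" check are all illustrative assumptions, not something the poster specified.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexWordScan
    ' Returns the distinct words found in text, in first-seen order.
    ' (Hypothetical helper; name and pattern are assumptions.)
    Function UniqueWords(ByVal text As String) As List(Of String)
        Dim seen As New HashSet(Of String)()
        Dim result As New List(Of String)()

        ' \w+ matches runs of letters, digits and underscores, so the
        ' punctuation and whitespace between words are skipped.
        For Each m As Match In Regex.Matches(text, "\w+")
            ' HashSet.Add returns False when the word is already present.
            If seen.Add(m.Value) Then result.Add(m.Value)
        Next

        Return result
    End Function
End Module
```

Note that \w+ treats the apostrophe as a boundary, so "haven't" comes out as "haven" and "t"; a pattern such as [\w']+ would be one way around that. For gigabytes of input you would presumably call this on one line or chunk at a time rather than on the whole text.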

The other option is to write a lexical analyzer (lexer). There might
be some .NET equivalents of the old reliable Lex and Flex. I'm not sure
they'd be faster in this case, and they seem like massive overkill to me.

Or if you're really insane, you can hand-write a lexical analyzer. :-)


Why can't you split on the space and strip out the punctuation (since there
will only be a limited number of punctuation types)? This
seems to be the most efficient and simple way to do it.

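' veryLargeString is assumed to hold the input text
Dim x As String = veryLargeString

' Replace each punctuation mark that is followed by a space with a single space
x = x.Replace(", ", " ")
x = x.Replace(". ", " ")
x = x.Replace(": ", " ")
x = x.Replace("; ", " ")

' Split the cleaned text into words on the space character
Dim y As String() = x.Split(" "c)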
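
A variation on the same idea, offered only as a sketch: String.Split also accepts an array of separator characters, so the punctuation can be dropped in the same call that does the splitting, without the intermediate Replace passes. The separator set, the module name SplitWordScan, the function DistinctWords, and the HashSet used to skip words already added are assumptions for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module SplitWordScan
    ' Characters treated as word separators (an assumed, incomplete set).
    Private ReadOnly Separators As Char() = New Char() {
        " "c, ","c, "."c, ":"c, ";"c, "!"c, "?"c,
        ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}

    ' Splits text on the separators and returns each word once,
    ' in first-seen order. (Hypothetical helper.)
    Function DistinctWords(ByVal text As String) As String()
        ' RemoveEmptyEntries drops the empty strings produced by
        ' adjacent separators such as ", ".
        Dim parts As String() = text.Split(Separators, StringSplitOptions.RemoveEmptyEntries)

        Dim seen As New HashSet(Of String)()
        Dim words As New List(Of String)()
        For Each part As String In parts
            ' HashSet.Add returns False for a word that was already added.
            If seen.Add(part) Then words.Add(part)
        Next
        Return words.ToArray()
    End Function
End Module
```

Unlike the Replace version, this also catches punctuation at the very end of the text or directly before a line break, since it does not depend on a trailing space.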




Jim,

If I understand you correctly, the combination of the VB method InStr and
a SortedList will be the quickest way to achieve what you want.

You then go through your text in a loop. Each time a word is found, you update
the starting point passed to InStr and store the word you found as the key of
a key/value pair in the SortedList.
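
One way to read that description, as a rough sketch rather than Cor's actual code: search for the next space with InStr, cut out the word in front of it, add it as a SortedList key if it is not there yet, and move the start position past the space. The module name InStrWordScan, the function CollectWords, the Boolean placeholder value, and the ordinal comparer are assumptions.

```vb
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module InStrWordScan
    ' Collects the distinct space-delimited words of text as SortedList keys.
    ' (Hypothetical helper; the Boolean value is only a placeholder.)
    Function CollectWords(ByVal text As String) As SortedList(Of String, Boolean)
        Dim words As New SortedList(Of String, Boolean)(StringComparer.Ordinal)
        Dim start As Integer = 1 ' InStr and Mid use 1-based positions

        While start <= text.Length
            Dim spacePos As Integer = InStr(start, text, " ")
            ' InStr returns 0 when no further space exists: take the rest as the last word.
            If spacePos = 0 Then spacePos = text.Length + 1

            If spacePos > start Then
                Dim word As String = Mid(text, start, spacePos - start)
                If Not words.ContainsKey(word) Then words.Add(word, True)
            End If

            ' Continue scanning just after the space that was found.
            start = spacePos + 1
        End While

        Return words
    End Function
End Module
```

This sketch only splits on spaces, so punctuation attached to a word stays part of the key; it would still need to be stripped separately.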

http://msdn.microsoft.com/library/de...vafctinstr.asp

http://msdn.microsoft.com/library/de...classtopic.asp

Of one thing you can be sure with Regex: it will probably take at least 50
times longer than the approach above.

I hope this helps,

Cor


