Find the words in a long stream of characters. Auto-tokenize

Problem description

How would you find the correct words in a long stream of characters?

Input:

"The revised report onthesyntactictheoriesofsequentialcontrolandstate"

Google's Output:

"The revised report on syntactic theories sequential controlandstate"

(which is close enough, considering the time in which they produced the output)

How do you think Google does it? How would you increase the accuracy?

Recommended answer

I would try a recursive algorithm like this:

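  • Try inserting a space at each position. If the left part is a word, recurse on the right part.
  • For each final output, compute the ratio of valid words to total words. The output with the best ratio is likely the answer.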

For example, giving it "thesentenceisgood" would run:

thesentenceisgood
the sentenceisgood
    sent enceisgood
         enceisgood: OUT1: the sent enceisgood, 2/3
    sentence isgood
             is good
                go od: OUT2: the sentence is go od, 4/5
             is good: OUT3: the sentence is good, 4/4
    sentenceisgood: OUT4: the sentenceisgood, 1/2
these ntenceisgood
      ntenceisgood: OUT5: these ntenceisgood, 1/2

So you would pick OUT3 as the answer.
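
The recursion and scoring described above could be sketched roughly as follows. This is only an illustration, not the answerer's actual code: the tiny `WORDS` set, the function names, and the tie-breaking rule (prefer fewer tokens when ratios are equal) are all assumptions made for the example. It also emits the unsplit remainder as a candidate at every level, so it lists a few more outputs than the trace above.

```python
from fractions import Fraction

# Toy word list, assumed for illustration; a real system would load a full dictionary.
WORDS = {"the", "these", "sent", "sentence", "is", "go", "good"}

def segmentations(text, words=WORDS):
    """Yield every candidate split of `text` as a list of tokens."""
    # Candidate: stop splitting and keep the rest as one token (word or not).
    yield [text]
    # Try a space at each position; recurse only when the left part is a word.
    for i in range(1, len(text)):
        left, right = text[:i], text[i:]
        if left in words:
            for rest in segmentations(right, words):
                yield [left] + rest

def score(tokens, words=WORDS):
    """Ratio of valid words to total tokens in one candidate output."""
    return Fraction(sum(t in words for t in tokens), len(tokens))

def best_segmentation(text, words=WORDS):
    # Tie-break (an assumption): among equal ratios, prefer fewer tokens.
    return max(segmentations(text, words),
               key=lambda tokens: (score(tokens, words), -len(tokens)))

print(" ".join(best_segmentation("thesentenceisgood")))
```

With this toy word list the sketch selects "the sentence is good", matching OUT3. Plain enumeration like this can grow exponentially with input length, so caching results per remaining suffix would be one way to keep it tractable on longer inputs.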
