Split text into paragraphs with NLTK: usage of nltk.tokenize.texttiling?


Question

I was looking at methods to split documents into paragraphs, and I came across texttiling as one possible way to do this.

Here is my attempt to use it. However, I don't understand how to work with the output. I'd appreciate your help.

t = unidecode(doclist[0].decode('utf-8','ignore'))

nltk.tokenize.texttiling.TextTilingTokenizer(t)

Output:

<nltk.tokenize.texttiling.TextTilingTokenizer at 0x11e9c6350>

Recommended answer

I'm experimenting with this myself right now, for the same reason you are, and I had the same question you did, so don't be too upset if this is wrong. I figured it was best to pass on what little I know. :)

I'm not sure yet, but I found an example of using the TextTilingTokenizer in this bug report:

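import nltk

# Needs the NLTK data packages 'gutenberg' and 'stopwords' (install via nltk.download).
alice = nltk.corpus.gutenberg.raw('carroll-alice.txt')
ttt = nltk.tokenize.TextTilingTokenizer()
tiles = ttt.tokenize(alice[140309:])  # a list of strings, one per segment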

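It appears that you want to pass your text to the tokenize method of the TextTilingTokenizer, not to its constructor. Calling the constructor only creates a tokenizer object, which is why your output shows an object representation instead of segments.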
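To make the "how do I work with the output" part concrete, here is a minimal sketch, not a definitive recipe. The helper names tile_text and describe_tiles are hypothetical, made up for this illustration. It assumes the input already contains blank-line paragraph breaks (as far as I can tell, TextTiling groups existing paragraphs into topical segments and does not invent breaks in unbroken text) and that the NLTK stopwords data is installed, unless you pass your own stopword list.

```python
def tile_text(text, w=20, k=10, stopwords=None):
    """Split `text` into topical segments with NLTK's TextTilingTokenizer.

    Assumptions: `text` contains blank-line paragraph breaks, and the NLTK
    stopwords corpus is installed (nltk.download('stopwords')) unless an
    explicit `stopwords` list is passed.
    """
    # Imported here so describe_tiles() can be used without NLTK installed.
    from nltk.tokenize import TextTilingTokenizer

    # The constructor takes configuration only; the text goes to tokenize().
    tokenizer = TextTilingTokenizer(w=w, k=k, stopwords=stopwords)
    return tokenizer.tokenize(text)  # list of strings, one per segment


def describe_tiles(tiles, preview_chars=60):
    """Return a (index, word_count, preview) tuple for each segment."""
    summary = []
    for index, tile in enumerate(tiles):
        cleaned = " ".join(tile.split())  # collapse newlines and extra spaces
        summary.append((index, len(cleaned.split()), cleaned[:preview_chars]))
    return summary
```

With a document string of your own (say, a hypothetical variable named document), something like describe_tiles(tile_text(document)) gives one tuple per segment, which you can print or filter. Since tokenize returns ordinary Python strings, you can also just loop over the list and handle each segment like any other piece of text.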
