Python: Tokenizing with phrases


Question


I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want to be tokenized as a single token, instead of the regular tokenization.

For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999, to May 14, 2006," and adding the phrase "the west wing" to the tokenizer, the resulting tokens would be:

  • the west wing
  • is
  • an
  • american
  • ...

What's the best way to accomplish this? I'd prefer to stay within the bounds of tools like NLTK.

Solution

If you have a fixed set of phrases that you're looking for, then the simple solution is to tokenize your input and "reassemble" the multi-word tokens. Alternatively, do a regexp search & replace before tokenizing that turns The West Wing into The_West_Wing.
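The search-and-replace option can be sketched in plain Python with the standard `re` module; the function name and phrase list here are illustrative, not from the original answer:

```python
import re

def protect_phrases(text, phrases):
    """Replace the internal spaces of each known phrase with underscores,
    so a later whitespace tokenizer keeps the phrase as one token."""
    for phrase in phrases:
        # Match the phrase case-insensitively, tolerating extra whitespace.
        pattern = re.compile(r"\s+".join(map(re.escape, phrase.split())),
                             re.IGNORECASE)
        text = pattern.sub(lambda m: re.sub(r"\s+", "_", m.group(0)), text)
    return text

sentence = "The West Wing is an American television serial drama."
print(protect_phrases(sentence, ["the west wing"]))
# The_West_Wing is an American television serial drama.
```

After this pre-pass, any whitespace-based tokenizer sees `The_West_Wing` as a single token, which you can split back apart afterwards if needed.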

For more advanced options, use regexp_tokenize or see chapter 7 of the NLTK book.
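Worth noting that NLTK ships a tokenizer for exactly this "reassemble multi-word tokens" step: `MWETokenizer` in `nltk.tokenize`. A small sketch, assuming the input has already been lower-cased and split on whitespace:

```python
from nltk.tokenize import MWETokenizer

# Register the multi-word expressions to keep together; `separator`
# controls how the merged pieces are joined back into one token.
tokenizer = MWETokenizer([("the", "west", "wing")], separator=" ")

tokens = "the west wing is an american television serial drama".split()
print(tokenizer.tokenize(tokens))
# ['the west wing', 'is', 'an', 'american', 'television', 'serial', 'drama']
```

`MWETokenizer` operates on an already-tokenized list, so you can feed it the output of any base tokenizer (e.g. `word_tokenize`) rather than a plain `split()`.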
