How to restore punctuation using Python?

Question

I would like to restore commas and full stops in text that has no punctuation. For example, take this sentence:

I am XYZ I want to execute I have a doubt

I would like to detect that the example above should contain one comma and two full stops:

I am XYZ, I want to execute. I have a doubt.

Can anyone advise me on how to achieve this using Python and NLP concepts?

Recommended answer

If I understand correctly, you want to improve the quality of a sentence by adding the appropriate punctuation. This is sometimes called punctuation restoration.

A good first step is to apply the usual NLP pipeline, namely tokenization, POS tagging, and parsing, using a library such as NLTK or spaCy.

Once this preprocessing is done, you'll have to apply a rule-based or machine-learning approach to decide where the punctuation should go, based on the features extracted from the NLP pipeline (e.g. sentence boundaries, parse tree, POS tags).
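As a rough illustration of the rule-based option, here is a minimal sketch that reproduces the example from the question. Everything in it is an assumption made for illustration: `TOY_POS` is a hypothetical hand-written lexicon standing in for the tags a real NLTK or spaCy pipeline would provide, and both rules (including the `comma_max_len` threshold) are arbitrary and will not generalise to real text.

```python
# Toy rule-based punctuation restorer. The lexicon and both rules are
# illustrative assumptions, not the method of any particular library.

# Hypothetical mini POS lexicon; a real pipeline would take its tags
# from a tagger such as the ones in NLTK or spaCy.
TOY_POS = {
    "i": "PRON", "you": "PRON", "we": "PRON", "they": "PRON",
    "am": "VERB", "want": "VERB", "have": "VERB", "execute": "VERB",
    "to": "PART", "a": "DET", "doubt": "NOUN",
}


def tag(tokens):
    """Look up each token in the toy lexicon; unknown words become 'X'."""
    return [TOY_POS.get(tok.lower(), "X") for tok in tokens]


def restore_punctuation(text, comma_max_len=3):
    """Insert commas and full stops using two hand-written rules."""
    tokens = text.split()
    if not tokens:
        return ""
    tags = tag(tokens)

    # Rule 1: a pronoun directly followed by a verb opens a new clause.
    starts = [0] + [
        i for i in range(1, len(tokens) - 1)
        if tags[i] == "PRON" and tags[i + 1] == "VERB"
    ]
    bounds = starts + [len(tokens)]
    clauses = [tokens[a:b] for a, b in zip(bounds, bounds[1:])]

    # Rule 2 (arbitrary threshold): a short non-final clause ends with a
    # comma; every other clause ends with a full stop.
    pieces = []
    for n, clause in enumerate(clauses):
        is_last = n == len(clauses) - 1
        mark = "," if not is_last and len(clause) <= comma_max_len else "."
        pieces.append(" ".join(clause) + mark)
    return " ".join(pieces)


print(restore_punctuation("I am XYZ I want to execute I have a doubt"))
# prints: I am XYZ, I want to execute. I have a doubt.
```

The point of the sketch is the shape of the solution (tag the tokens, find clause boundaries, choose a mark per boundary), not the specific rules, which would need to be replaced by ones derived from real parser output.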

However, this is not a trivial task. Customising your own algorithm can require strong NLP/AI skills.
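If you do go the machine-learning route, one common way to frame the problem is per-token labelling: each word gets a label saying which mark, if any, follows it. The sketch below only shows how training pairs could be derived from already-punctuated text; the label names (`COMMA`, `PERIOD`, `O`) and the function `to_training_pairs` are assumptions for this example, and the simple regular expression would mishandle abbreviations such as "e.g.".

```python
import re

# Label names are assumptions chosen for this sketch.
LABELS = {",": "COMMA", ".": "PERIOD"}


def to_training_pairs(punctuated_text):
    """Turn punctuated text into (word, label) pairs.

    The label records which mark follows the word:
    COMMA, PERIOD, or O when no mark follows.
    """
    pairs = []
    # Split into words and the two marks we want to predict.
    for piece in re.findall(r"[^\s,.]+|[,.]", punctuated_text):
        if piece in LABELS:
            if pairs:  # attach the mark to the preceding word
                word, _ = pairs[-1]
                pairs[-1] = (word, LABELS[piece])
        else:
            pairs.append((piece, "O"))
    return pairs


pairs = to_training_pairs("I am XYZ, I want to execute. I have a doubt.")
for word, label in pairs:
    print(word, label)
```

Joining the words of `pairs` gives back the unpunctuated input, and the labels are what a sequence model would be trained to predict.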

Some examples that can be reused:

  • Here is a simple approach using spaCy, mainly based on sentence boundaries.
  • Here is a more complex solution, using the Theano deep learning library.
