How to restore punctuation using Python?
Question
I would like to restore commas and full stops in text without punctuation. For example, let's take this sentence:
I am XYZ I want to execute I have a doubt
I would like to detect that the example above should contain one comma and two full stops:
I am XYZ, I want to execute. I have a doubt.
Can anyone advise me on how to achieve this using Python and NLP concepts?
Recommended answer
If I understand correctly, you want to improve the quality of a sentence by adding the appropriate punctuation. This task is sometimes called punctuation restoration.
A good first step is to apply the usual NLP pipeline, namely tokenization, POS tagging, and parsing, using a library such as NLTK or spaCy.
Once this preprocessing is done, you will need a rule-based or machine learning approach to decide where the punctuation should go, based on the features extracted by the NLP pipeline (e.g. sentence boundaries, parse tree, POS tags).
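One common way to set up the machine learning variant is as a labelling problem: each word gets a label saying which mark, if any, follows it. The sketch below only shows that data framing, using the sentence from the question. The label names (`COMMA`, `PERIOD`, `O`) and the helper functions `to_labels` and `apply_labels` are illustrative assumptions, not part of NLTK or spaCy, and no model is trained here.

```python
import re

# Hypothetical label names; any consistent scheme would do.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD"}
LABEL_TO_PUNCT = {label: mark for mark, label in PUNCT_LABELS.items()}


def to_labels(punctuated_text):
    """Split punctuated text into words plus one label per word.

    Each label names the mark (if any) that follows the word.
    """
    words, labels = [], []
    # A token is either a run of non-space, non-mark characters or a single mark.
    for token in re.findall(r"[^\s,.]+|[,.]", punctuated_text):
        if token in PUNCT_LABELS:
            if labels:  # attach the mark to the preceding word
                labels[-1] = PUNCT_LABELS[token]
        else:
            words.append(token)
            labels.append("O")  # "O" means no mark after this word
    return words, labels


def apply_labels(words, labels):
    """Rebuild punctuated text from words and (predicted) labels."""
    pieces = [w + LABEL_TO_PUNCT.get(lab, "") for w, lab in zip(words, labels)]
    return " ".join(pieces)


words, labels = to_labels("I am XYZ, I want to execute. I have a doubt.")
restored = apply_labels(words, labels)
print(words)
print(labels)
print(restored)
```

Pairs produced by `to_labels` from already punctuated text could serve as training data, and a trained tagger's predictions could be fed to `apply_labels`; the features mentioned above (POS tags, parse information) would be inputs to that tagger.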
However, this is not a trivial task. It can require strong NLP/AI skills if you want to customise your algorithm.
Some examples you can reuse:
- Here is a simple approach using spaCy, mainly based on sentence boundaries.
- Here is a more complex solution, using the Theano deep learning library.
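To give a feel for the rule-based end of the spectrum, here is a toy sketch in plain Python. It is not the code from either example above. It assumes two tiny hand-written word lists (`SUBJECT_PRONOUNS` and `COMMON_VERBS`, both made up for this illustration) as a stand-in for a real POS tagger, and it treats "subject pronoun followed by a verb" as the start of a new clause. It only inserts full stops; choosing between a comma and a full stop would need richer features.

```python
# Tiny hand-written lexicons standing in for a real POS tagger (assumption).
SUBJECT_PRONOUNS = {"i", "you", "he", "she", "we", "they"}
COMMON_VERBS = {"am", "is", "are", "want", "have", "has", "need", "like"}


def find_boundaries(words):
    """Return indices of the words that likely end a clause."""
    boundaries = []
    for i in range(1, len(words) - 1):
        starts_clause = (
            words[i].lower() in SUBJECT_PRONOUNS
            and words[i + 1].lower() in COMMON_VERBS
        )
        if starts_clause:
            boundaries.append(i - 1)  # the word before the pronoun ends a clause
    if words:
        boundaries.append(len(words) - 1)  # the last word always ends a clause
    return boundaries


def restore_full_stops(text):
    """Append a full stop to every word flagged as a clause end."""
    words = text.split()
    ends = set(find_boundaries(words))
    return " ".join(w + "." if i in ends else w for i, w in enumerate(words))


result = restore_full_stops("I am XYZ I want to execute I have a doubt")
print(result)
```

On the sentence from the question this marks a boundary before each "I" that starts a clause. A rule this crude will misfire on real text, which is why the answer recommends building on a proper tokenizer, tagger, and parser.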