Split Sentences at Bullets and Numbering?
Problem description
I am trying to feed text into my word processor so that it is split into sentences first and then into words.
An example paragraph:
When the blow was repeated,together with an admonition in
childish sentences, he turned over upon his back, and held his paws in a peculiar manner.
1) This a numbered sentence
2) This is the second numbered sentence
At the same time with his ears and his eyes he offered a small prayer to the child.
Below are the examples
- This an example of bullet point sentence
- This is also an example of bullet point sentence
I tried the following code:
import nltk
from nltk.tokenize import TweetTokenizer

# `text` holds the example paragraph shown above

# Attempt 1: split into sentences, then tokenize each one with TweetTokenizer
tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in nltk.sent_tokenize(text)]
print(tokens_sentences)

# Attempt 2: the same idea with nltk.word_tokenize
sent_text = nltk.sent_tokenize(text)  # this gives us a list of sentences
# now loop over each sentence and tokenize it separately
for sentence in sent_text:
    tokenized_text = nltk.word_tokenize(sentence)
    print(tokenized_text)
The output I get:
[
['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'],
['1', ')', 'This', 'a', 'numbered', 'sentence', '2', ')', 'This', 'is', 'the', 'second', 'numbered', 'sentence','At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.']
['Below', 'are', 'the', 'examples', '-', 'This', 'an', 'example', 'of', 'bullet', 'point', 'sentence',
'-', 'This', 'also','an', 'example', 'of', 'bullet', 'point', 'sentence']
]
Required output:
[
['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'],
['1', ')', 'This', 'a', 'numbered', 'sentence']
['2', ')', 'This', 'is', 'the', 'second', 'numbered', 'sentence']
['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.']
['Below', 'are', 'the', 'examples']
['-', 'This', 'an', 'example', 'of', 'bullet', 'point', 'sentence']
['-', 'This', 'also','an', 'example', 'of', 'bullet', 'point', 'sentence']
]
How can I split sentences at bullets and numbering?
Solutions using spaCy would also be very helpful.
Recommended answer
This is one possible solution; you can customize it to fit your data.
text = """When the blow was repeated,together with an admonition in
childish sentences, he turned over upon his back, and held his paws in a peculiar manner.
1) This a numbered sentence
2) This is the second numbered sentence
At the same time with his ears and his eyes he offered a small prayer to the child.
Below are the examples
- This an example of bullet point sentence
- This is also an example of bullet point sentence"""
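import re
import nltk

sentences = nltk.sent_tokenize(text)
results = []
for sent in sentences:
    # Insert an extra newline before any line that starts with "-" or a digit,
    # so that list items can be split off on the resulting blank line.
    # Note: the pattern only breaks before such lines; a plain line that
    # directly follows a list item is not separated from it by this step.
    sent = re.sub(r'(\n)(-|[0-9])', r"\1\n\2", sent)
    for s in sent.split('\n\n'):
        results.append(nltk.word_tokenize(s))
print(results)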
[
['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'],
['1', ')', 'This', 'a', 'numbered', 'sentence']
['2', ')', 'This', 'is', 'the', 'second', 'numbered', 'sentence']
['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.']
['Below', 'are', 'the', 'examples']
['-', 'This', 'an', 'example', 'of', 'bullet', 'point', 'sentence']
['-', 'This', 'also','an', 'example', 'of', 'bullet', 'point', 'sentence']
]
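The regex in the answer only adds a break before a line that starts with "-" or a digit, so depending on how the sentence tokenizer segments the text, a plain line that directly follows a list item may stay attached to it. One possible alternative, shown here as a minimal sketch and not as part of the original answer, is to split on lines first and treat every list item as its own segment. The names `LIST_MARKER`, `split_segments`, and `simple_word_tokenize` are hypothetical, the marker pattern is an assumption about the data, and the regex-based word tokenizer is only a stand-in so that the sketch runs without NLTK.

```python
import re

# Assumed marker pattern: "-", "*", or "•", or digits followed by ")" or ".",
# then whitespace. Adjust it to the markers that occur in your data.
LIST_MARKER = re.compile(r"^(?:[-*\u2022]|\d+[.)])\s+")


def split_segments(text):
    """Return a list of segments; every list item becomes its own segment."""
    segments = []
    current = []  # wrapped prose lines collected for the segment in progress

    def flush():
        if current:
            segments.append(" ".join(current))
            current.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            flush()  # a blank line ends the segment in progress
        elif LIST_MARKER.match(line):
            flush()  # prose before the item is a segment of its own
            segments.append(line)
        else:
            current.append(line)
            if line.endswith((".", "!", "?")):
                flush()  # sentence-final punctuation ends the segment
    flush()
    return segments


def simple_word_tokenize(segment):
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", segment)


sample = """When the blow was repeated,together with an admonition in
childish sentences, he turned over upon his back, and held his paws in a peculiar manner.
1) This a numbered sentence
2) This is the second numbered sentence
At the same time with his ears and his eyes he offered a small prayer to the child.
Below are the examples
- This an example of bullet point sentence
- This is also an example of bullet point sentence"""

segments = split_segments(sample)
tokens = [simple_word_tokenize(s) for s in segments]
for t in tokens:
    print(t)
```

In practice `simple_word_tokenize` can be swapped for `nltk.word_tokenize`, and a prose segment that holds several sentences on one line can be passed through `nltk.sent_tokenize` before word tokenization. Splitting on lines first keeps the list handling independent of the sentence tokenizer, at the cost of assuming that list items never wrap onto a second line.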