Code for counting the number of sentences, words and characters in an input file
Problem description
I have written the following code to count the number of sentences, words and characters in the input file sample.txt, which contains a paragraph of text. It works fine for the number of sentences and words, but it does not give the precise, correct number of characters (excluding whitespace and punctuation marks).
lines, blanklines, sentences, words = 0, 0, 0, 0
num_chars = 0

print '-'*50

try:
    filename = 'sample.txt'
    textf = open(filename, 'r')
except IOError:
    print 'cannot open file %s for reading' % filename
    import sys
    sys.exit(0)

for line in textf:
    print line
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    else:
        sentences += line.count('.') + line.count('!') + line.count('?')
        tempwords = line.split(None)
        print tempwords
        words += len(tempwords)

textf.close()
print '-'*50
print "Lines:", lines
print "blank lines:", blanklines
print "sentences:", sentences
print "words:", words
import nltk
import nltk.data
import nltk.tokenize
with open('sample.txt', 'r') as f:
    for line in f:
        num_chars += len(line)

num_chars = num_chars - (words + 1)
pcount = 0
from nltk.tokenize import TreebankWordTokenizer
with open('sample.txt', 'r') as f1:
    for line in f1:
        #tokenised_words = nltk.tokenize.word_tokenize(line)
        tokenizer = TreebankWordTokenizer()
        tokenised_words = tokenizer.tokenize(line)
        for w in tokenised_words:
            if ((w == '.') | (w == ';') | (w == '!') | (w == '?')):
                pcount = pcount + 1
print "pcount:", pcount
num_chars = num_chars - pcount
print "chars:", num_chars
pcount is the number of punctuation marks. Can someone suggest the changes I need to make in order to find the exact number of characters, excluding whitespace and punctuation marks?
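For reference, the count being asked for (characters excluding whitespace and punctuation) can also be computed in a single pass over the file, without subtracting estimates afterwards. A minimal sketch, assuming string.punctuation covers all of the punctuation that actually occurs in sample.txt:

import string

# Characters to ignore: all whitespace plus ASCII punctuation.
skip = set(string.whitespace) | set(string.punctuation)

num_chars = 0
with open('sample.txt', 'r') as f:
    for line in f:
        num_chars += sum(1 for ch in line if ch not in skip)

print("chars (no whitespace or punctuation): %d" % num_chars)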
Recommended answer
import string
#
# Per-line counting functions
#
def countLines(ln): return 1                              # every line counts once
def countBlankLines(ln): return 0 if ln.strip() else 1    # 1 only for blank/whitespace-only lines
def countWords(ln): return len(ln.split())                # whitespace-separated tokens
def charCounter(validChars):
    vc = set(validChars)
    def counter(ln):
        return sum(1 for ch in ln if ch in vc)
    return counter
countSentences = charCounter('.!?')
countLetters = charCounter(string.letters)
countPunct = charCounter(string.punctuation)
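# Example calls (hypothetical strings, just to illustrate the closures above):
#   countSentences('One. Two! Three?')  # -> 3
#   countPunct('Hello, world!')         # -> 2  (the comma and the exclamation mark)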
#
# do counting
#
class FileStats(object):
    def __init__(self, countFns, labels=None):
        super(FileStats, self).__init__()
        self.fns = countFns
        self.labels = labels if labels else [fn.__name__ for fn in countFns]
        self.reset()

    def reset(self):
        self.counts = [0] * len(self.fns)

    def doFile(self, fname):
        try:
            with open(fname) as inf:
                for line in inf:
                    for i, fn in enumerate(self.fns):
                        self.counts[i] += fn(line)
        except IOError:
            print('Could not open file {0} for reading'.format(fname))

    def __str__(self):
        return '\n'.join('{0:20} {1:>6}'.format(label, count) for label, count in zip(self.labels, self.counts))
fs = FileStats(
    (countLines, countBlankLines, countSentences, countWords, countLetters, countPunct),
    ("Lines", "Blank Lines", "Sentences", "Words", "Letters", "Punctuation")
)
fs.doFile('sample.txt')
print(fs)
which produces:
Lines 101
Blank Lines 12
Sentences 48
Words 339
Letters 1604
Punctuation 455
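The answer above targets Python 2 (the question's print statements, and string.letters). Under Python 3, string.letters no longer exists; a hedged sketch of the same letter and punctuation counting adapted to Python 3, using string.ascii_letters instead:

import string

def charCounter(validChars):
    vc = set(validChars)
    def counter(ln):
        return sum(1 for ch in ln if ch in vc)
    return counter

# string.letters was removed in Python 3; ascii_letters is the closest drop-in.
countLetters = charCounter(string.ascii_letters)
countPunct = charCounter(string.punctuation)

with open('sample.txt') as inf:   # sample.txt as in the question
    text = inf.read()

print("Letters:", countLetters(text))
print("Punctuation:", countPunct(text))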