Code for counting the number of sentences, words and characters in an input file
Problem description
I have written the following code to count the number of sentences, words and characters in the input file sample.txt, which contains a paragraph of text. It works fine for the number of sentences and words, but it does not give the precise, correct number of characters (excluding whitespace and punctuation marks).
lines, blanklines, sentences, words = 0, 0, 0, 0
num_chars = 0

print '-'*50

try:
    filename = 'sample.txt'
    textf = open(filename, 'r')
except IOError:
    print 'cannot open file %s for reading' % filename
    import sys
    sys.exit(0)

for line in textf:
    print line
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    else:
        sentences += line.count('.') + line.count('!') + line.count('?')
        tempwords = line.split(None)
        print tempwords
        words += len(tempwords)

textf.close()
print '-'*50
print "Lines:", lines
print "blank lines:", blanklines
print "sentences:", sentences
print "words:", words
import nltk
import nltk.data
import nltk.tokenize
with open('sample.txt', 'r') as f:
    for line in f:
        num_chars += len(line)

num_chars = num_chars - (words + 1)
pcount = 0
from nltk.tokenize import TreebankWordTokenizer
with open('sample.txt', 'r') as f1:
    for line in f1:
        #tokenised_words = nltk.tokenize.word_tokenize(line)
        tokenizer = TreebankWordTokenizer()
        tokenised_words = tokenizer.tokenize(line)
        for w in tokenised_words:
            if ((w == '.') | (w == ';') | (w == '!') | (w == '?')):
                pcount = pcount + 1
print "pcount:", pcount
num_chars = num_chars - pcount
print "chars:", num_chars
pcount is the number of punctuation marks. Can someone suggest the changes I need to make in order to find the exact number of characters, excluding whitespace and punctuation marks?
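For reference, the count being asked for (characters excluding whitespace and punctuation) can also be computed in a single pass over the file, without subtracting estimates afterwards. A minimal sketch, assuming string.punctuation covers all of the punctuation that actually occurs in sample.txt:

import string

# Characters to ignore: all whitespace plus ASCII punctuation.
skip = set(string.whitespace) | set(string.punctuation)

num_chars = 0
with open('sample.txt', 'r') as f:
    for line in f:
        num_chars += sum(1 for ch in line if ch not in skip)

print("chars (no whitespace or punctuation): %d" % num_chars)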
Recommended answer
import string
#
# Per-line counting functions
#
def countLines(ln): return 1                              # every line counts once
def countBlankLines(ln): return 0 if ln.strip() else 1    # 1 only for blank/whitespace-only lines
def countWords(ln): return len(ln.split())                # whitespace-separated tokens
def charCounter(validChars):
    vc = set(validChars)
    def counter(ln):
        return sum(1 for ch in ln if ch in vc)
    return counter
countSentences = charCounter('.!?')
countLetters = charCounter(string.letters)
countPunct = charCounter(string.punctuation)
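# Example calls (hypothetical strings, just to illustrate the closures above):
#   countSentences('One. Two! Three?')  # -> 3
#   countPunct('Hello, world!')         # -> 2  (the comma and the exclamation mark)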
#
# do counting
#
class FileStats(object):
    def __init__(self, countFns, labels=None):
        super(FileStats, self).__init__()
        self.fns = countFns
        self.labels = labels if labels else [fn.__name__ for fn in countFns]
        self.reset()

    def reset(self):
        self.counts = [0] * len(self.fns)

    def doFile(self, fname):
        try:
            with open(fname) as inf:
                for line in inf:
                    for i, fn in enumerate(self.fns):
                        self.counts[i] += fn(line)
        except IOError:
            print('Could not open file {0} for reading'.format(fname))

    def __str__(self):
        return '\n'.join('{0:20} {1:>6}'.format(label, count) for label, count in zip(self.labels, self.counts))
fs = FileStats(
    (countLines, countBlankLines, countSentences, countWords, countLetters, countPunct),
    ("Lines", "Blank Lines", "Sentences", "Words", "Letters", "Punctuation")
)
fs.doFile('sample.txt')
print(fs)
which produces:
Lines 101
Blank Lines 12
Sentences 48
Words 339
Letters 1604
Punctuation 455
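The answer above targets Python 2 (the question's print statements, and string.letters). Under Python 3, string.letters no longer exists; a hedged sketch of the same letter and punctuation counting adapted to Python 3, using string.ascii_letters instead:

import string

def charCounter(validChars):
    vc = set(validChars)
    def counter(ln):
        return sum(1 for ch in ln if ch in vc)
    return counter

# string.letters was removed in Python 3; ascii_letters is the closest drop-in.
countLetters = charCounter(string.ascii_letters)
countPunct = charCounter(string.punctuation)

with open('sample.txt') as inf:   # sample.txt as in the question
    text = inf.read()

print("Letters:", countLetters(text))
print("Punctuation:", countPunct(text))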