How to sum up the word count for each person in a dialogue?


Problem Description


I'm starting to learn Python and I'm trying to write a program that would import a text file, count the total number of words, count the number of words in a specific paragraph (said by each participant, described by 'P1', 'P2' etc.), exclude these words (i.e. 'P1' etc.) from my word count, and print paragraphs separately.

Thanks to @James Hurford I got this code:

words = None
with open('data.txt') as f:
   words = f.read().split()
total_words = len(words)
print 'Total words:', total_words

in_para = False
para_type = None
paragraph = list()
for word in words:
  if ('P1' in word or
      'P2' in word or
      'P3' in word ):
      if in_para == False:
         in_para = True
         para_type = word
      else:
         print 'Words in paragraph', para_type, ':', len(paragraph)
         print ' '.join(paragraph)
         del paragraph[:]
         para_type = word
  else:
    paragraph.append(word)
else:
  if in_para == True:
    print 'Words in last paragraph', para_type, ':', len(paragraph)
    print ' '.join(paragraph)
  else:
    print 'No words'

My text file looks like this:

P1: Bla bla bla.

P2: Bla bla bla bla.

P1: Bla bla.

P3: Bla.

The next part I need to do is summing up the words for each participant. I can only print them, but I don't know how to return/reuse them.

I would need a new variable with word count for each participant that I could manipulate later on, in addition to summing up all the words said by each participant, e.g.

P1all = sum of words in paragraph

Is there a way to count "you're" or "it's" etc. as two words?

Any ideas how to solve it?

Solution

Congrats on beginning your adventure with Python! Not everything in this post might make sense right now, but bookmark it and come back to it if it seems helpful later. Eventually you should try to move from scripting to software engineering, and here are a few ideas for you!

With great power comes great responsibility: as a Python developer you need to be more disciplined than in other languages, because Python won't hold your hand and enforce "good" design for you.

I find it helps to start with a top-down design.

def main():
    text = get_text()
    p_text = process_text(text)
    catalogue = process_catalogue(p_text)

BOOM! You just wrote the whole program -- now you just need to go back and fill in the blanks! When you do it like this, it seems less intimidating. Personally, I don't consider myself smart enough to solve very big problems, but I'm a pro at solving small problems. So let's tackle one thing at a time. I'm going to start with 'process_text'.

def process_text(text):
    b_text = bundle_dialogue_items(text)   
    f_text = filter_dialogue_items(b_text)
    c_text = clean_dialogue_items(f_text)

I'm not really sure what those things mean yet, but I know that text problems tend to follow a pattern called "map/reduce", which means you perform an operation on each piece and then clean up and combine the results, so I put in some placeholder functions. I might go back and add more if necessary.
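For instance, here is a tiny throwaway illustration of that map/reduce idea on one line of dialogue (the names here are made up for the example and are not part of the pipeline below):

from collections import Counter

words = 'Bla bla bla.'.split()                       # raw tokens
cleaned = [w.strip('.,!?').lower() for w in words]   # "map": transform each token
counts = Counter(cleaned)                            # "reduce": combine into one summary
print(counts)                                        # Counter({'bla': 3})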

Now let's write 'process_catalogue'. I could've written "process_dict" but that sounded lame to me.

def process_catalogue(p_text): 
    speakers = make_catalogue(p_text)
    s_speakers = sum_words_per_paragraph_items(speakers)
    t_speakers = total_word_count(s_speakers)

Cool. Not too bad. You might approach this differently than me, but I thought it would make sense to aggregate the items, then count the words per paragraph, and then count all the words.

So, at this point I'd probably make one or two little 'lib' (library) modules to back-fill the remaining functions. So that you can run this without worrying about imports, I'm going to stick it all in one .py file, but eventually you'll learn how to break these up so it looks nicer. So let's do this.

# ------------------ #
# == process_text == #
# ------------------ #

def bundle_dialogue_items(lines):
    cur_speaker = None
    paragraphs = Counter()
    for line in lines:
        if re.match(p, line):
            cur_speaker, dialogue = line.split(':')
            paragraphs[cur_speaker] += 1
        else:
            dialogue = line

        res = cur_speaker, dialogue, paragraphs[cur_speaker]
        yield res


def filter_dialogue_items(lines):
    for name, dialogue, paragraph in lines:
        if dialogue:
            res = name, dialogue, paragraph
            yield res

def clean_dialogue_items(flines):
    for name, dialogue, paragraph in flines:
        s_dialogue = dialogue.strip().split()
        c_dialouge = [clean_word(w) for w in s_dialogue]
        res = name, c_dialouge, paragraph
        yield res

aaaand a little helper function

# ------------------- #
# == aux functions == #
# ------------------- #

to_clean = string.whitespace + string.punctuation
def clean_word(word):
    res = ''.join(c for c in word if c not in to_clean)
    return res
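
Just to show what the helper does (a quick check, assuming the definitions above plus import string):

print(clean_word('bla.'))      # bla
print(clean_word("you're"))    # youre -- the apostrophe is punctuation, so it is stripped too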

So it may not be obvious, but this library is designed as a data processing pipeline. There are several ways to process data: one is pipeline processing and another is batch processing. Let's take a look at batch processing.
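Before that, here is a toy contrast between the two styles (purely illustrative, not part of the answer's code):

def pipeline_double(numbers):      # pipeline: yield one item at a time, lazily
    for n in numbers:
        yield n * 2

def batch_double(numbers):         # batch: build and return the whole result at once
    return [n * 2 for n in numbers]

print(list(pipeline_double([1, 2, 3])))   # [2, 4, 6]
print(batch_double([1, 2, 3]))            # [2, 4, 6]

The generator functions above (bundle/filter/clean) are the pipeline half; the catalogue functions below each take the whole speakers dict and return it, which is the batch half.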

# ----------------------- #
# == process_catalogue == #
# ----------------------- #

speaker_stats = 'stats'
def make_catalogue(names_with_dialogue):
    speakers = {}
    for name, dialogue, paragraph in names_with_dialogue:
        speaker = speakers.setdefault(name, {})
        stats = speaker.setdefault(speaker_stats, {})
        stats.setdefault(paragraph, []).extend(dialogue)
    return speakers



word_count = 'word_count'
def sum_words_per_paragraph_items(speakers):
    for speaker in speakers:
        word_stats = speakers[speaker][speaker_stats]
        speakers[speaker][word_count] = Counter()
        for paragraph in word_stats:
            speakers[speaker][word_count][paragraph] += len(word_stats[paragraph])
    return speakers


total = 'total'
def total_word_count(speakers):
    for speaker in speakers:
        wc = speakers[speaker][word_count]
        speakers[speaker][total] = 0
        for c in wc:
            speakers[speaker][total] += wc[c]
    return speakers

All these nested dictionaries are getting a little complicated. In actual production code I would replace these with some more readable classes (along with adding tests and docstrings!!), but I don't want to make this more confusing than it already is!
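To give a rough idea of what "more readable classes" could look like, here is a small sketch (hypothetical, not used anywhere in the script below):

from collections import Counter

class Speaker(object):
    def __init__(self, name):
        self.name = name
        self.paragraphs = {}          # paragraph number -> list of words
        self.word_count = Counter()   # paragraph number -> number of words
        self.total = 0

    def add_words(self, paragraph, words):
        self.paragraphs.setdefault(paragraph, []).extend(words)
        self.word_count[paragraph] += len(words)
        self.total += len(words)

Alright, for your convenience below is the whole thing put together.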

import pprint
import re
import string
from collections import Counter

p = re.compile(r'(\w+?):')


def get_text_line_items(text):
    for line in text.split('\n'):
        yield line


def bundle_dialogue_items(lines):
    cur_speaker = None
    paragraphs = Counter()
    for line in lines:
        if re.match(p, line):
            cur_speaker, dialogue = line.split(':')
            paragraphs[cur_speaker] += 1
        else:
            dialogue = line

        res = cur_speaker, dialogue, paragraphs[cur_speaker]
        yield res


def filter_dialogue_items(lines):
    for name, dialogue, paragraph in lines:
        if dialogue:
            res = name, dialogue, paragraph
            yield res


to_clean = string.whitespace + string.punctuation


def clean_word(word):
    res = ''.join(c for c in word if c not in to_clean)
    return res


def clean_dialogue_items(flines):
    for name, dialogue, paragraph in flines:
        s_dialogue = dialogue.strip().split()
        c_dialouge = [clean_word(w) for w in s_dialogue]
        res = name, c_dialouge, paragraph
        yield res


speaker_stats = 'stats'


def make_catalogue(names_with_dialogue):
    speakers = {}
    for name, dialogue, paragraph in names_with_dialogue:
        speaker = speakers.setdefault(name, {})
        stats = speaker.setdefault(speaker_stats, {})
        stats.setdefault(paragraph, []).extend(dialogue)
    return speakers


def clean_dict(speakers):
    for speaker in speakers:
        stats = speakers[speaker][speaker_stats]
        for paragraph in stats:
            stats[paragraph] = [''.join(c for c in word if c not in to_clean)
                                for word in stats[paragraph]]
    return speakers


word_count = 'word_count'


def sum_words_per_paragraph_items(speakers):
    for speaker in speakers:
        word_stats = speakers[speaker][speaker_stats]
        speakers[speaker][word_count] = Counter()
        for paragraph in word_stats:
            speakers[speaker][word_count][paragraph] += len(word_stats[paragraph])
    return speakers


total = 'total'


def total_word_count(speakers):
    for speaker in speakers:
        wc = speakers[speaker][word_count]
        speakers[speaker][total] = 0
        for c in wc:
            speakers[speaker][total] += wc[c]
    return speakers


def get_text():
    text = '''BOB: blah blah blah blah
blah hello goodbye etc.

JERRY:.............................................
...............

BOB:blah blah blah
blah blah blah
blah.
BOB: boopy doopy doop
P1: Bla bla bla.
P2: Bla bla bla bla.
P1: Bla bla.
P3: Bla.'''
    text = get_text_line_items(text)
    return text


def process_catalogue(c_text):
    speakers = make_catalogue(c_text)
    s_speakers = sum_words_per_paragraph_items(speakers)
    t_speakers = total_word_count(s_speakers)
    return t_speakers


def process_text(text):
    b_text = bundle_dialogue_items(text)
    f_text = filter_dialogue_items(b_text)
    c_text = clean_dialogue_items(f_text)
    return c_text


def main():

    text = get_text()
    c_text = process_text(text)
    t_speakers = process_catalogue(c_text)

    # take a look at your hard work!
    pprint.pprint(t_speakers)


if __name__ == '__main__':
    main()

So this script is almost certainly overkill for this application, but the point is to see what (questionably) readable, maintainable, modular Python code might look like.

Pretty sure the output looks something like this:

{'BOB': {'stats': {1: ['blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'hello',
                       'goodbye',
                       'etc'],
                   2: ['blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah'],
                   3: ['boopy', 'doopy', 'doop']},
         'total': 18,
         'word_count': Counter({1: 8, 2: 7, 3: 3})},
 'JERRY': {'stats': {1: ['', '']}, 'total': 2, 'word_count': Counter({1: 2})},
 'P1': {'stats': {1: ['Bla', 'bla', 'bla'], 2: ['Bla', 'bla']},
        'total': 5,
        'word_count': Counter({1: 3, 2: 2})},
 'P2': {'stats': {1: ['Bla', 'bla', 'bla', 'bla']},
        'total': 4,
        'word_count': Counter({1: 4})},
 'P3': {'stats': {1: ['Bla']}, 'total': 1, 'word_count': Counter({1: 1})}}
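
Note that this final dict already gives you the reusable per-participant numbers the question asked for: for example, t_speakers['P1']['total'] is 5 in the sample output above, and you could return t_speakers from main() (or call process_catalogue yourself) and keep manipulating it instead of only printing it.

As for counting "you're" or "it's" as two words: clean_word above strips the apostrophe and keeps each contraction as a single token ('youre'). One simple tweak (an assumption on my part, not something the pipeline above does) is to expand the apostrophe into a space before splitting:

def split_contraction(word):
    # "you're" -> ['you', 're'], "it's" -> ['it', 's']; ordinary words come back as one item
    return word.replace("'", ' ').split()

print(split_contraction("you're"))   # ['you', 're']
print(split_contraction("it's"))     # ['it', 's']

Applying something like this inside clean_dialogue_items, and extending the word list with the resulting pieces, would make each contraction count as two words in the totals.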
