How to Count Text Lines in R?

Question

I would like to use R to count the number of lines spoken by each speaker in a text (a transcript of parliamentary speaking records). The basic text looks like this:

MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that. 
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.  
MR. JOHN: Thank you

In the documents, each speaker has an identifier that begins with MR/MS and is always capitalized. I would like to create a dataset that counts the number of lines spoken by each speaker each time they speak in a document, such that the above text would result in:

MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1

Thanks for any pointers on doing this in R!

Recommended answer

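You can split each string at the : character and then use table on the first piece, which counts how often each speaker label occurs (here x is a character vector holding the speeches):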

table(sapply(strsplit(x, ":"), "[[", 1))
#   MR. JOHN MR. LEHMAN  MS. SMITH 
#          2          1          1 




strsplit - splits each string at : and returns a list of character vectors
sapply with [[ - extracts the first element (the speaker label) from each list component
table - tabulates the frequency of each label
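
The three steps above can be sketched end to end as follows. This is only an illustration: x is assumed to be a character vector with one speech per element, filled here with the sample text from the question, and the intermediate names parts, speakers and counts are introduced just for this example.

```r
# Sample input: one speech per element (text taken from the question)
x <- c(
  "MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.",
  "MS. SMITH: Yes, I am aware of that.",
  "MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.",
  "MR. JOHN: Thank you"
)

parts <- strsplit(x, ":")            # list with one character vector per speech
speakers <- sapply(parts, "[[", 1)   # first piece of each vector, i.e. the speaker label
counts <- table(speakers)            # how often each label occurs
print(counts)
```

Even if a speech itself contains a colon, the label is still recovered, because only the first piece of each split is used.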

Edit: Following OP's comment, you can save the transcript in a text file and read it into R with readLines.

tt <- readLines("./tmp.txt")

Now we need a pattern that filters this text down to just the lines carrying the name of the person speaking. I can think of two approaches based on what I saw in the transcript you linked.


  • Check for a : and then use a lookbehind to test whether the character just before the : is one of A-Z or [:punct:] (that is, a capital letter or a punctuation mark; punctuation is included because some speaker labels end with a ) before the :).

You can use strsplit followed by sapply (as shown below)

Using strsplit:

# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:

out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
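
To see what the lookbehind filter keeps and drops, here is a small check on made-up input. The vector sample_lines and its contents (including the "(Minister)" label) are hypothetical and only stand in for lines returned by readLines; the pattern itself is the one used above.

```r
# Hypothetical lines standing in for the output of readLines()
sample_lines <- c(
  "MR. JOHN: Thank you",                              # capital letter before the colon
  "and I took it up with the office of the DPC.",     # continuation line, no colon
  "MS. SMITH (Minister): Yes, I am aware of that.",   # closing parenthesis before the colon
  "The time is 10:30 now."                            # digit before the colon
)

# TRUE where a colon is directly preceded by A-Z or a punctuation character
keep <- grepl("(?<=[A-Z[:punct:]]):", sample_lines, perl = TRUE)
print(keep)

# Speaker labels of the lines that passed the filter
kept_speakers <- sapply(strsplit(sample_lines[keep], ":"), "[[", 1)
print(table(kept_speakers))
```

Lines one and three pass the filter; the continuation line and the line containing a time do not.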

Other approaches are possible (using gsub, for example), as are alternative patterns, but this should give you an idea of the approach. If the pattern needs to differ for your data, just change it so that it captures all the required lines.

Of course, this assumes that there is no other line, for example, like this:

"Mr. Chairman, whatever (bla bla): It is not a problem"

Because our pattern will return TRUE for the ): in that line. If this occurs in your text, you'll have to find a more specific pattern.
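
One possible tightening, offered only as a sketch: if speaker labels always open a line with MR., MS. or MRS. in capitals (as the question states), the pattern can be anchored at the start of the line. The same test then marks where each turn begins, which makes it possible to count lines per turn, as the question asked. The sample lines, the pattern strict_pat and all variable names below are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical physical lines, as readLines() might return them
tt <- c(
  "MR. JOHN: This activity has been going on in Tororo and I took it up",
  "with the office of the DPC. He told me that he was not aware of it.",
  "MS. SMITH: Yes, I am aware of that.",
  "Mr. Chairman, whatever (bla bla): It is not a problem",
  "MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker,",
  "and requesting that you re-assign the duty.",
  "MR. JOHN: Thank you"
)

# The lookbehind pattern accepts the "Mr. Chairman ...):" line as well
loose <- grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)

# Assumed stricter pattern: title at line start, capitalized name,
# optional parenthesized remark, then the colon
strict_pat <- "^(MR|MS|MRS)\\. [A-Z][A-Z .'-]*( \\([^)]*\\))?:"
is_start <- grepl(strict_pat, tt, perl = TRUE)

# Every speaker line opens a new turn; other lines continue the current one
turn <- cumsum(is_start)

# Speaker label of each turn: everything before the first colon
speaker <- sub(":.*$", "", tt[is_start])

# Lines per turn (lines before the first speaker, turn 0, are ignored)
n_lines <- tabulate(turn[turn > 0], nbins = length(speaker))

result <- data.frame(speaker = speaker, lines = n_lines, stringsAsFactors = FALSE)
print(result)
```

With this input the "Mr. Chairman" line is no longer treated as a speaker label; it is counted as part of MS. SMITH's turn. The counts depend on how the transcript file is wrapped into physical lines, so the pattern and the definition of a "line" may need adjusting for the real data.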
