How to Count Text Lines in R?
Question
I would like to count the number of lines spoken by different speakers in a text using R (the text is a transcript of parliamentary speeches). The basic text looks like this:
MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that.
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.
MR. JOHN: Thank you
In the documents, each speaker has an identifier that begins with MR/MS and is always capitalized. I would like to create a dataset that counts the number of lines spoken by each speaker each time they speak in a document, so that the above text would result in:
MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1
Thanks for any pointers on doing this in R!
Recommended answer
You can split each string at the pattern ":" and then use table:
# x is a character vector with one transcript line per element
table(sapply(strsplit(x, ":"), "[[", 1))
#   MR. JOHN MR. LEHMAN  MS. SMITH
#          2          1          1
strsplit - splits each string at ":" and returns a list
sapply with [[ - selects the first element of each list component
table - gets the frequencies
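The one-liner above can be unpacked step by step. The following is a minimal sketch that assumes the four sample lines from the question are stored in a character vector; the names x, parts, speakers and counts are only illustrative.

```r
# Sample lines copied from the question; x is an assumed variable name
x <- c(
  "MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.",
  "MS. SMITH: Yes, I am aware of that.",
  "MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.",
  "MR. JOHN: Thank you"
)

# strsplit() returns a list with one character vector per input string
parts <- strsplit(x, ":")

# "[[" with index 1 extracts the text before the first colon
speakers <- sapply(parts, "[[", 1)

# table() tallies how often each speaker label occurs
counts <- table(speakers)
print(counts)
```

Note that this counts how many times each speaker label appears, so it relies on every element of x starting with a speaker label.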
Edit: Following OP's comment, you can save the transcripts in a text file and use readLines to read the text into R.
tt <- readLines("./tmp.txt")
Now we have to find a pattern that filters this text down to just the lines that carry a speaker's name. I can think of two approaches based on what I saw in the transcript you linked.
- Check for a ":" and then use a lookbehind on the ":" to see whether the character before it is any of A-Z or [:punct:] (that is, a capital letter or a punctuation mark; this is needed because some of the names have a ")" before the ":").
You can use strsplit followed by sapply (as shown below).
Using strsplit:
# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:
out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
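To see the filter and the count working together, here is a small self-contained sketch. The transcript lines are made up for illustration, and a temporary file stands in for ./tmp.txt; only the pattern and the strsplit/sapply/table calls come from the answer itself.

```r
# Hypothetical transcript containing speaker lines and other lines
transcript <- c(
  "The House met at 2.30 p.m.",
  "MR. JOHN: This activity has been going on in Tororo.",
  "He told me that he was not aware of it.",
  "MS. SMITH (Minister): Yes, I am aware of that.",
  "MR. JOHN: Thank you"
)

# Write to a temporary file so that readLines() has something to read
path <- tempfile(fileext = ".txt")
writeLines(transcript, path)
tt <- readLines(path)

# Keep lines where a colon directly follows a capital letter or punctuation
keep <- grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)
tt.f <- tt[keep]

# Take the text before the first colon as the speaker label and tabulate
out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
print(out)
```

The lookbehind needs perl = TRUE, and the lines without a speaker label are dropped before counting.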
Other approaches are possible (using gsub, for example), as are alternative patterns, but this should give you an idea of the approach. If the pattern needs to differ, just change it so that it captures all the required lines.
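As one possible reading of the gsub suggestion, the speaker label could be extracted by deleting everything from the first colon onwards instead of splitting. This is only a sketch: it uses sub (the single-replacement sibling of gsub), and the tt.f vector here is a made-up stand-in for the filtered lines.

```r
# Hypothetical filtered speaker lines (a stand-in for tt.f)
tt.f <- c(
  "MR. JOHN: Thank you",
  "MS. SMITH: Yes, I am aware of that.",
  "MR. JOHN: Noted: thank you"
)

# Remove the first colon and everything after it, leaving the speaker label
speakers <- sub(":.*$", "", tt.f)

# Tabulate the labels as before
out <- table(speakers)
print(out)
```

Because the match starts at the first colon in each line, a later colon inside the speech does not affect the extracted label.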
Of course, this assumes that the text contains no other line like this one, for example:
"Mr. Chairman, whatever (bla bla): It is not a problem"
That is because our pattern will return TRUE for the "):" in it. If lines like this occur in the text, you'll have to find a better pattern.
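One way such a stricter pattern might look is sketched below. It rests on assumptions that are not confirmed by the transcript: speaker lines start with MR., MRS. or MS. in upper case, followed by an upper-case name and optionally a parenthesised note before the colon. The name speaker_pattern and the sample lines are hypothetical.

```r
# Hypothetical lines, including one like the problematic example
tt <- c(
  "MR. JOHN: Thank you",
  "MS. SMITH (Minister): Yes, I am aware of that.",
  "Mr. Chairman, whatever (bla bla): It is not a problem"
)

# Assumed speaker format: MR./MRS./MS. at the start of the line, an
# upper-case name, an optional parenthesised note, then a colon
speaker_pattern <- "^(MR|MRS|MS)\\. [A-Z][A-Z .'-]*(\\([^)]*\\))?:"

# Anchoring at the start of the line rejects colons that occur mid-sentence
is_speaker <- grepl(speaker_pattern, tt, perl = TRUE)
tt.f <- tt[is_speaker]
print(tt.f)
```

The third line is rejected because it does not begin with an upper-case title, even though it contains "):". The pattern would need adjusting if the real transcripts use other titles or name formats.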