How to Count Text Lines in R?

Views: 74 | Published: 2020/10/10 19:51:52 | Tags: r, text, count


Problem description

I would like to use R to count the number of lines spoken by different speakers in a text (a transcript of parliamentary speaking records). The basic text looks like this:

MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that. 
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.  
MR. JOHN: Thank you

In the documents, each speaker has an identifier that begins with MR/MS and is always capitalized. I would like to create a dataset that counts the number of lines each speaker speaks, separately for each time they speak in a document, so that the text above would result in:

MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1

Thanks for any pointers on doing this in R!

Recommended answer

You can split each string at the pattern : and then use table:

# x is a character vector holding one transcript line per element
table(sapply(strsplit(x, ":"), "[[", 1))
#   MR. JOHN MR. LEHMAN  MS. SMITH 
#          2          1          1

strsplit - splits each string at : and returns a list
sapply with [[ - selects the first element of each list entry (the speaker label)
table - counts how often each label occurs
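The three steps above can be sketched on the sample text from the question. This is a minimal illustration, assuming the transcript is already held in a character vector with one line per element; the names parts, speakers and counts are chosen here for readability and are not part of the original answer.

```r
# Sample lines adapted from the question; one string per transcript line.
x <- c(
  "MR. JOHN: This activity has been going on in Tororo.",
  "MS. SMITH: Yes, I am aware of that.",
  "MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker.",
  "MR. JOHN: Thank you"
)

# Step 1: split each line at ":"; the result is a list with one
# character vector per input line.
parts <- strsplit(x, ":")

# Step 2: keep the first piece of each split, i.e. the speaker label.
speakers <- sapply(parts, "[[", 1)

# Step 3: count how often each speaker label occurs.
counts <- table(speakers)
print(counts)
```

Note that this counts how many lines start with each label; it does not distinguish separate speaking turns by the same person.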

Edit: Following OP's comment. You can save the transcripts in a text file and use readLines to read the text into R.

tt <- readLines("./tmp.txt")
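To try this without a real transcript file, the following sketch writes two sample lines to a temporary file and reads them back. The temporary file and the variable tmp are assumptions made for the illustration; in practice you would pass the path of your own transcript to readLines.

```r
# Create a throwaway text file containing two transcript lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c("MR. JOHN: Thank you",
             "MS. SMITH: Yes, I am aware of that."), tmp)

# readLines returns a character vector with one element per line of the file.
tt <- readLines(tmp)
print(length(tt))

# Remove the temporary file again.
unlink(tmp)
```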

Now we have to find a pattern that filters this text down to just the lines containing the names of the speakers. I can think of two approaches based on what I saw in the transcript you linked.

Check for a : and then look behind it to see whether the preceding character is any of A-Z or [:punct:] (that is, a capital letter or a punctuation mark; punctuation is included because some speaker labels have a ) before the :).

You can use strsplit followed by sapply (as shown below).

Using strsplit:

# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:

out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
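As a rough check of how the filter behaves, here is a small self-contained sketch. The lines in tt are invented for illustration (they are not from the linked transcript) and are chosen to show which lines the lookbehind pattern keeps and which it drops.

```r
# Hypothetical transcript lines, made up for this example.
tt <- c(
  "THE PARLIAMENT MET AT 2 P.M.",                          # no colon: dropped
  "MR. JOHN: Thank you",                                   # ":" after "N": kept
  "and this line continues his speech",                    # no colon: dropped
  "MS. SMITH (Member, Example): Yes, I am aware of that.", # ":" after ")": kept
  "The time now is 10:30"                                  # ":" after a digit: dropped
)

# Keep only lines where ":" is directly preceded by a capital letter
# or a punctuation character (PCRE lookbehind, hence perl = TRUE).
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]

# Extract the text before the first ":" and tabulate it.
out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
print(out)
```

Any parenthetical that belongs to the label, such as "(Member, Example)" here, stays part of the extracted name, so the same speaker could appear under more than one label.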

Other approaches are possible (using gsub, for example), as are alternate patterns, but this should give you an idea of the approach. If the pattern needs to differ, just change it so that it captures all the required lines.

Of course, this assumes that there is no other line like, for example, this one:

"Mr. Chairman, whatever (bla bla): It is not a problem"

because our pattern returns TRUE for the ): in that line. If such lines occur in the text, you'll have to find a better pattern.
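One possible stricter pattern, offered only as a sketch: anchor the match to the start of the line and require the capitalized title described in the question. It assumes that every speaker label starts the line with MR., MS. or MRS. (MRS. is a guess, not something the question states), followed by an upper-case name and an optional parenthetical before the colon. The name speaker_pattern is hypothetical and the pattern would need adjusting to the real transcripts.

```r
# Hypothetical stricter pattern: line must START with a capitalized title,
# then an upper-case name, then an optional " (...)" group, then ":".
speaker_pattern <- "^(MR|MS|MRS)\\. [A-Z .'-]+( \\([^)]*\\))?:"

lines <- c(
  "MR. JOHN: Thank you",
  "MS. SMITH (Member, Example): Yes, I am aware of that.",
  "Mr. Chairman, whatever (bla bla): It is not a problem"
)

# perl = TRUE so that [A-Z] is treated as the plain ASCII upper-case range.
is_speaker <- grepl(speaker_pattern, lines, perl = TRUE)
print(is_speaker)
```

With this pattern the third line is no longer treated as a speaker line, because it does not begin with an upper-case MR./MS./MRS. title.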
