Parsing Interview Text


Problem description

I have a text file of a presidential debate. Eventually, I want to parse the text into a data frame where each row is a statement, with one column holding the speaker's name and another column holding the statement. For example:

"Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

would become:

   name          text
1   Bob Smith    Hi Steve. How are you doing?
2 Steve Brown    Hi Bob. I'm doing well!

Question: How do I split the statements from the names? I tried splitting on the colon:

data <- strsplit(data, split=":")

But then I get:

"Bob Smith" "Hi Steve. How are you doing? Steve Brown" "Hi Bob. I'm doing well!"

when what I want is this:

"Bob Smith" "Hi Steve. How are you doing?" "Steve Brown" "Hi Bob. I'm doing well!"

Recommended answer

I doubt this will cover all of your parsing needs, but one way to solve your most immediate question with strsplit is to use lookaround. You'll need to use Perl-compatible regex (perl=TRUE) for that, though.

Here you instruct strsplit to split on either a colon followed by a space, or a space that has a punctuation character immediately before it and nothing but word characters or spaces between it and the next :. The first alternative separates a name from its statement; the second separates the end of one statement from the next speaker's name. \\pP matches punctuation characters and \\w matches word characters (letters, digits and underscore).

data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"
# Split on ": ", or on a space that follows punctuation and precedes "Name:"
strsplit(data, split = "(: |(?<=\\pP) (?=[\\w ]+:))", perl = TRUE)
# [[1]]
# [1] "Bob Smith"                    "Hi Steve. How are you doing?" "Steve Brown"
# [4] "Hi Bob. I'm doing well!"
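To get from that vector to the data frame described in the question, one possible follow-up is sketched below. It assumes the split always yields strictly alternating name/statement pieces, which holds for the sample string but may not hold for a real transcript; the variable names parts and debate are made up for illustration.

```r
data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

# Same lookaround split as above; [[1]] pulls the character vector out of the list.
parts <- strsplit(data, split = "(: |(?<=\\pP) (?=[\\w ]+:))", perl = TRUE)[[1]]

# Assumption: pieces alternate name, statement, name, statement, ...
# Stop early if the count is odd, since the pairing below would be wrong.
stopifnot(length(parts) %% 2 == 0)

# Logical index vectors are recycled: c(TRUE, FALSE) keeps odd positions (names),
# c(FALSE, TRUE) keeps even positions (statements).
debate <- data.frame(
  name = parts[c(TRUE, FALSE)],
  text = parts[c(FALSE, TRUE)],
  stringsAsFactors = FALSE
)

print(debate)
```

If a statement itself contains a colon (for example a time such as "at 9: 30"), the pieces will no longer alternate cleanly, so it is worth checking the result against the source text.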
