Splitting and grouping plain text (grouping text by chapter in a data frame)?

Question

I have a data frame/tibble into which I've imported a plain text (txt) file. The text is very consistent and is grouped by chapter. Sometimes the chapter text is only one row, sometimes it's multiple rows. The data is in one column, like this:

# A tibble: 10,708 x 1
   x                                                                     
   <chr>                                                                                                                                   
 1 "Chapter 1 "                                                          
 2 "Chapter text. "     
 3 "Chapter 2 "                                                          
 4 "Chapter text. "    
 5 "Chapter 3 "
 6 "Chapter text. "
 7 "Chapter text. "
 8 "Chapter 4 "   

I'm trying to clean the data to have a new column for Chapter and the text from each chapter in another column, like this:

# A tibble: 10,548 x 2
   x                                Chapter   
   <chr>                             <chr>
 1 "Chapter text. "               "Chapter 1 "
 2 "Chapter text. "               "Chapter 2 "
 3 "Chapter text. "               "Chapter 3 " 
 4 "Chapter text. "               "Chapter 4 " 

I've been trying to use regex to split and group the data at each occurrence of 'Chapter #' (the word "Chapter" followed by a number), but I cannot get the result I want. Any advice is much appreciated.

Recommended answer

Since the chapter text is sometimes one row and sometimes several, I am assuming the text in rows 6 and 7 belongs to chapter 3 and that there is no text for chapter 4 in your test data (your desired output is probably slightly off).

Here's a way using dplyr and tidyr. Just run it piece-by-piece and you'll see how the data gets transformed.

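library(dplyr)
library(tidyr)

df %>% 
  mutate(
    # heading rows end in a digit plus one trailing character (the space in "Chapter 1 ");
    # the running count of headings gives every row the id of its chapter
    id = cumsum(grepl("[0-9].$", x)),
    # append ":" to heading rows to mark where the heading ends
    x = ifelse(grepl("[0-9].$", x), paste0(x, ":"), x)
  ) %>% 
  group_by(id) %>% 
  summarize(
    # glue each chapter's heading and text rows into one string
    chapter = paste0(x, collapse = "")
  ) %>% 
  # split at the first ":" only; extra = "merge" keeps any later colons in the text
  separate(chapter, into = c("chapter", "text"), sep = ":", extra = "merge")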

# A tibble: 4 x 3
     id chapter      text                          
  <int> <chr>        <chr>                         
1     1 "Chapter 1 " "Chapter text. "              
2     2 "Chapter 2 " "Chapter text. "              
3     3 "Chapter 3 " "Chapter text. Chapter text. "
4     4 "Chapter 4 " ""     
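
The pattern "[0-9].$" treats any row that ends in a digit plus one more character as a heading, so a text row ending in something like "1999." would also start a new chapter. As a hedged variation (not part of the original answer), the sketch below assumes that every heading row consists only of the word "Chapter", a number, and optional trailing whitespace. The stricter pattern and the helper column is_heading are assumptions introduced for illustration. It also skips the ":" marker by picking the heading and the text rows directly inside summarize().

```r
library(dplyr)

# Same toy data as in the answer, rebuilt here so the sketch is self-contained
df <- data.frame(
  x = c("Chapter 1 ", "Chapter text. ", "Chapter 2 ", "Chapter text. ",
        "Chapter 3 ", "Chapter text. ", "Chapter text. ", "Chapter 4 "),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(
    # assumption: a heading is exactly "Chapter <number>" plus optional whitespace
    is_heading = grepl("^Chapter [0-9]+[[:space:]]*$", x),
    # running count of headings assigns each row to its chapter
    id = cumsum(is_heading)
  ) %>%
  group_by(id) %>%
  summarize(
    # the heading row of the group (NA if rows appear before the first heading)
    chapter = x[is_heading][1],
    # all non-heading rows of the group, glued together ("" if there are none)
    text = paste0(x[!is_heading], collapse = "")
  )

print(result)
```

On the toy data this should give the same four rows as the output above. With real text, check the heading pattern against a few actual heading rows first, since the true format may differ from this assumption.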

Data:

df <- structure(list(x = c("Chapter 1 ", "Chapter text. ", "Chapter 2 ", 
"Chapter text. ", "Chapter 3 ", "Chapter text. ", "Chapter text. ", 
"Chapter 4 ")), .Names = "x", class = "data.frame", row.names = c(NA, 
-8L))
