Converting text to a data.frame based on headers
Question
I read a .txt file into R as follows: Election_Parties <- readr::read_lines("Election_Parties.txt"). Let's say the file contains the following text:
BOLIVIA
P17-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista Revolucionario
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])

COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)
In words: after every empty line, a new country starts. I would like to convert this text file into a data frame in which the country name is one column and the party is another, with one row per party.
Desired output:
Bolivia P17-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista
Bolivia P19-Liberty and Justice (Libertad y Justicia [LJ])
Bolivia P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])
Colombia P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
Colombia P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
Colombia P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)
If possible, I would like the solution to be based on the header.
I just realised that every new country starts with P1, so a solution could also be based on that.
Recommended answer
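If your separator is always an empty string (""), then once you have the text in a character vector you can use the empty lines as demarcators: TXT == "" is TRUE at every blank line, so taking cumsum of it gives a counter that increases by one at each separator and labels every line with its group.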
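TXT <- readr::read_lines("Election_Parties.txt")
# prepend a separator so that the first country also starts a group
TXT <- c("", TXT)
# running count of blank lines seen so far = group number of each line
idx <- cumsum(TXT == "")
# if the separator lines may contain stray whitespace, use this instead:
# idx <- cumsum(!grepl("^[A-Z]", TXT))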
You can see that BOLIVIA goes into group 1 and COLOMBIA into group 2:
tibble::tibble(TXT,idx)
# A tibble: 10 x 2
TXT idx
<chr> <int>
1 "" 1
2 BOLIVIA 1
3 "P17-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimie… 1
4 P19-Liberty and Justice (Libertad y Justicia [LJ]) 1
5 P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tup… 1
6 "" 2
7 COLOMBIA 2
8 P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19]) 2
9 P2-National Popular Alliance (Alianza Nacional Popular [ANAPO]) 2
10 P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colomb… 2
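To see why the running sum yields group labels, here is a minimal self-contained sketch on a made-up vector. The name `lines` and its values are illustrative only and are not taken from the file.

```r
# hypothetical toy input: "" marks the start of each block
lines <- c("", "A", "a1", "a2", "", "B", "b1")

is_sep <- lines == ""     # TRUE only at the separator lines
idx <- cumsum(is_sep)     # running count of separators seen so far
idx                       # 1 1 1 1 2 2 2

# split() then gives one character vector per block
blocks <- split(lines, idx)
blocks
```

Each element of `blocks` starts with the separator, followed by the header and its entries, which is the layout the per-group function relies on.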
Then apply a function to each group and bind the results into a single data frame:
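# within each group, x[1] is the blank separator, x[2] is the country
# header, and the remaining elements are that country's parties
func <- function(x) {
  data.frame(Country = x[2], Parties = x[3:length(x)])
}
# apply func to each group and stack the resulting data frames
do.call(rbind, by(TXT, idx, func))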
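Since the question asks for a solution based on the header, here is a hedged alternative sketch that does not depend on the blank lines. It rests on two assumptions that are not stated in the original answer: every party line starts with "P", digits, and a hyphen, and every other non-blank line is a country header. It also assumes the first non-blank line is a header. The sample lines are copied from the question (the first Bolivia entry is truncated there).

```r
TXT <- c(
  "BOLIVIA",
  "P17-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista Revolucionario",
  "P19-Liberty and Justice (Libertad y Justicia [LJ])",
  "P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])",
  "",
  "COLOMBIA",
  "P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])",
  "P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])",
  "P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)"
)

# assumption: party lines look like "P<digits>-..."
is_party <- grepl("^P[0-9]+-", TXT)
# assumption: any other non-blank line is a country header
is_header <- !is_party & trimws(TXT) != ""

# cumsum(is_header) counts the headers seen so far, so indexing the
# header lines with it assigns each line its most recent country name
country <- TXT[is_header][cumsum(is_header)]

# keep only the party lines, paired with their country
result <- data.frame(Country = country[is_party], Parties = TXT[is_party])
result
```

This avoids the per-group function and rbind, at the cost of depending on the party-line pattern holding for the whole file.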