R: Dynamic split/subset of a data frame by selected row numbers - Parsing a Praat TextGrid


Problem description


I am trying to process a "segmentation file" called a .TextGrid (generated by the Praat program).

The original format looks like this:

File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0 
xmax = 243.761375 
tiers? <exists> 
size = 17 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "phones" 
        xmin = 0 
        xmax = 243.761 
        intervals: size = 2505 
        intervals [1]:
            xmin = 0 
            xmax = 0.4274939687384032 
            text = "_" 
        intervals [2]:
            xmin = 0.4274939687384032 
            xmax = 0.472 
            text = "v" 
        intervals [3]:
[...]

(This is then repeated to EOF, with intervals [3 to n] for each of the n items (layers of annotation) in a file.)

Somebody proposed a solution using the rPython R package.

Unfortunately:

  • I don't have a good knowledge of Python.
  • rPython is not available for R 3.0.2 (which I am using).
  • My aim is to develop this parser for my analysis exclusively in the R environment.

Right now my aim is to segment this file into multiple data frames. Each data frame should contain one item (layer of annotation).

# str_trim() comes from the stringr package
library(stringr)
# Load the data
txtgrid <- read.delim("./xxx_01_xx.textgrid", sep=c("=","\n"), dec=".", header=FALSE)
# Remove leading and trailing white space
txtgrid[,1] <- str_trim(txtgrid[,1])
# Convert row.names to numeric
num.row <- as.numeric(row.names(txtgrid))
# Rebuild the original textgrid with the row numbers added as a column
# (I want to keep them for later processing)
txtgrid <- data.frame(num.row, txtgrid)
colnames(txtgrid) <- c("num.row", "object", "value")
head(txtgrid)

The output of head(txtgrid) is very raw, so here are the first 20 lines of the textgrid, txtgrid[1:20,]:

   num.row          object                value
1        1       File type           ooTextFile
2        2    Object class             TextGrid
3        3            xmin                   0 
4        4            xmax          243.761375 
5        5 tiers? <exists>                     
6        6            size                  17 
7        7        item []:                     
8        8       item [1]:                     
9        9           class        IntervalTier 
10      10            name              phones 
11      11            xmin                   0 
12      12            xmax             243.761 
13      13 intervals: size                2505 
14      14  intervals [1]:                     
15      15            xmin                   0 
16      16            xmax  0.4274939687384032 
17      17            text                   _ 
18      18  intervals [2]:                     
19      19            xmin  0.4274939687384032 
20      20            xmax               0.472 
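The same two-column layout can probably also be reached with base R alone, by reading the file as plain lines and splitting each line at its first "=". The following is only a sketch under that assumption: the helper name parse_textgrid_lines and the sample lines are hypothetical, and for a real file the lines would come from readLines() instead of being typed in.

```r
# Hypothetical helper: turn raw TextGrid lines into an object/value data frame.
parse_textgrid_lines <- function(lines) {
  lines <- trimws(lines)                    # drop indentation and trailing blanks
  lines <- lines[nzchar(lines)]             # drop empty lines
  eq <- regexpr("=", lines, fixed = TRUE)   # position of the first "=" (-1 if none)
  has_eq <- eq > 0
  object <- ifelse(has_eq, substr(lines, 1, eq - 1), lines)
  value  <- ifelse(has_eq, substr(lines, eq + 1, nchar(lines)), "")
  data.frame(
    num.row = seq_along(lines),
    object  = trimws(object),
    value   = gsub('"', "", trimws(value), fixed = TRUE),  # strip double quotes
    stringsAsFactors = FALSE
  )
}

# Made-up sample lines; replace with readLines("path/to/file.TextGrid").
sample_lines <- c(
  'File type = "ooTextFile"',
  'Object class = "TextGrid"',
  "",
  "xmin = 0 ",
  "tiers? <exists> ",
  "item []: ",
  "    item [1]:",
  '        name = "phones" '
)
txtgrid_demo <- parse_textgrid_lines(sample_lines)
print(txtgrid_demo)
```

Because empty lines are dropped, num.row here counts the remaining lines and will not necessarily match the row numbers shown above. Splitting at the first "=" only keeps a later "=" inside a text label intact.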

Now that I have pre-processed it, I can:

# Find the rows where I want to split (i.e. the item headers)
tier.begining <- txtgrid[grep("item", txtgrid$object, perl=TRUE), ]
# And save their row numbers in a variable
x <- as.numeric(row.names(tier.begining))

This variable x gives me the row numbers where each item starts, so my data should be split into several data frames just before each of them (at row number - 1).

I have 18 items - 1 (the first item is item [] and includes all the other items). So vector x is:

x
 [1]     7     8 10034 14624 19214 22444 25674 28904 31910 35140 38146 38156 38566 39040 39778 40222 44800
[18] 45018

How can I tell R to segment this data frame into multiple data frames textgrids$nameoftheItem, in such a way that I get as many data frames as I have items? For example:

textgrid$phones
         item [1]:
            class = "IntervalTier" 
            name = "phones" 
            xmin = 0 
            xmax = 243.761 
            intervals: size = 2505 
            intervals [1]:
            xmin = 0 
            xmax = 0.4274939687384032 
            text = "_" 
            intervals [2]:
            xmin = 0.4274939687384032 
            xmax = 0.472 
            text = "v" 
            [...]
            intervals [n]:
textgrid$syllable
    item [2]:
            class = "IntervalTier" 
            name = "syllable" 
            xmin = 0 
            xmax = 243.761 
            intervals: size = 1200
            intervals [1]:
            xmin = 0 
            xmax = 0.500
            text = "ve" 
            intervals [2]:
            [...]
            intervals [n]:
    textgrid$item[n]


I wanted to use

txtgrid.new <- split(txtgrid, f=x)

But I get this warning:

Warning message: In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : data length is not a multiple of split variable

I don't get the desired output; it seems that the row numbers don't follow each other and that the file is all mixed up.

I have also tried which, daply (from plyr) and subset, but never got them to work properly!

I welcome any ideas for structuring this data properly and efficiently. Ideally I should be able to link items (layers of annotation) to each other (via the xmin and xmax of different layers), as well as across multiple textgrid files; this is just the beginning.

Solution

The length of the split vector should be equal to the number of rows in the data.frame.

Try the following:

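# Drop everything up to and including the first "item" row (the "item []:" container)
txtgrid.sub <- txtgrid[-(1:grep("item", txtgrid$object)[1]), ]

# Positions of the tier headers after the first one (printed for inspection only)
grep("item", txtgrid.sub$object)[-1]

# Repeat each tier index once per row of that tier,
# so that length(splits) == nrow(txtgrid.sub)
splits <- unlist(mapply(rep, seq_along(grep("item", txtgrid.sub$object)),
                        diff(c(grep("item", txtgrid.sub$object), 
                               nrow(txtgrid.sub) + 1))))

# One data frame per tier
df.list <- split(txtgrid.sub, list(splits))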
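A grouping vector of the right length can also be built with cumsum(), which increments a counter at every row that opens a new tier. The list elements can then be named after each tier's name row, which gives the textgrid$phones style of access asked for in the question. This is only a sketch on made-up data, assuming tier headers match the pattern ^item \[[0-9]+\]; the names toy and tiers are hypothetical.

```r
# Toy data in the same object/value layout as txtgrid (values are made up).
toy <- data.frame(
  object = c("size", "item []:", "item [1]:", "class", "name", "xmin",
             "item [2]:", "class", "name", "xmin"),
  value  = c("2", "", "", "IntervalTier", "phones", "0",
             "", "IntervalTier", "syllable", "0"),
  stringsAsFactors = FALSE
)

# TRUE on rows that open a numbered tier; "item []:" does not match.
is_start <- grepl("^item \\[[0-9]+\\]", toy$object)

# Running count of tier headers: 0 before the first tier, then 1, 2, ...
tier_id <- cumsum(is_start)

# Drop the file header (tier_id == 0) and split; the factor has one entry per row.
keep  <- tier_id > 0
tiers <- split(toy[keep, ], tier_id[keep])

# Name each list element after the value of its "name" row.
names(tiers) <- unname(vapply(
  tiers,
  function(d) d$value[d$object == "name"][1],
  character(1)
))

print(names(tiers))
print(tiers$syllable)
```

The anchored pattern is a deliberate choice: a plain grep("item", ...) would also match the "item []:" container row and any label that happens to contain the word "item".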


EDIT:

You could then simplify the data by doing something like this:

l <- lapply(df.list, function(x) {
  tmp <- as.data.frame(t(x[, 3, drop=FALSE]), stringsAsFactors=FALSE)
  names(tmp) <- make.unique(make.names(x[, 2]))
  tmp
})

library(plyr)
do.call(rbind.fill, l)


  item..1..        class     name xmin    xmax intervals..size
1      <NA> IntervalTier   phones    0 243.761            2505
2      <NA> IntervalTier syllable    0 243.761            2505
  intervals..1.. xmin.1             xmax.1 text intervals..2..
1           <NA>      0 0.4274939687384032    _           <NA>
2           <NA>      0 0.4274939687384032    _           <NA>
              xmin.2 xmax.2
1 0.4274939687384032  0.472
2               <NA>   <NA>

NB: I've used dummy data for the above.
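For the stated goal of linking tiers through their xmin and xmax values, a long format with one row per interval may be easier to work with than one wide row per tier. The sketch below assumes that every interval consists of an "intervals [k]:" header followed by xmin, xmax and text rows in exactly that order; the helper name tier_to_intervals and the data are hypothetical.

```r
# Hypothetical helper: one tier (object/value rows) -> one row per interval.
tier_to_intervals <- function(tier) {
  # Header rows such as "intervals [1]:"; "intervals: size" does not match.
  starts <- grep("^intervals \\[[0-9]+\\]", tier$object)
  # Value found a fixed number of rows below each interval header.
  pick <- function(offset) tier$value[starts + offset]
  data.frame(
    tier = tier$value[tier$object == "name"][1],
    xmin = as.numeric(pick(1)),
    xmax = as.numeric(pick(2)),
    text = pick(3),
    stringsAsFactors = FALSE
  )
}

# Made-up tier with two intervals, in the same layout as the question's data.
phones <- data.frame(
  object = c("item [1]:", "class", "name", "xmin", "xmax", "intervals: size",
             "intervals [1]:", "xmin", "xmax", "text",
             "intervals [2]:", "xmin", "xmax", "text"),
  value  = c("", "IntervalTier", "phones", "0", "0.472", "2",
             "", "0", "0.4275", "_",
             "", "0.4275", "0.472", "v"),
  stringsAsFactors = FALSE
)

intervals <- tier_to_intervals(phones)
print(intervals)
```

Applied to every element of df.list and stacked with do.call(rbind, ...), this would give a single table in which intervals from different tiers can be compared on xmin and xmax.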
