Loop files and their contents in R

Views: 58 | Published: 2020/5/4 4:42:15 | Tags: r, loops


Problem description

Following up on a question I posted only minutes ago, I need to ask another one. The previous question failed to mention that I also have to look through the contents of each individual file. In other words, I have to loop through all files in a directory, and through each line of each file.

Every file name looks like this:

airbag.WS-U-E-A.lst

The . is a separator and .lst is the extension (readable as text).

Each file contains one entry per line, such as:

/home/nobackup/SONAR/COMPACT/WR-U-E-A/WR-U-E-A0000075.data.ids.xml:  <sentence>ja voor den airbag op te pompen eh :p</sentence>
/home/nobackup/SONAR/COMPACT/WR-U-E-A/WR-U-E-A0000129.data.ids.xml:  <sentence>Dobby , als ze valt heeft ze dan wel al ne airbag hee</sentence>

What I want to do is create a new dataset in R that contains the data from all files. Ideally, it would look like this:

ID | filename             | word | component | left-context                               | right-context
----------------------------------------------------------------------------------------------------------------
1    airbag.WS-U-E-A.lst   airbag   WS-U-E-A    ja voor den                                  op te pompen eh :p
2    airbag.WS-U-E-A.lst   airbag   WS-U-E-A    Dobby , als ze valt heeft ze dan wel al ne   hee

ID is simply the row's ID, which can be done like so:

# Number the rows 1..n; d is the data frame built from the file names
d$ID <- seq_len(nrow(d))

filename is the name of the file (obviously), which I can get like so:

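# pattern is a regular expression, not a glob, so escape the dot and anchor the extension
files <- list.files(pattern = "\\.lst$", full.names = TRUE, recursive = FALSE)
d <- data.frame(fileName = basename(files), stringsAsFactors = FALSE)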

I can then extract the word and the component from the file name:

# word: everything before the first dot; component: between the first dot and .lst
d$word <- sub("\\..+$", "", d$fileName)
d$component <- sub("^[^.]+\\.", "", d$fileName)
d$component <- sub("\\.lst$", "", d$component)

Now comes the hard part that I haven't figured out yet...

All the commands above need nothing more than a loop over the files and their file names. However, as I said, each file contains multiple sentences that I need to dissect and put on separate rows. See the example above: the filename, the word and the component are identical, yet the left and right contexts aren't. That's because they are two different sentences in the same file.

Maybe an example with two files makes my question clearer.

adapter.WR-P-P-F.lst

/home/nobackup/SONAR/COMPACT/WR-P-P-F/WR-P-P-F0000026.data.ids.xml:  <sentence>Een aanpassingseenheid ( adapter ) , aangebracht in een behuizing voornamelijk bestaande uit in- en uitvoereenheden , een koppeleenheid , een geheugeneenheid , een besturingseenheid ( met actieve en passieve elementen en monolitische geïntegreerde schakelingen ) en een elektrische voedingseenheid . &gt;</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-P-F/WR-P-P-F0000026.data.ids.xml:  <sentence>ID=&quot;1&quot;&gt;Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens , met een snelheid van 10 Mbps ( megabits per seconde ) , tussen meerdere automatische gegevensverwerkende machines in een digitaal netwerk . &quot; &gt;</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-P-F/WR-P-P-F0000034.data.ids.xml:  <sentence>Overwegende dat deze sensoren niet zijn ontworpen op de installatie van een gepantserde kabel ; dat de mogelijkheid moet worden geboden dat de gepantserde kabel niet verplicht wordt gesteld voor de aansluiting tussen de sensor en de adapter , maar alleen van de adapter naar het controleapparaat ; dat het bijgevolg noodzakelijk is de verordening dienovereenkomstig te wijzigen ;</sentence>

airbag.WS-U-E-A.lst

/home/nobackup/SONAR/COMPACT/WR-U-E-A/WR-U-E-A0000075.data.ids.xml:  <sentence>ja voor den airbag op te pompen eh :p</sentence>
/home/nobackup/SONAR/COMPACT/WR-U-E-A/WR-U-E-A0000129.data.ids.xml:  <sentence>Dobby , als ze valt heeft ze dan wel al ne airbag hee</sentence>

If those were the only two files in my directory, my R commands would do the following things:

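Go through each individual file
Put each sentence (i.e. each line) on a new row
Fill in the filename, word and component based on the file the sentence is in
Get the left and right context from the sentence with a regular expression
Assign an ID to each row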

The output would look like this:

ID | filename             | word | component | left-context                               | right-context
----------------------------------------------------------------------------------------------------------------
1    adapter.WR-P-P-F.lst  adapter  WR-P-P-F    Een aanpassingseenheid (                     ) , aangebracht in een behuizing voornamelijk bestaande uit in- en uitvoere[...]
2    adapter.WR-P-P-F.lst  adapter  WR-P-P-F    ID=&quot;1&quot;&gt;Het toestel (            ) draagt zorg voor de overbrenging van gegevens [...]
3    adapter.WR-P-P-F.lst  adapter  WR-P-P-F    [...] tussen de sensor en de                 naar het controleapparaat ; [...]
4    airbag.WS-U-E-A.lst   airbag   WS-U-E-A    ja voor den                                  op te pompen eh :p
5    airbag.WS-U-E-A.lst   airbag   WS-U-E-A    Dobby , als ze valt heeft ze dan wel al ne   hee

(I left out some content for brevity's sake, denoted by [...])
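The regular-expression step in the list above could be sketched roughly as follows. This is only one possible approach, not a confirmed solution: split_context is a hypothetical helper name, it assumes the word contains no regex metacharacters, and it splits on the first whole-word occurrence only (a sentence that contains the word twice would need a decision on which occurrence to use).

```r
# Hypothetical helper: split one line into the text left and right of `word`.
# Assumes `word` is plain text without regex metacharacters.
split_context <- function(sentence, word) {
  # Drop the path prefix up to and including <sentence>, then the closing tag
  text <- sub("^.*?<sentence>", "", sentence, perl = TRUE)
  text <- sub("</sentence>\\s*$", "", text, perl = TRUE)

  # Find the first whole-word occurrence of the word
  hit <- regexpr(paste0("\\b", word, "\\b"), text, perl = TRUE)
  pos <- as.integer(hit)
  if (pos == -1L) {
    return(c(left = NA_character_, right = NA_character_))
  }

  left  <- substr(text, 1, pos - 1)
  right <- substr(text, pos + attr(hit, "match.length"), nchar(text))
  c(left = trimws(left), right = trimws(right))
}

# The two airbag lines from the example file
sentences <- c(
  "/home/nobackup/SONAR/COMPACT/WR-U-E-A/WR-U-E-A0000075.data.ids.xml:  <sentence>ja voor den airbag op te pompen eh :p</sentence>",
  "/home/nobackup/SONAR/COMPACT/WR-U-E-A/WR-U-E-A0000129.data.ids.xml:  <sentence>Dobby , als ze valt heeft ze dan wel al ne airbag hee</sentence>"
)
words <- c("airbag", "airbag")

# One row per sentence, with columns "left" and "right"
ctx <- t(mapply(split_context, sentences, words, USE.NAMES = FALSE))
print(ctx)
```

Because the word differs per file, mapply pairs each sentence with its own word instead of relying on a single fixed pattern.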

I understand that this seems like quite a large question, but basically what I need is a way to loop over the files themselves and extract each line into a new row, whilst putting information about the file itself in separate columns (on the same row). Extracting the text from the lines is something I should be able to do by myself. For example, it would get me a long way if I could just get something like this:

ID | filename             | word | component | sentence
----------------------------------------------------------------------------------------------------------------
1    adapter.WR-P-P-F.lst  adapter  WR-P-P-F   /home/nobackup/SONAR/COMPACT/WR-P-P-F/WR-P-P-F0000026.data.ids.xml:  <sentence>Een aanpassingseenheid ( adapter ) , aangebracht in een behuizing voornamelijk bestaande uit in- en uitvoereenheden , een koppeleenheid , een geheugeneenheid , een besturingseenheid ( met actieve en passieve elementen en monolitische geïntegreerde schakelingen ) en een elektrische voedingseenheid . &gt;</sentence>
2    adapter.WR-P-P-F.lst  adapter  WR-P-P-F   /home/nobackup/SONAR/COMPACT/WR-P-P-F/WR-P-P-F0000026.data.ids.xml:  <sentence>ID=&quot;1&quot;&gt;Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens , met een snelheid van 10 Mbps ( megabits per seconde ) , tussen meerdere automatische gegevensverwerkende machines in een digitaal netwerk . &quot; &gt;</sentence>
3    adapter.WR-P-P-F.lst  adapter  WR-P-P-F   /home/nobackup/SONAR/COMPACT/WR-P-P-F/WR-P-P-F0000034.data.ids.xml:  <sentence>Overwegende dat deze sensoren niet zijn ontworpen op de installatie van een gepantserde kabel ; dat de mogelijkheid moet worden geboden dat de gepantserde kabel niet verplicht wordt gesteld voor de aansluiting tussen de sensor en de adapter , maar alleen van de adapter naar het controleapparaat ; dat het bijgevolg noodzakelijk is de verordening dienovereenkomstig te wijzigen ;</sentence>
4    airbag.WS-U-E-A.lst   airbag   WS-U-E-A   /home/nobackup/SONAR/COMPACT/WR-U-E-A/WR-U-E-A0000075.data.ids.xml:  <sentence>ja voor den airbag op te pompen eh :p</sentence>
5    airbag.WS-U-E-A.lst   airbag   WS-U-E-A   /home/nobackup/SONAR/COMPACT/WR-U-E-A/WR-U-E-A0000129.data.ids.xml:  <sentence>Dobby , als ze valt heeft ze dan wel al ne airbag hee</sentence>
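A table of this shape could be built along the following lines. This is a minimal sketch under some assumptions, not a definitive answer: read_lst_file is a hypothetical helper name, the demo directory and the shortened adapter line are made up so the example runs on its own, and every file is assumed to follow the word.component.lst naming scheme.

```r
# Create a small demo directory so the sketch is self-contained.
# The adapter line is shortened for illustration; the airbag lines are from the example.
demo_dir <- file.path(tempdir(), "lst_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)
writeLines(
  "/home/nobackup/SONAR/COMPACT/WR-P-P-F/WR-P-P-F0000026.data.ids.xml:  <sentence>Het toestel ( adapter ) draagt zorg</sentence>",
  file.path(demo_dir, "adapter.WR-P-P-F.lst")
)
writeLines(
  c("/home/nobackup/SONAR/COMPACT/WR-U-E-A/WR-U-E-A0000075.data.ids.xml:  <sentence>ja voor den airbag op te pompen eh :p</sentence>",
    "/home/nobackup/SONAR/COMPACT/WR-U-E-A/WR-U-E-A0000129.data.ids.xml:  <sentence>Dobby , als ze valt heeft ze dan wel al ne airbag hee</sentence>"),
  file.path(demo_dir, "airbag.WS-U-E-A.lst")
)

# Hypothetical helper: one data frame per file, one row per line.
read_lst_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  name  <- basename(path)
  n     <- length(lines)
  data.frame(
    filename  = rep(name, n),
    word      = rep(sub("\\..*$", "", name), n),                     # before the first dot
    component = rep(sub("^[^.]+\\.(.*)\\.lst$", "\\1", name), n),    # between first dot and .lst
    sentence  = lines,
    stringsAsFactors = FALSE
  )
}

# Loop over the files, stack the per-file data frames, then number the rows
files <- list.files(demo_dir, pattern = "\\.lst$", full.names = TRUE)
d <- do.call(rbind, lapply(files, read_lst_file))
d <- cbind(ID = seq_len(nrow(d)), d)
str(d)
```

Reading each file into its own data frame and binding them once at the end avoids growing a data frame row by row inside a loop, which tends to get slow with many files.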

I hope it's clear what I am trying to say. If not, feel free to ask.
