Bash script to navigate a directory substructure and then operate on .xml files

Problem description

I tried this:

for dir in /home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2007/02/*/
    for f in *.xml ; do
        echo $f | grep -q '_output\.xml$' && continue # skip output files
        g="$(basename $f .xml)_output.xml"
        java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile $f -outputFormat inlineXML > $g
    done
done

It is based on the answer to this question, but it didn't work.

I have a folder structure in which the directory NYTimesCorpus contains a directory 2007, and that in turn contains the directories 01, 02, 03, and so on.

Within 01 there are again directories 01, 02, 03, ...

Each of these terminal directories contains many .xml files, to which I want to apply this script:

for f in *.xml ; do
    echo $f | grep -q '_output\.xml$' && continue # skip output files
    g="$(basename $f .xml)_output.xml"
    java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile $f -outputFormat inlineXML > $g
done

But there are so many different directories that running it within each directory is a rare form of torture. Apart from 2007 I also have 2006 and 2005, so ideally I would like to run it once and have the program navigate that structure on its own.

My attempts thus far have not been successful. Perhaps one of you knows how to achieve this?

Thank you for your consideration.

UPDATE

textFile=./scrypt.sh
outputFormat=inlineXML
Loading classifier from /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz ... done [2.2 sec].
CRFClassifier tagged 71 words in 5 documents at 959.46 words per second.
CRFClassifier invoked on Sun Apr 12 19:33:34 HKT 2015 with arguments:
   -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile ./scrypt.sh -outputFormat inlineXML
    loadClassifier=/home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz

Solution

find is a good solution, but it sounds like all the XML files are at the same directory depth, so a plain glob should be enough. Try this:

dir=/home/matthias/Workbench/SUTD/nytimes_corpus
# Match .xml files exactly three directory levels below NYTimesCorpus (e.g. 2007/01/01).
for f in "$dir"/NYTimesCorpus/*/*/*/*.xml; do
    [[ $f == *_output.xml ]] && continue # skip output files
    g="${f%.xml}_output.xml"             # output is written next to its input file
    java -mx600m \
         -cp "$dir/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar" \
         edu.stanford.nlp.ie.crf.CRFClassifier \
         -loadClassifier "$dir/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz" \
         -textFile "$f" \
         -outputFormat inlineXML > "$g"
done

The glob pattern $dir/NYTimesCorpus/*/*/*/*.xml specifies that the wanted XML files are exactly 3 levels below NYTimesCorpus. If that is the wrong depth, alter the number of */ components in the pattern.
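
If it is unclear how many */ components the pattern needs, one option is to count the directory levels directly. The following is only a sketch: xml_depths is a hypothetical helper name, it assumes find and awk are available, and it does not handle file names that contain newlines.

```shell
# Sketch: report how many directory levels lie between a root and its .xml files.
# xml_depths is a hypothetical helper name, not part of the original answer.
xml_depths() {
    root=${1%/}   # drop one trailing slash so the offset below is right
    find "$root" -type f -name '*.xml' |
        awk -v root="$root" '
            {
                rel = substr($0, length(root) + 2)   # path relative to root
                depth = gsub("/", "/", rel)          # number of "/" = directory levels
                count[depth]++
            }
            END { for (d in count) print d, count[d] }
        ' | sort -n
}
```

Each output line is a depth followed by a file count. Called as xml_depths "$dir/NYTimesCorpus", a line starting with 3 would correspond to the */*/*/*.xml pattern; more than one line suggests the files sit at varying depths.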

If the XML files can appear at varying depths, use find, or in bash (version 4.0 or later, which added globstar) use:

shopt -s globstar nullglob
for f in "$dir"/NYTimesCorpus/**/*.xml; do
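
For the find route mentioned above, a minimal sketch could look like the following. It is a dry run under stated assumptions: process_tree is a hypothetical function name, and the echo line is a stand-in for the java command from the script above, so nothing is tagged or overwritten until that line is swapped in.

```shell
# Sketch: process .xml files at any depth with find (dry run).
# process_tree is a hypothetical name; the echo stands in for the java command.
process_tree() {
    find "$1" -type f -name '*.xml' ! -name '*_output.xml' -exec sh -c '
        for f in "$@"; do
            g="${f%.xml}_output.xml"          # output goes next to its input
            echo "would tag: $f -> $g"        # replace with the java ... > "$g" call
        done
    ' sh {} +
}
```

Called as process_tree "$dir/NYTimesCorpus", this prints one line per input file. The ! -name '*_output.xml' test plays the same role as the [[ ... ]] && continue check in the glob version, and passing file names through -exec ... {} + keeps names with spaces intact.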
