bash script to navigate directory substructure and then operate on .xml files
I tried this:
for dir in /home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2007/02/*/
for f in *.xml ; do
echo $f | grep -q '_output\.xml$' && continue # skip output files
g="$(basename $f .xml)_output.xml"
java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile $f -outputFormat inlineXML > $g
done
done
which is based on the answer to this question, but that didn't work.
I have a folder structure such that within the directory NYTimesCorpus there is a directory 2007, and within that a directory 01 and also 02, 03, and so on. Then within 01 there is again 01, 02, 03, ...
In each of these terminal directories there are many .xml files to which I want to apply the script:
for f in *.xml ; do
echo $f | grep -q '_output\.xml$' && continue # skip output files
g="$(basename $f .xml)_output.xml"
java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile $f -outputFormat inlineXML > $g
done
But there are so many different directories that running it within each directory is a rare form of torture. Apart from 2007 I also have 2006 and 2005, so ideally I would like to run it once and have the program navigate that structure on its own.
My attempts thus far have not been successful. Perhaps one of you knows how to achieve this?
Thank you for your consideration.
UPDATE
textFile=./scrypt.sh
outputFormat=inlineXML
Loading classifier from /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz ... done [2.2 sec].
CRFClassifier tagged 71 words in 5 documents at 959.46 words per second.
CRFClassifier invoked on Sun Apr 12 19:33:34 HKT 2015 with arguments:
-loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile ./scrypt.sh -outputFormat inlineXML
loadClassifier=/home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz
find is a good solution. It sounds like all the xml files are at the same directory depth, so try this:
dir=/home/matthias/Workbench/SUTD/nytimes_corpus
for f in "$dir"/NYTimesCorpus/*/*/*/*.xml; do
    [[ $f == *_output.xml ]] && continue # skip output files
    g="${f%.xml}_output.xml"
    java -mx600m \
        -cp "$dir"/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar \
        edu.stanford.nlp.ie.crf.CRFClassifier \
        -loadClassifier "$dir"/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz \
        -textFile "$f" \
        -outputFormat inlineXML > "$g"
done
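One detail of this loop that differs from the script in the question: `${f%.xml}` strips only the suffix and keeps the directory part of the path, so each output file lands next to its input, whereas `basename` also drops the directory and the output would go to the current directory. A small sketch of the difference, using a made-up path that is not taken from the corpus:

```shell
#!/usr/bin/env bash
# Example path only (hypothetical); nothing is read from or written to disk.
f=/corpus/2007/02/03/article.xml

with_expansion="${f%.xml}_output.xml"             # suffix removed, directory kept
with_basename="$(basename "$f" .xml)_output.xml"  # directory dropped as well

echo "$with_expansion"
echo "$with_basename"
```

The first line printed is the full path ending in article_output.xml; the second is the bare file name article_output.xml.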
The glob pattern $dir/NYTimesCorpus/*/*/*/*.xml specifies that the wanted xml files are exactly 3 levels below NYTimesCorpus. If that is the wrong depth, alter the number of */ in the pattern.
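To check which files a fixed-depth pattern picks up before starting a long java run, something like the following could be tried. It is only a sketch: it builds a throwaway tree under a temporary directory, the file names are invented for illustration, and it prints the matches instead of calling the classifier.

```shell
#!/usr/bin/env bash
# Build a scratch tree shaped like NYTimesCorpus/<year>/<month>/<day>/*.xml.
# All file names below are hypothetical.
corpus=$(mktemp -d)/NYTimesCorpus
mkdir -p "$corpus/2007/01/01" "$corpus/2007/02/03" "$corpus/2006/12"
touch "$corpus/2007/01/01/a.xml" \
      "$corpus/2007/01/01/a_output.xml" \
      "$corpus/2007/02/03/b.xml" \
      "$corpus/2006/12/too_shallow.xml"

# Three "*/" components match exactly year/month/day, so the file that
# sits only two levels down is never visited.
matched=()
for f in "$corpus"/*/*/*/*.xml; do
    [[ $f == *_output.xml ]] && continue   # skip output files
    matched+=("${f#"$corpus"/}")           # store the path relative to the corpus
done
printf '%s\n' "${matched[@]}"
```

This lists 2007/01/01/a.xml and 2007/02/03/b.xml; too_shallow.xml is left out because it is only two levels below the corpus directory.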
If the xml files can appear at varying depths, use find, or in bash (version 4 or later, where the globstar option is available) use:
shopt -s globstar nullglob
for f in $dir/NYTimesCorpus/**/*.xml; do
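The loop header above would take the same body as the earlier loop. For the find route, a minimal sketch might look like the following. It is a dry run under stated assumptions: the tree is a scratch directory with invented file names, tag_file is a hypothetical stand-in for the java CRFClassifier command, and -print0 assumes a find that supports it (GNU and BSD find do).

```shell
#!/usr/bin/env bash
# Scratch tree with .xml files at several depths (names are hypothetical).
corpus=$(mktemp -d)/NYTimesCorpus
mkdir -p "$corpus/2007/01/01" "$corpus/2005/extra/deep/er"
touch "$corpus/2007/01/01/a.xml" \
      "$corpus/2007/01/01/a_output.xml" \
      "$corpus/2005/extra/deep/er/c.xml" \
      "$corpus/top.xml"

# tag_file is a placeholder for the real java ... CRFClassifier call; it
# only reports the input and the output name that would be written.
tag_file() {
    local f=$1
    local g="${f%.xml}_output.xml"
    printf '%s -> %s\n' "${f#"$corpus"/}" "${g##*/}"
}

# find descends to every depth; "! -name" drops files produced by an
# earlier run. -print0 with read -d '' keeps unusual file names intact.
count=0
while IFS= read -r -d '' f; do
    tag_file "$f"
    count=$((count + 1))
done < <(find "$corpus" -type f -name '*.xml' ! -name '*_output.xml' -print0)
echo "processed $count files"
```

Feeding the loop through process substitution rather than a pipe keeps it in the current shell, so count is still set after the loop ends. Replacing the body of tag_file with the java command from the loop above would turn the dry run into the real thing.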