Multiple files input to Stanford NER preserving naming for each output


Problem description

I have many files (the NYTimes corpus for '05, '06, and '07) that I want to run through the Stanford NER. "Easy," you might think, "just follow the commands in the README doc." If you thought that just now, you would be mistaken, because my situation is a bit more complicated. I don't want all the output dumped into one big jumbled mess; I want to preserve the naming structure of each file. For example, one file is named 1822873.xml, and I processed it earlier using the following command:

java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile /home/matthias/Workbench/SUTD/nytimes_corpus/1822873.xml -outputFormat inlineXML >> output.curtis

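If I were to follow this question (http://stackoverflow.com/questions/27544545/python-or-bash-script-to-pass-all-files-in-a-folder-to-java-command-line), i.e. list many files one after the other in the command and then redirect the output somewhere, wouldn't it just send them all to the same file? That sounds like a headache and a disaster of the highest order.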

Is there some way to send each file to a separate output file, so that, for instance, our old friend 1822873.xml would emerge from this process as, say, 1822873.output.xml, and likewise for each of the other thousand-odd files? Please keep in mind that I'm trying to achieve this expeditiously.

I guess this should be possible, but what is the best way to do it? With some kind of terminal command, or by writing a small script?

Maybe one among you has some experience with this type of thing.

Thank you for your consideration.

Solution

UPDATE

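You can do it with a bash script like this: http://stackoverflow.com/questions/29588423/bash-script-to-navigate-directory-substructure-and-then-operate-on-xml-files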


@duhaime I tried that, but I had an issue with the classifier. Also, is it possible to format the output of that as inline XML?

With respect to my original question, check out what I've found:

Unfortunately, there is no option to have multiple input files go to multiple output files. The best you can do in the current situation is to run the CRFClassifier once for each input file you have. If you have a ton of small files, loading the model will be an expensive part of this operation, and you might want to use the CRFClassifier server program and feed files one at a time through the client. However, I doubt that will be worth the effort except in the specific case of having very many small files.

We will try to add this as a feature for the next distribution (we have a general fix-it day coming up) but no promises.

John

My files are all numbered in ascending order. Do you think it would be possible to write some kind of bash script with a loop to process each of them one at a time?
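
A loop along those lines could look like the sketch below. It runs the CRFClassifier once per input file, as John suggests, and reuses only the flags from the command in the question. The function names `output_name` and `tag_directory`, the `NER_JAR` and `CLASSIFIER` variables with their default values, and the two directory arguments are assumptions made for illustration; they are not part of Stanford NER.

```shell
#!/usr/bin/env bash
# Sketch: tag every .xml file in a directory with Stanford NER, one run per
# file, so that 1822873.xml produces 1822873.output.xml.
# NER_JAR and CLASSIFIER are hypothetical defaults; point them at your install.
NER_JAR="${NER_JAR:-stanford-ner-3.5.1.jar}"
CLASSIFIER="${CLASSIFIER:-classifiers/english.all.3class.distsim.crf.ser.gz}"

# Map an input path such as /some/dir/1822873.xml to 1822873.output.xml.
output_name() {
    local base
    base="$(basename "$1" .xml)"
    printf '%s.output.xml\n' "$base"
}

# Run the classifier on each .xml file in in_dir and write the inline-XML
# result to a file of the matching name in out_dir.
tag_directory() {
    local in_dir="$1" out_dir="$2" f
    mkdir -p "$out_dir"
    for f in "$in_dir"/*.xml; do
        [ -e "$f" ] || continue   # the glob matched nothing
        java -mx600m -cp "$NER_JAR" edu.stanford.nlp.ie.crf.CRFClassifier \
            -loadClassifier "$CLASSIFIER" \
            -textFile "$f" \
            -outputFormat inlineXML > "$out_dir/$(output_name "$f")"
    done
}

# Only run when an input and an output directory are given on the command line.
if [ "$#" -eq 2 ]; then
    tag_directory "$1" "$2"
fi
```

Saved as, say, tag_all.sh, it would be called as `bash tag_all.sh nytimes_corpus tagged`. Keeping the output directory separate from the input directory matters here, because the generated `*.output.xml` files would otherwise match the `*.xml` glob on a second run. The model is reloaded for every file, which is the cost John mentions for large numbers of small files.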
