UIMA RUTA - how to do find & replace using regular expressions and groups


Question

RUTA newbie here. I'm processing a document with RUTA and have a lot of normalization to do before I can start annotating. I'm trying to find the best way to find and replace character sequences in the original document using regular expressions and groups. In essence, I want to do something similar to String.replaceAll in RUTA.

For example, in Java:

inputString = inputString.replaceAll( "(?i)7\\s*\\(SEVEN\\)", "7");

But I can't figure out a simple way to achieve this in RUTA.

Thanks.

Recommended answer

In general this is not simple, because you cannot change the document text in a CAS.

UIMA Ruta has some functionality for modifying the document, but the result needs to be stored in another CAS view or in an additional file. A few general comments:

  • Simple regular expression rules can be used to match patterns like the one in your question: http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.regexprule
  • The REPLACE action records the modifications to apply; it does not change the document text itself.
  • The Modifier analysis engine performs the modifications and can store the changed document in an additional CAS view and in an additional HTML file (HTML because the Modifier can also add colored spans).
  • The ViewWriter analysis engine can copy one view to another, for example if you want to work with "_initialView".
  • If your document contains annotations and these annotations should still be valid after the replacement, then different functionality is needed. The HTMLConverter has some parameters for replacements, but I do not know whether it solves your problem in this use case without more information or further testing.

Here's the script for the example in your question:

ENGINE utils.Modifier;
ENGINE utils.ViewWriter;
TYPESYSTEM utils.SourceDocumentInformation;
DECLARE ToReplace;

// just create an annotation
"(?i)7\\s*\\(SEVEN\\)" -> ToReplace;

// replace the text covered by all annotations with the string "7"
ToReplace{-> REPLACE("7")};
// ... the annotation should be removed again with UNMARK before different replacements are performed ...
// it is also possible to do this in a more generic way with features and variables

// ... either store the changed text in the "modified" view and in an additional html file
Document{-> CONFIGURE(Modifier, "outputLocation" = "D:/modified/"), EXEC(Modifier)};

// ... or store the changed text in the "modified" view and in an additional xmiCAS
Document{-> EXEC(Modifier), CONFIGURE(ViewWriter, "inputView" = "modified", "output" = "../modified/"), EXEC(ViewWriter)};

Just to mention: the Modifier has a small bug that results in doubled whitespace.

A more generic way to model the replacements could be:

DECLARE Annotation ToReplace(STRING r);
"(?i)(7)\\s*\\(SEVEN\\)" -> ToReplace ("r" = 1);
ToReplace{-> REPLACE(ToReplace.r)};

The ToReplace annotations now have an additional string feature that stores the value that should replace the text covered by the annotation. The regular expression has an additional capturing group, which is used to set that string in the annotation (the value is assigned using the number of the capturing group). The rule with REPLACE is now more generic, since the actual value does not need to be given in the action; the value of the feature is applied instead. The last rule can therefore be used for any replacements specified by other rules.
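For comparison with the String.replaceAll call in the question, here is a minimal plain-Java sketch of the same idea: the capturing group supplies the replacement value instead of a hard-coded string. The class and method names are made up for illustration, and this is not Ruta code.

```java
import java.util.regex.Pattern;

class GroupReplaceSketch {
    // Same pattern as in the Ruta rule: group 1 captures the digit to keep.
    private static final Pattern SEVEN = Pattern.compile("(?i)(7)\\s*\\(SEVEN\\)");

    // Replaces every match with the content of capturing group 1.
    static String normalize(String input) {
        return SEVEN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(normalize("within 7 (seven) days")); // prints "within 7 days"
    }
}
```

The "$1" in the replacement string plays the role of the "r" feature above: the pattern decides the value, and the replacement step stays generic.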

In general, consecutive replacements that operate on the changed text need to be specified in a pipeline with sofa mappings, since later rules need to operate on different views. In the UIMA Ruta Workbench, one could define each find/replace in a separate script file and then use one launch configuration per script file. The launch configurations can specify the input and output folders. Combined with the ViewWriter, this lets the user build a chain of script files, each operating on the output folder of the previous one.
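The chaining idea can be pictured with a small plain-Java analogy, in which each step reads the text produced by the previous step. This is only a sketch of the data flow, not of the Ruta pipeline itself; the class name, the method names, and the second rule (collapsing repeated whitespace) are assumptions added for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ChainedReplaceSketch {
    // Hypothetical rule set: regex -> replacement, applied in insertion order.
    static Map<String, String> demoRules() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("(?i)(7)\\s*\\(SEVEN\\)", "$1"); // the replacement from the question
        rules.put("\\s{2,}", " ");                 // collapse runs of whitespace
        return rules;
    }

    // Each rule operates on the output of the previous rule, like a chain
    // of scripts where every script reads the previous script's output.
    static String applyAll(String text, Map<String, String> rules) {
        String current = text;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            current = current.replaceAll(rule.getKey(), rule.getValue());
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(applyAll("pay  within 7 (SEVEN)  days", demoRules()));
    }
}
```

In Ruta the intermediate results live in separate views or output folders rather than in a local variable, which is why the sofa mappings or launch configurations are needed.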

Consecutive replacements can also be done in one script file, but with some restrictions. The REPLACE action actually stores the new text in the replacement feature of each RutaBasic annotation. The first RutaBasic gets the complete new string and the other RutaBasic annotations are set to the empty string. When the Modifier creates the new text, the covered text of each RutaBasic annotation is replaced by the value of that feature, so the first token is replaced by the complete replacement string and the other tokens are deleted. Knowing this procedure, rules can depend on previous replacements and change the respective feature values. Overall, consecutive replacements are possible, but not straightforward.
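The bookkeeping described above can be modeled in a few lines of plain Java. This is a simplified illustration, not the actual Ruta implementation: the Token class, its fields, and the method names are hypothetical stand-ins for RutaBasic and its replacement feature.

```java
import java.util.ArrayList;
import java.util.List;

class BasicTokenSketch {
    // Stand-in for a RutaBasic annotation with a replacement feature.
    static final class Token {
        final String coveredText;
        String replacement; // null means "keep the covered text"
        Token(String coveredText) { this.coveredText = coveredText; }
    }

    static List<Token> tokens(String... texts) {
        List<Token> result = new ArrayList<>();
        for (String text : texts) {
            result.add(new Token(text));
        }
        return result;
    }

    // Models REPLACE over tokens [from, to): the first token receives the
    // whole new string, the remaining tokens receive the empty string.
    static void replace(List<Token> tokens, int from, int to, String value) {
        for (int i = from; i < to; i++) {
            tokens.get(i).replacement = (i == from) ? value : "";
        }
    }

    // Models the Modifier: builds the new text from the feature values.
    static String render(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            sb.append(t.replacement != null ? t.replacement : t.coveredText);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Token> doc = tokens("within", " ", "7", " ", "(", "SEVEN", ")", " ", "days");
        replace(doc, 2, 7, "7");
        System.out.println(render(doc)); // prints "within 7 days"
    }
}
```

A later rule that wants to build on this replacement would have to change the stored value on the first token (for example `doc.get(2).replacement = "seven"`), since the original covered text never changes.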

