Design pattern to apply over a chain of text normalizers


Problem description


I have a program that periodically receives files containing multiple lines, which I process line by line. To process these lines I developed some text normalizers that transform each line. For example, a normalizer could remove stopwords, correct grammar, remove URLs, etc.

The normalizers used for a given file must be decided dynamically, so that I can change how many are applied and in what order. For some files I only have to remove stopwords, for example, but others require more normalizers, and in some cases I must apply one twice.

My first idea for organizing the code was to apply the Chain of Responsibility pattern. In this case I would have something like this:

[Diagram: Normalizer 1 -> Normalizer 2 -> Normalizer 3 -> Normalizer 1]

As you can see in the diagram, three normalizers are used in order, and after that the first normalizer is used again. This is only an example. In another scenario I could have 7 normalizers without repetitions, and in yet another the second normalizer would be executed before the third. So, the main idea is to have multiple normalizers and define a chain dynamically, possibly with repetitions.

My problem with this approach is that every member of the chain is always executed (there is no reject condition), and I have thousands and thousands of lines to process, so I don't want to spend a lot of time iterating over the chain.

So, my question is: what is the best way to implement this so that I can add new normalizers without having to rewrite code, while keeping chain iteration fast?

If you need more information just ask for it and I'll edit the question.

Solution

When you execute all normalizers for each line, the design pattern is a list of Commands rather than a Chain of Responsibility, because no normalizer checks whether it is responsible for the line.

As I understand it, the list of normalizers is constant per file, so creating it is not an issue. You also said that you run all of them for each line, so the only thing you can tune for performance is the iteration itself.

I would use a design like this:

1) All normalizers implement a common interface

interface Normalizer {
  String normalize(String line);
}

You most likely have something like that in place already.
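For illustration, here is a minimal sketch of two normalizers behind that interface. The class names `UrlRemover` and `WhitespaceCollapser` and the deliberately simple URL regex are assumptions made for this example, not part of the question. Both classes keep no per-line state, which is what makes them safe to reuse later.

```java
import java.util.regex.Pattern;

// Same single-method interface as above.
interface Normalizer {
    String normalize(String line);
}

// Hypothetical normalizer: strips http/https URLs using a deliberately simple regex.
final class UrlRemover implements Normalizer {
    // Compiled once; Pattern is immutable and safe to share.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    @Override
    public String normalize(String line) {
        return URL.matcher(line).replaceAll("");
    }
}

// Hypothetical normalizer: collapses runs of whitespace and trims both ends.
final class WhitespaceCollapser implements Normalizer {
    private static final Pattern SPACES = Pattern.compile("\\s+");

    @Override
    public String normalize(String line) {
        return SPACES.matcher(line).replaceAll(" ").trim();
    }
}
```

Because the interface has a single method, short normalizers could also be written as lambdas or method references, for example `Normalizer trim = String::trim;`.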

2) When opening the file (or starting to process it) you determine which normalizers you need. Unless your files are short and you have many of them, it is not important how you do that. You could have a factory that returns the proper list of normalizers for some criteria. It could read a textual list of class names or build a hard-coded list of commands. Also consider Joop Eggen's answer here.

class Factory {
  List<Normalizer> buildNormalizers(DeterminingCriteria criteria) { ... } 
}

If you need to change the list without redeploying, a text file with a list of class names is handy. If you also need to add a new normalizer at that point, you have to change code anyway, so a class that builds the list of normalizers would be fine as well.

As the normalizers need to be stateless here, the same normalizer instance can appear more than once in the list. In fact, you can reuse all normalizers for all files, unless your application is started anew for each file. Because your commands are stateless, they can also work concurrently on different files, if needed. Maybe use a design like this:
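One possible way to turn a list of class names into a chain is plain reflection. The sketch below assumes every normalizer has a no-argument constructor; `NormalizerLoader`, `Trimmer`, `LowerCaser`, and the convention of skipping blank lines and `#` comments are all hypothetical choices for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

interface Normalizer {
    String normalize(String line);
}

// Hypothetical normalizers, present only so the loader has something to instantiate.
class Trimmer implements Normalizer {
    public String normalize(String line) { return line.trim(); }
}

class LowerCaser implements Normalizer {
    public String normalize(String line) { return line.toLowerCase(Locale.ROOT); }
}

final class NormalizerLoader {
    // Builds the chain in the order the class names are listed.
    // Blank entries and entries starting with '#' are skipped (assumed convention).
    static List<Normalizer> fromClassNames(List<String> classNames)
            throws ReflectiveOperationException {
        List<Normalizer> chain = new ArrayList<>();
        for (String raw : classNames) {
            String name = raw.trim();
            if (name.isEmpty() || name.startsWith("#")) {
                continue;
            }
            // asSubclass throws ClassCastException if the class is not a Normalizer.
            Class<? extends Normalizer> type =
                    Class.forName(name).asSubclass(Normalizer.class);
            chain.add(type.getDeclaredConstructor().newInstance());
        }
        return chain;
    }
}
```

The names could come from something like `Files.readAllLines(Path.of("normalizers.txt"))`, where the file name is again only an assumption. Listing a class twice creates two instances here; combining this with a cache would let repeated entries share one instance.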

  class Factory {
    private Map<Criteria, Normalizer> cachedNormalizers;
    public Factory() {
      // create all normalizers from a master map 
      // or hard coded here and add to map.
    }
    List<Normalizer> buildNormalizers(DeterminingCriteria criteria) { 
      // create an empty list and fill it with normalizers from the
      // cached map, depending on the criteria you need.
    } 
  }
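A runnable variant of this caching factory might look like the sketch below. It makes one simplifying assumption: the criteria are reduced to an ordered list of string keys, which is not how the original `DeterminingCriteria` is defined. The key names and the three registered normalizers are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

interface Normalizer {
    String normalize(String line);
}

final class CachingNormalizerFactory {
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // One shared, stateless instance per key (keys are hypothetical names).
    private final Map<String, Normalizer> cachedNormalizers = new HashMap<>();

    CachingNormalizerFactory() {
        cachedNormalizers.put("trim", String::trim);
        cachedNormalizers.put("lowercase", line -> line.toLowerCase(Locale.ROOT));
        cachedNormalizers.put("strip-urls", line -> URL.matcher(line).replaceAll(""));
    }

    // Returns the chain in the requested order; a repeated key reuses the same instance.
    List<Normalizer> buildNormalizers(List<String> orderedKeys) {
        List<Normalizer> chain = new ArrayList<>(orderedKeys.size());
        for (String key : orderedKeys) {
            Normalizer normalizer = cachedNormalizers.get(key);
            if (normalizer == null) {
                throw new IllegalArgumentException("Unknown normalizer: " + key);
            }
            chain.add(normalizer);
        }
        return chain;
    }
}
```

Calling `buildNormalizers(List.of("trim", "strip-urls", "lowercase", "trim"))` yields a four-element chain whose first and last entries are the same object, which covers the "apply one twice" case without extra allocations.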

3) Then in the main code you just iterate over these for each line. Iterating the list should be pretty fast, like this (pseudocode):

// Build the chain once per file, not once per line.
List<Normalizer> normalizers = factory.buildNormalizers(currentFileCriteria);
for (String line : lines) {
  String currentLine = line;
  // Feed the output of each normalizer into the next one.
  for (Normalizer n : normalizers) {
    currentLine = n.normalize(currentLine);
  }
  doSomethingWithFinished(currentLine);
}
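The loop above can also be wrapped so that the whole chain is itself a `Normalizer`. The following is a minimal sketch of that idea, not the only way to do it; `NormalizerChain` and `FileProcessor` are hypothetical names, and the lines are assumed to be already loaded into a `List<String>`.

```java
import java.util.List;
import java.util.stream.Collectors;

interface Normalizer {
    String normalize(String line);
}

// Composite: the whole chain is exposed as a single Normalizer.
final class NormalizerChain implements Normalizer {
    // Copied into an array once per file; the chain does not change afterwards.
    private final Normalizer[] steps;

    NormalizerChain(List<Normalizer> steps) {
        this.steps = steps.toArray(new Normalizer[0]);
    }

    @Override
    public String normalize(String line) {
        String current = line;
        // Feed the output of each step into the next one.
        for (Normalizer step : steps) {
            current = step.normalize(current);
        }
        return current;
    }
}

final class FileProcessor {
    // Applies the chain to every line and keeps the original line order.
    static List<String> process(List<String> lines, Normalizer chain) {
        return lines.stream().map(chain::normalize).collect(Collectors.toList());
    }
}
```

As long as every step is stateless, replacing `lines.stream()` with `lines.parallelStream()` would be one option for large files. Whether it actually helps depends on how expensive the individual normalizers are, so it is worth measuring first.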
