将invRegex.py移植到Javascript(Node.js) [英] Porting invRegex.py to Javascript (Node.js)

查看:94
本文介绍了将invRegex.py移植到Javascript(Node.js)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直试图移植 invRegex.py 到node.js实现了一段时间,但我仍在苦苦挣扎。由于 ret.js 标记器,我已经有了正则表达式解析树,它运行得很好但是,以记忆效率的方式实际生成和连接所有不同的元素对我来说是非常具有挑战性的。为了简单起见,我可以说我有以下正则表达式:

I have been trying to port invRegex.py to a node.js implementation for a while, but I'm still struggling with it. I already have the regular expression parse tree thanks to the ret.js tokenizer and it works pretty well, but the actual generation and concatenation of all the distinct elements in a way that is memory-efficient is revealing very challenging to me. To keep it simple, lets say I have the following regex:

[01]{1,2}@[a-f]

将其提供给 invRegex.py 会产生以下结果输出( tabbified 占用更少的空间):

Feeding that to invRegex.py produces the following output (tabbified to take less space):

 0@a     0@b     0@c     0@d     0@e     0@f
00@a    00@b    00@c    00@d    00@e    00@f
01@a    01@b    01@c    01@d    01@e    01@f
 1@a     1@b     1@c     1@d     1@e     1@f
10@a    10@b    10@c    10@d    10@e    10@f
11@a    11@b    11@c    11@d    11@e    11@f

考虑到我'能够获得每个单独的令牌并生成所有有效单个输出的数组:

Considering I'm able to get each individual token and produce an array of all the valid individual outputs:

[01]{1,2} = function () {
    return ['0', '00', '01', '1', '10', '11'];
};

@ = function () {
    return ['@'];
};

[a-f] = function () {
    return ['a', 'b', 'c', 'd', 'e', 'f'];
};

我可以计算笛卡儿积分并获得相同的预期输出:

I can compute the cartesian product of all the arrays and get the same expected output:

var _ = require('underscore');

function cartesianProductOf() {
    return _.reduce(arguments, function(a, b) {
        return _.flatten(_.map(a, function(x) {
            return _.map(b, function(y) {
                return x.concat([y]);
            });
        }), true);
    }, [ [] ]);
};

var tokens = [
    ['0', '00', '01', '1', '10', '11'],
    ['@'],
    ['a', 'b', 'c', 'd', 'e', 'f'],
];

var result = cartesianProductOf(tokens[0], tokens[1], tokens[2]);

_.each(result, function (value, key) {
    console.log(value.join(''));
});

这个问题在于它保留了内存中的所有36个值,如果我稍微多一点复杂的正则表达式,例如 [az] {0,10} 它会在内存中保存 146813779479511 值,这是完全不可行。我想以异步方式处理这个巨大的列表,将每个生成的组合传递给回调,并允许我在我认为合适的任何合理点上中断该过程,就像invRegex.py或这个Haskell包 - 遗憾的是我无法理解Haskell,我不知道如何模仿Python中的生成器行为到Javascript。

The problem with this is that it holds all the 36 values in memory, if I had a slightly more complicated regular expression, such as [a-z]{0,10} it would hold 146813779479511 values in memory, which is totally unfeasible. I would like to process this huge list in an asynchronous fashion, passing each generated combination to a callback and allowing me to interrupt the process at any sensible point I see fit, much like invRegex.py or this Haskell package - unfortunately I can't understand Haskell and I don't know how to mimic the generator behavior in Python to Javascript either.

我尝试在节点0.11.9(带 - 和谐)中运行几个简单的生成器实验,就像这样一个:

I tried running a couple of simple generator experiments in node 0.11.9 (with --harmony) like this one:

function* alpha() {
    yield 'a'; yield 'b'; yield 'c';
}

function* numeric() {
    yield '0'; yield '1';
}

function* alphanumeric() {
    yield* alpha() + numeric(); // what's the diff between yield and yield*?
}

for (var i of alphanumeric()) {
    console.log(i);
}

毋庸置疑,上述说法无效。 = /

Needless to say the above doesn't work. =/

在这里敲我的头,所以任何帮助解决这个问题都将受到高度赞赏。

Banging my head against the wall here, so any help tackling this problem would be highly appreciated.

UPDATE :以下是 b [az] {3} :

{
    "type": ret.types.ROOT,
    "stack": [
            {
                "type": ret.types.CHAR,
                "value": 98 // b
            },
            {
                "type": ret.types.REPETITION,
                "max": 3,
                "min": 3,
                "value": {
                    "type": ret.types.SET,
                    "not": false,
                    "set": [
                        {
                            "type": ret.types.RANGE,
                            "from": 97, // a
                            "to": 122   // z
                        }
                    ]
                }
            }
        ]
    ]
}

SET / RANGE 类型应该产生26个不同的值,而父 REPETITION type应该将之前的值取为3的幂,产生17576个不同的组合。如果我要生成一个扁平的标记数组,就像我之前为 cartesianProductOf 生成的那样,中间的展平值将为作为实际的笛卡尔积本身的空间。

The SET / RANGE type should yield 26 distinct values, and the parent REPETITION type should take that previous value to the power of 3, yielding 17576 distinct combinations. If I was to generate a flattened out tokens array like I did before for cartesianProductOf, the intermediate flattened values would take as much space as the actual cartesian product itself.

我希望这个例子能更好地解释我面临的问题。

I hope this example explains better the problem I am facing.

推荐答案

我建议你编写迭代器类。它们易于实现(基本上它们是状态机),它们具有较低的内存占用,它们可以组合起来构建越来越复杂的表达式(请向下滚动以查看最终结果),并且生成的迭代器可以包含在枚举器。

I advise you to write iterator classes. They are easy to implement (basically they are state machines), they have a low memory footprint, they can be combined to build up increasingly complex expressions (please scroll down to see the final result), and the resulting iterator can be wrapped in an enumerator.

每个迭代器类都有以下方法:

Each iterator class has the following methods:


  • 第一个:初始化状态机(第一场比赛)

  • 下一个:进入下一个状态(下一场比赛)

  • ok:最初为true,但在下一次超出最后一场比赛后变为false

  • 获取:返回当前匹配(作为字符串)

  • 克隆:克隆对象;重复的必要条件,因为每个实例都需要自己的状态

  • first: initializes the state machine (first match)
  • next: proceeds to the next state (next match)
  • ok: initially true, but becomes false once 'next' proceeds beyond the last match
  • get: returns the current match (as a string)
  • clone: clones the object; essential for repetition, because each instance needs its own state

从最简单的情况开始:一个或多个字符的序列应按字面匹配(例如 / foo / )。不用说这只有一个匹配,所以'ok'在第一次调用'next'时会变为false。

Start off with the most trivial case: a sequence of one or more characters that should be matched literally (e.g. /foo/). Needless to say this has only one match, so 'ok' will become false upon the first call of 'next'.

function Literal(literal) { this.literal = literal; }

Literal.prototype.first = function() { this.i = 0; };
Literal.prototype.next = function() { this.i++; };
Literal.prototype.ok = function() { return this.i == 0; };
Literal.prototype.get = function() { return this.literal; };
Literal.prototype.clone = function() { return new Literal(this.literal); };

字符类( [abc] )也是微不足道的。构造函数接受一串字符;如果你喜欢数组,那很容易修复。

Character classes ([abc]) are trivial too. The constructor accepts a string of characters; if you prefer arrays, that's easy to fix.

function CharacterClass(chars) { this.chars = chars; }

CharacterClass.prototype.first = function() { this.i = 0; };
CharacterClass.prototype.next = function() { this.i++; };
CharacterClass.prototype.ok = function() { return this.i < this.chars.length; };
CharacterClass.prototype.get = function() { return this.chars.charAt(this.i); };
CharacterClass.prototype.clone = function() { return new CharacterClass(this.chars); };

现在我们需要结合其他迭代器的迭代器来形成更复杂的正则表达式。一个序列只是一行中的两个或多个模式(如 foo [abc] )。

Now we need iterators that combine other iterators to form more complex regular expressions. A sequence is just two or more patterns in a row (like foo[abc]).

function Sequence(iterators) {
   if (arguments.length > 0) {
      this.iterators = iterators.length ? iterators : [new Literal('')];
   }
}
Sequence.prototype.first = function() {
   for (var i in this.iterators) this.iterators[i].first();
};
Sequence.prototype.next = function() {
   if (this.ok()) {
      var i = this.iterators.length;
      while (this.iterators[--i].next(), i > 0 && !this.iterators[i].ok()) {
         this.iterators[i].first();
      }
   }
};
Sequence.prototype.ok = function() {
   return this.iterators[0].ok();
};
Sequence.prototype.get = function() {
   var retval = '';
   for (var i in this.iterators) {
      retval += this.iterators[i].get();
   }
   return retval;
};
Sequence.prototype.clone = function() {
   return new Sequence(this.iterators.map(function(it) { return it.clone(); }));
};

组合迭代器的另一种方法是选择(a.k.a. alternative),例如: foo | bar

Another way to combine iterators is the choice (a.k.a. alternatives), e.g. foo|bar.

function Choice(iterators) { this.iterators = iterators; }

Choice.prototype.first = function() {
   this.count = 0;
   for (var i in this.iterators) this.iterators[i].first();
};
Choice.prototype.next = function() {
   if (this.ok()) {
      this.iterators[this.count].next();
      while (this.ok() && !this.iterators[this.count].ok()) this.count++;
   }
};
Choice.prototype.ok = function() {
   return this.count < this.iterators.length;
};
Choice.prototype.get = function() {
   return this.iterators[this.count].get();
};
Choice.prototype.clone = function() {
   return new Choice(this.iterators.map(function(it) { return it.clone(); }));
};

其他正则表达式功能可以通过组合现有类来实现。类继承是一种很好的方法。例如,可选模式( x?)只是空字符串和 x 之间的选择。

Other regex features can be implemented by combining the existing classes. Class inheritance is a great way to do this. For example, an optional pattern (x?) is just a choice between the empty string and x.

function Optional(iterator) {
   if (arguments.length > 0) {
      Choice.call(this, [new Literal(''), iterator]);
   }
}
Optional.prototype = new Choice();

重复( x {n,m} )是序列的组合和可选。因为我必须继承一个或另一个,我的实现包含两个相互依赖的类。

Repetition (x{n,m}) is a combination of Sequence and Optional. Because I have to inherit one or the other, my implementation consists of two mutually dependent classes.

function RepeatFromZero(maxTimes, iterator) {
   if (arguments.length > 0) {
      Optional.call(this, new Repeat(1, maxTimes, iterator));
   }
}
RepeatFromZero.prototype = new Optional();

function Repeat(minTimes, maxTimes, iterator) {
   if (arguments.length > 0) {
      var sequence = [];
      for (var i = 0; i < minTimes; i++) {
         sequence.push(iterator.clone());   // need to clone the iterator
      }
      if (minTimes < maxTimes) {
         sequence.push(new RepeatFromZero(maxTimes - minTimes, iterator));
      }
      Sequence.call(this, sequence);
   }
}
Repeat.prototype = new Sequence();

正如我之前所说,迭代器可以包装到枚举器中。这只是一个你可以随时打破的循环。

As I said earlier, an iterator can be wrapped into an enumerator. This is simply a loop that you can break whenever you want.

function Enumerator(iterator) {
   this.iterator = iterator;

   this.each = function(callback) {
      for (this.iterator.first(); this.iterator.ok(); this.iterator.next()) {
         callback(this.iterator.get());
      }
   };
}

将所有这些放在一起的时间。让我们采取一些愚蠢的正则表达式:

Time to put it all together. Let's take some silly regular expression:

([ab]{2}){1,2}|[cd](f|ef{0,2}e)

编写迭代器对象非常简单:

Composing the iterator object is really straightforward:

function GetIterationsAsHtml() {

   var iterator = new Choice([
      new Repeat(1, 2,
         new Repeat(2, 2, new CharacterClass('ab'))),
      new Sequence([
         new CharacterClass('cd'),
         new Choice([
            new Literal('f'),
            new Sequence([
               new Literal('e'),
               new RepeatFromZero(2, new Literal('f')),
               new Literal('e')
            ])
         ])
      ])
   ]);

   var iterations = '<ol>\n';
   var enumerator = new Enumerator(iterator);
   enumerator.each(function(iteration) { iterations += '<li>' + iteration + '</li>\n'; });
   return iterations + '</ol>';
}

这会产生28场比赛,但我会省去你的输出。

This yields 28 matches, but I will spare you the output.

如果我的代码不符合软件模式,浏览器兼容(在Chrome和Firefox上运行正常)或者OOP不佳,我很抱歉。我希望它能让这个概念变得清晰。

My apologies if my code is not compliant to software patterns, is not browser-compatible (works OK on Chrome and Firefox) or suffers from poor OOP. I just hope it makes the concept clear.

编辑:为了完整性,并且在OP的倡议下,我又实现了一个迭代器类: reference

for completeness, and following OP's initiative, I implemented one more iterator class: the reference.

引用(\1 \\\ etc等)获取先前捕获组的当前匹配(即括号中的任何内容) 。它的实现与 Literal 非常相似,因为它只有一个匹配。

A reference (\1 \2 etc) picks up the current match of an earlier capturing group (i.e. anything in parentheses). Its implementation is very similar to Literal, in that it has exactly one match.

function Reference(iterator) { this.iterator = iterator; }

Reference.prototype.first = function() { this.i = 0; };
Reference.prototype.next  = function() { this.i++; };
Reference.prototype.ok    = function() { return this.i == 0; };
Reference.prototype.get   = function() { return this.iterator.get(); };
Reference.prototype.clone = function() { return new Reference(this.iterator); };

构造函数被赋予一个代表引用的子模式的迭代器。以(foo | bar)([xy])\2\1 为例(产生 fooxxfoo,fooyyfoo,barxxbar,baryybar ) :

The constructor is given an iterator that represents the referenced subpattern. Taking (foo|bar)([xy])\2\1 as an example (yields fooxxfoo, fooyyfoo, barxxbar, baryybar):

var groups = new Array();

var iterator = new Sequence([
   groups[1] = new Choice([new Literal('foo'), new Literal('bar')]),
   groups[2] = new CharacterClass('xy'),
   new Reference(groups[2]),
   new Reference(groups[1])
]);

在构建迭代器类树时指定捕获组。我仍然在这里手动执行此操作,但最终您希望将其自动化。这只是将您的解析树映射到类似的迭代器类树的问题。

Capturing groups are specified as you build up the tree of iterator classes. I am still doing that manually here, but eventually you want this to be automated. That is just a matter of mapping your parse tree to a similar tree of iterator classes.

编辑2:这里是一个相对简单的递归函数,将 ret.js 生成的解析树转换为迭代器。

EDIT 2: here's a relatively simple recursive function that will convert a parse tree produced by ret.js into an iterator.

function ParseTreeMapper() {
    this.capturingGroups = [];
}
ParseTreeMapper.prototype.mapToIterator = function(parseTree) {
    switch (parseTree.type) {
        case ret.types.ROOT:
        case ret.types.GROUP:
            var me = this;
            var mapToSequence = function(parseTrees) {
                return new Sequence(parseTrees.map(function(t) {
                    return me.mapToIterator(t);
                }));
            };
            var group = parseTree.options ?
                new Choice(parseTree.options.map(mapToSequence)) : 
                mapToSequence(parseTree.stack);
            if (parseTree.remember) {
                this.capturingGroups.push(group);
            }
            return group;
        case ret.types.SET:
            return new CharacterClass(this.mapToCharacterClass(parseTree.set));
        case ret.types.REPETITION:
            return new Repeat(parseInt(parseTree.min), parseInt(parseTree.max), this.mapToIterator(parseTree.value));
        case ret.types.REFERENCE:
            var ref = parseInt(parseTree.value) - 1;
            return ref in this.capturingGroups ?
                new Reference(this.capturingGroups[ref]) :
                new Literal('<ReferenceOutOfRange>');
        case ret.types.CHAR:
            return new Literal(String.fromCharCode(parseTree.value));
        default:
            return new Literal('<UnsupportedType>');
    }
};
ParseTreeMapper.prototype.mapToCharacterClass = function(parseTrees) {
    var chars = '';
    for (var i in parseTrees) {
        var tree = parseTrees[i];
        switch (tree.type) {
            case ret.types.CHAR:
                chars += String.fromCharCode(tree.value);
                break;
            case ret.types.RANGE:
                for (var code = tree.from; code <= tree.to; code++) {
                    chars += String.fromCharCode(code);
                }
                break;
        }
    }
    return chars;
};

用法:

var regex = 'b[a-n]{3}';
var parseTree = ret(regex);    // requires ret.js
var iterator = new ParseTreeMapper().mapToIterator(parseTree);

我把所有组件放在这个演示中: http://jsfiddle.net/Pmnwk/3/

I put all components together in this demo: http://jsfiddle.net/Pmnwk/3/

注意:许多正则表达式语法结构是不支持(锚点,前瞻,后瞻,递归),但我想它已经与invRegex.py相提并论了。

Note: many regex syntax constructs are not supported (anchors, look-ahead, look-behind, recursion), but I guess it is already pretty much up to par with invRegex.py.

这篇关于将invRegex.py移植到Javascript(Node.js)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆