将 invRegex.py 移植到 Javascript (Node.js) [英] Porting invRegex.py to Javascript (Node.js)

查看:21
本文介绍了将 invRegex.py 移植到 Javascript (Node.js)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在尝试移植 invRegex.py 到 node.js 实现一段时间,但我仍在努力解决它.感谢 ret.js 标记器,我已经有了正则表达式解析树,它运行良好,但是以一种内存效率高的方式实际生成和连接所有不同元素对我来说是非常具有挑战性的.为了简单起见,假设我有以下正则表达式:

I have been trying to port invRegex.py to a node.js implementation for a while, but I'm still struggling with it. I already have the regular expression parse tree thanks to the ret.js tokenizer and it works pretty well, but the actual generation and concatenation of all the distinct elements in a way that is memory-efficient is revealing very challenging to me. To keep it simple, lets say I have the following regex:

[01]{1,2}@[a-f]

将其提供给 invRegex.py 会产生以下输出(tabbified 以占用更少的空间):

Feeding that to invRegex.py produces the following output (tabbified to take less space):

 0@a     0@b     0@c     0@d     0@e     0@f
00@a    00@b    00@c    00@d    00@e    00@f
01@a    01@b    01@c    01@d    01@e    01@f
 1@a     1@b     1@c     1@d     1@e     1@f
10@a    10@b    10@c    10@d    10@e    10@f
11@a    11@b    11@c    11@d    11@e    11@f

考虑到我能够获取每个单独的令牌并生成一个包含所有有效单独输出的数组:

Considering I'm able to get each individual token and produce an array of all the valid individual outputs:

[01]{1,2} = function () {
    return ['0', '00', '01', '1', '10', '11'];
};

@ = function () {
    return ['@'];
};

[a-f] = function () {
    return ['a', 'b', 'c', 'd', 'e', 'f'];
};

我可以计算所有数组的笛卡尔积并得到相同的预期输出:

I can compute the cartesian product of all the arrays and get the same expected output:

var _ = require('underscore');

function cartesianProductOf() {
    return _.reduce(arguments, function(a, b) {
        return _.flatten(_.map(a, function(x) {
            return _.map(b, function(y) {
                return x.concat([y]);
            });
        }), true);
    }, [ [] ]);
};

var tokens = [
    ['0', '00', '01', '1', '10', '11'],
    ['@'],
    ['a', 'b', 'c', 'd', 'e', 'f'],
];

var result = cartesianProductOf(tokens[0], tokens[1], tokens[2]);

_.each(result, function (value, key) {
    console.log(value.join(''));
});

问题在于它在内存中保存了所有 36 个值,如果我有一个稍微复杂一点的正则表达式,例如 [az]{0,10} 它会保存 146813779479511 内存中的值,这是完全不可行的.我想以异步方式处理这个庞大的列表,将每个生成的组合传递给回调,并允许我在我认为合适的任何合理点中断该过程,就像 invRegex.py 或 这个 Haskell 包 - 不幸的是,我无法理解 Haskell,我也不知道如何将 Python 中的生成器行为模仿为 Javascript.

The problem with this is that it holds all the 36 values in memory, if I had a slightly more complicated regular expression, such as [a-z]{0,10} it would hold 146813779479511 values in memory, which is totally unfeasible. I would like to process this huge list in an asynchronous fashion, passing each generated combination to a callback and allowing me to interrupt the process at any sensible point I see fit, much like invRegex.py or this Haskell package - unfortunately I can't understand Haskell and I don't know how to mimic the generator behavior in Python to Javascript either.

我尝试在节点 0.11.9(使用 --harmony)中运行几个简单的生成器实验,如下所示:

I tried running a couple of simple generator experiments in node 0.11.9 (with --harmony) like this one:

function* alpha() {
    yield 'a'; yield 'b'; yield 'c';
}

function* numeric() {
    yield '0'; yield '1';
}

function* alphanumeric() {
    yield* alpha() + numeric(); // what's the diff between yield and yield*?
}

for (var i of alphanumeric()) {
    console.log(i);
}

不用说上面的行不通.=/

Needless to say the above doesn't work. =/

在这里把我的头撞到墙上,所以任何解决这个问题的帮助将不胜感激.

Banging my head against the wall here, so any help tackling this problem would be highly appreciated.

更新:这是 b[a-z]{3} 的示例 ret.js 解析树:

UPDATE: Here is a sample ret.js parse tree for b[a-z]{3}:

{
    "type": ret.types.ROOT,
    "stack": [
            {
                "type": ret.types.CHAR,
                "value": 98 // b
            },
            {
                "type": ret.types.REPETITION,
                "max": 3,
                "min": 3,
                "value": {
                    "type": ret.types.SET,
                    "not": false,
                    "set": [
                        {
                            "type": ret.types.RANGE,
                            "from": 97, // a
                            "to": 122   // z
                        }
                    ]
                }
            }
        ]
    ]
}

SET/RANGE 类型应该产生 26 个不同的值,而父 REPETITION 类型应该将之前的值取 3 次方,产生 17576 个不同的组合.如果我要生成一个扁平化的 tokens 数组,就像我之前为 cartesianProductOf 所做的那样,中间扁平化值将占用与实际笛卡尔积本身一样多的空间.

The SET / RANGE type should yield 26 distinct values, and the parent REPETITION type should take that previous value to the power of 3, yielding 17576 distinct combinations. If I was to generate a flattened out tokens array like I did before for cartesianProductOf, the intermediate flattened values would take as much space as the actual cartesian product itself.

我希望这个例子能更好地解释我面临的问题.

I hope this example explains better the problem I am facing.

推荐答案

我建议你写迭代器类.它们易于实现(基本上它们是状态机),内存占用少,可以组合起来构建越来越复杂的表达式(请向下滚动查看最终结果),生成的迭代器可以包装在枚举器.

I advise you to write iterator classes. They are easy to implement (basically they are state machines), they have a low memory footprint, they can be combined to build up increasingly complex expressions (please scroll down to see the final result), and the resulting iterator can be wrapped in an enumerator.

每个迭代器类都有以下方法:

Each iterator class has the following methods:

  • first:初始化状态机(第一次匹配)
  • 下一个:进入下一个状态(下一场比赛)
  • ok:最初为真,但一旦下一个"超出最后一个匹配项,则变为假
  • get: 返回当前匹配项(作为字符串)
  • 克隆: 克隆对象;对重复至关重要,因为每个实例都需要自己的状态
  • first: initializes the state machine (first match)
  • next: proceeds to the next state (next match)
  • ok: initially true, but becomes false once 'next' proceeds beyond the last match
  • get: returns the current match (as a string)
  • clone: clones the object; essential for repetition, because each instance needs its own state

从最简单的情况开始:应按字面意思匹配的一个或多个字符序列(例如 /foo/).不用说这只有一个匹配项,所以在第一次调用 'next' 时,'ok' 将变为 false.

Start off with the most trivial case: a sequence of one or more characters that should be matched literally (e.g. /foo/). Needless to say this has only one match, so 'ok' will become false upon the first call of 'next'.

function Literal(literal) { this.literal = literal; }

Literal.prototype.first = function() { this.i = 0; };
Literal.prototype.next = function() { this.i++; };
Literal.prototype.ok = function() { return this.i == 0; };
Literal.prototype.get = function() { return this.literal; };
Literal.prototype.clone = function() { return new Literal(this.literal); };

字符类 ([abc]) 也很简单.构造函数接受一个字符串;如果你更喜欢数组,这很容易解决.

Character classes ([abc]) are trivial too. The constructor accepts a string of characters; if you prefer arrays, that's easy to fix.

function CharacterClass(chars) { this.chars = chars; }

CharacterClass.prototype.first = function() { this.i = 0; };
CharacterClass.prototype.next = function() { this.i++; };
CharacterClass.prototype.ok = function() { return this.i < this.chars.length; };
CharacterClass.prototype.get = function() { return this.chars.charAt(this.i); };
CharacterClass.prototype.clone = function() { return new CharacterClass(this.chars); };

现在我们需要结合其他迭代器来形成更复杂的正则表达式的迭代器.序列只是连续的两个或多个模式(如 foo[abc]).

Now we need iterators that combine other iterators to form more complex regular expressions. A sequence is just two or more patterns in a row (like foo[abc]).

function Sequence(iterators) {
   if (arguments.length > 0) {
      this.iterators = iterators.length ? iterators : [new Literal('')];
   }
}
Sequence.prototype.first = function() {
   for (var i in this.iterators) this.iterators[i].first();
};
Sequence.prototype.next = function() {
   if (this.ok()) {
      var i = this.iterators.length;
      while (this.iterators[--i].next(), i > 0 && !this.iterators[i].ok()) {
         this.iterators[i].first();
      }
   }
};
Sequence.prototype.ok = function() {
   return this.iterators[0].ok();
};
Sequence.prototype.get = function() {
   var retval = '';
   for (var i in this.iterators) {
      retval += this.iterators[i].get();
   }
   return retval;
};
Sequence.prototype.clone = function() {
   return new Sequence(this.iterators.map(function(it) { return it.clone(); }));
};

另一种组合迭代器的方法是选择(也称为替代品),例如foo|bar.

Another way to combine iterators is the choice (a.k.a. alternatives), e.g. foo|bar.

function Choice(iterators) { this.iterators = iterators; }

Choice.prototype.first = function() {
   this.count = 0;
   for (var i in this.iterators) this.iterators[i].first();
};
Choice.prototype.next = function() {
   if (this.ok()) {
      this.iterators[this.count].next();
      while (this.ok() && !this.iterators[this.count].ok()) this.count++;
   }
};
Choice.prototype.ok = function() {
   return this.count < this.iterators.length;
};
Choice.prototype.get = function() {
   return this.iterators[this.count].get();
};
Choice.prototype.clone = function() {
   return new Choice(this.iterators.map(function(it) { return it.clone(); }));
};

其他正则表达式功能可以通过组合现有的类来实现.类继承是一个很好的方法来做到这一点.例如,可选模式 (x?) 只是在空字符串和 x 之间进行选择.

Other regex features can be implemented by combining the existing classes. Class inheritance is a great way to do this. For example, an optional pattern (x?) is just a choice between the empty string and x.

function Optional(iterator) {
   if (arguments.length > 0) {
      Choice.call(this, [new Literal(''), iterator]);
   }
}
Optional.prototype = new Choice();

重复 (x{n,m}) 是序列和可选的组合.因为我必须继承其中一个,所以我的实现由两个相互依赖的类组成.

Repetition (x{n,m}) is a combination of Sequence and Optional. Because I have to inherit one or the other, my implementation consists of two mutually dependent classes.

function RepeatFromZero(maxTimes, iterator) {
   if (arguments.length > 0) {
      Optional.call(this, new Repeat(1, maxTimes, iterator));
   }
}
RepeatFromZero.prototype = new Optional();

function Repeat(minTimes, maxTimes, iterator) {
   if (arguments.length > 0) {
      var sequence = [];
      for (var i = 0; i < minTimes; i++) {
         sequence.push(iterator.clone());   // need to clone the iterator
      }
      if (minTimes < maxTimes) {
         sequence.push(new RepeatFromZero(maxTimes - minTimes, iterator));
      }
      Sequence.call(this, sequence);
   }
}
Repeat.prototype = new Sequence();

正如我之前所说,迭代器可以包装到枚举器中.这只是一个循环,您可以随时中断.

As I said earlier, an iterator can be wrapped into an enumerator. This is simply a loop that you can break whenever you want.

function Enumerator(iterator) {
   this.iterator = iterator;

   this.each = function(callback) {
      for (this.iterator.first(); this.iterator.ok(); this.iterator.next()) {
         callback(this.iterator.get());
      }
   };
}

是时候把它们放在一起了.让我们用一些愚蠢的正则表达式:

Time to put it all together. Let's take some silly regular expression:

([ab]{2}){1,2}|[cd](f|ef{0,2}e)

组合迭代器对象非常简单:

Composing the iterator object is really straightforward:

function GetIterationsAsHtml() {

   var iterator = new Choice([
      new Repeat(1, 2,
         new Repeat(2, 2, new CharacterClass('ab'))),
      new Sequence([
         new CharacterClass('cd'),
         new Choice([
            new Literal('f'),
            new Sequence([
               new Literal('e'),
               new RepeatFromZero(2, new Literal('f')),
               new Literal('e')
            ])
         ])
      ])
   ]);

   var iterations = '<ol>
';
   var enumerator = new Enumerator(iterator);
   enumerator.each(function(iteration) { iterations += '<li>' + iteration + '</li>
'; });
   return iterations + '</ol>';
}

这会产生 28 个匹配项,但我会省去你的输出.

This yields 28 matches, but I will spare you the output.

如果我的代码不符合软件模式、不兼容浏览器(在 Chrome 和 Firefox 上运行良好)或 OOP 不佳,我深表歉意.我只是希望它能让这个概念变得清晰.

My apologies if my code is not compliant to software patterns, is not browser-compatible (works OK on Chrome and Firefox) or suffers from poor OOP. I just hope it makes the concept clear.

为了完整起见,在 OP 的倡议下,我又实现了一个迭代器类:reference.

for completeness, and following OP's initiative, I implemented one more iterator class: the reference.

引用(1 2 等)获取较早捕获组的当前匹配项(即括号中的任何内容).它的实现与 Literal 非常相似,因为它只有一个匹配项.

A reference (1 2 etc) picks up the current match of an earlier capturing group (i.e. anything in parentheses). Its implementation is very similar to Literal, in that it has exactly one match.

function Reference(iterator) { this.iterator = iterator; }

Reference.prototype.first = function() { this.i = 0; };
Reference.prototype.next  = function() { this.i++; };
Reference.prototype.ok    = function() { return this.i == 0; };
Reference.prototype.get   = function() { return this.iterator.get(); };
Reference.prototype.clone = function() { return new Reference(this.iterator); };

构造函数被赋予一个迭代器,它代表被引用的子模式.以 (foo|bar)([xy])21 为例(产生 fooxxfoo, fooyyfoo, barxxbar, baryybar):

The constructor is given an iterator that represents the referenced subpattern. Taking (foo|bar)([xy])21 as an example (yields fooxxfoo, fooyyfoo, barxxbar, baryybar):

var groups = new Array();

var iterator = new Sequence([
   groups[1] = new Choice([new Literal('foo'), new Literal('bar')]),
   groups[2] = new CharacterClass('xy'),
   new Reference(groups[2]),
   new Reference(groups[1])
]);

在构建迭代器类树时指定捕获组.我仍然在这里手动执行此操作,但最终您希望将其自动化.这只是将您的解析树映射到类似的迭代器类树的问题.

Capturing groups are specified as you build up the tree of iterator classes. I am still doing that manually here, but eventually you want this to be automated. That is just a matter of mapping your parse tree to a similar tree of iterator classes.

编辑 2: 这是一个相对简单的递归函数,它将转换由 ret.js 变成一个迭代器.

EDIT 2: here's a relatively simple recursive function that will convert a parse tree produced by ret.js into an iterator.

function ParseTreeMapper() {
    this.capturingGroups = [];
}
ParseTreeMapper.prototype.mapToIterator = function(parseTree) {
    switch (parseTree.type) {
        case ret.types.ROOT:
        case ret.types.GROUP:
            var me = this;
            var mapToSequence = function(parseTrees) {
                return new Sequence(parseTrees.map(function(t) {
                    return me.mapToIterator(t);
                }));
            };
            var group = parseTree.options ?
                new Choice(parseTree.options.map(mapToSequence)) : 
                mapToSequence(parseTree.stack);
            if (parseTree.remember) {
                this.capturingGroups.push(group);
            }
            return group;
        case ret.types.SET:
            return new CharacterClass(this.mapToCharacterClass(parseTree.set));
        case ret.types.REPETITION:
            return new Repeat(parseInt(parseTree.min), parseInt(parseTree.max), this.mapToIterator(parseTree.value));
        case ret.types.REFERENCE:
            var ref = parseInt(parseTree.value) - 1;
            return ref in this.capturingGroups ?
                new Reference(this.capturingGroups[ref]) :
                new Literal('<ReferenceOutOfRange>');
        case ret.types.CHAR:
            return new Literal(String.fromCharCode(parseTree.value));
        default:
            return new Literal('<UnsupportedType>');
    }
};
ParseTreeMapper.prototype.mapToCharacterClass = function(parseTrees) {
    var chars = '';
    for (var i in parseTrees) {
        var tree = parseTrees[i];
        switch (tree.type) {
            case ret.types.CHAR:
                chars += String.fromCharCode(tree.value);
                break;
            case ret.types.RANGE:
                for (var code = tree.from; code <= tree.to; code++) {
                    chars += String.fromCharCode(code);
                }
                break;
        }
    }
    return chars;
};

用法:

var regex = 'b[a-n]{3}';
var parseTree = ret(regex);    // requires ret.js
var iterator = new ParseTreeMapper().mapToIterator(parseTree);

我在这个演示中将所有组件放在一起:http://jsfiddle.net/Pmnwk/3/

I put all components together in this demo: http://jsfiddle.net/Pmnwk/3/

注意:不支持许多正则表达式语法结构(锚、前瞻、后视、递归),但我想它已经与 invRegex.py 相当.

Note: many regex syntax constructs are not supported (anchors, look-ahead, look-behind, recursion), but I guess it is already pretty much up to par with invRegex.py.

这篇关于将 invRegex.py 移植到 Javascript (Node.js)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆