优化字符串解析 [英] Optimize String Parsing

查看:158
本文介绍了优化字符串解析的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要以txf格式解析数据文件。这些文件可能包含1000多个条目。由于格式定义如JSON,我想制作一个像JSON这样的通用解析器,它可以序列化和反序列化txf文件。

I have a requirement to parse data files in "txf" format. The files may contain more than 1000 entries. Since the format is well defined like JSON, I wanted to make a generic parser like JSON, which can serialise and deserialise txf files.

与JSON相反,标记为没有办法识别对象或数组。如果出现具有相同标记的条目,我们需要将其视为数组。

On contrary to JSON, the mark up doesn't have a way to identify an object or an array. If an entry with same tag occurs, we need to consider it as an array.


  1. 标记对象的开头。

  2. $ 标记对象的成员

  3. / 标记对象的结尾

  1. # Marks the start of an object.
  2. $ Marks the members of an object
  3. / Marks the end of an object

以下是示例txf文件

#Employees
$LastUpdated=2015-02-01 14:01:00
#Employee
$Id=1
$Name=Employee 01
#Departments
$LastUpdated=2015-02-01 14:01:00
#Department
$Id=1
$Name=Department Name
/Department
/Departments
/Employee
#Employee
/Employee
/Employees

我能够创建一个通用的 TXF Parser 使用NSScanner。但是有了更多的条目,性能需要更多的调整。

I was able to create a generic TXF Parser using NSScanner. But with more entries the performance needs more tweaking.

我写了基础对象获得的 plist 并比较了它的性能再次我写的解析器。我的解析器比 plist 解析器慢大约10倍。

I wrote the foundation object obtained as plist and compared its performance again the parser I wrote. My parser is around 10 times slower than plist parser.

虽然 plist 文件大小是 txf 并且有更多的标记字符,我觉得有很多优化空间。

While plist file size is 5 times more than txf and has more markup characters, I feel that there is a lot of room for optimization.

非常感谢在这方面的任何帮助。

Any help in that direction is highly appreciated.

编辑:包括解析代码

static NSString *const kArray    = @"TXFArray";
static NSString *const kBodyText = @"TXFText";

@interface TXFParser ()

/*Temporary variable to hold values of an object*/
@property (nonatomic, strong) NSMutableDictionary *dict;

/*An array to hold the hierarchial data of all nodes encountered while parsing*/
@property (nonatomic, strong) NSMutableArray *stack;

@end

@implementation TXFParser

#pragma mark - Getters

- (NSMutableArray *)stack{
    if (!_stack) {
        _stack = [NSMutableArray new];
    }return _stack;
}

#pragma mark -

- (id)objectFromString:(NSString *)txfString{
    [txfString enumerateLinesUsingBlock:^(NSString *string, BOOL *stop) {
        if ([string hasPrefix:@"#"]) {
            [self didStartParsingTag:[string substringFromIndex:1]];
        }else if([string hasPrefix:@"$"]){
            [self didFindKeyValuePair:[string substringFromIndex:1]];
        }else if([string hasPrefix:@"/"]){
            [self didEndParsingTag:[string substringFromIndex:1]];
        }else{
            //[self didFindBodyValue:string];
        }
    }]; return self.dict;
}

#pragma mark -

- (void)didStartParsingTag:(NSString *)tag{
    [self parserFoundObjectStartForKey:tag];
}

- (void)didFindKeyValuePair:(NSString *)tag{
    NSArray *components = [tag componentsSeparatedByString:@"="];
    NSString *key = [components firstObject];
    NSString *value = [components lastObject];

    if (key.length) {
        self.dict[key] = value?:@"";
    }
}

- (void)didFindBodyValue:(NSString *)bodyString{
    if (!bodyString.length) return;
    bodyString = [bodyString stringByTrimmingCharactersInSet:[NSCharacterSet illegalCharacterSet]];
    if (!bodyString.length) return;

    self.dict[kBodyText] = bodyString;
}

- (void)didEndParsingTag:(NSString *)tag{
    [self parserFoundObjectEndForKey:tag];
}

#pragma mark -

- (void)parserFoundObjectStartForKey:(NSString *)key{
    self.dict = [NSMutableDictionary new];
    [self.stack addObject:self.dict];
}

- (void)parserFoundObjectEndForKey:(NSString *)key{
    NSDictionary *dict = self.dict;

    //Remove the last value of stack
    [self.stack removeLastObject];

    //Load the previous object as dict
    self.dict = [self.stack lastObject];

    //The stack has contents, then we need to append objects
    if ([self.stack count]) {
        [self addObject:dict forKey:key];
    }else{
        //This is root object,wrap with key and assign output
        self.dict = (NSMutableDictionary *)[self wrapObject:dict withKey:key];
    }
}

#pragma mark - Add Objects after finding end tag

- (void)addObject:(id)dict forKey:(NSString *)key{
    //If there is no value, bailout
    if (!dict) return;

    //Check if the dict already has a value for key array.
    NSMutableArray *array =  self.dict[kArray];

    //If array key is not found look for another object with same key
    if (array) {
        //Array found add current object after wrapping with key
        NSDictionary *currentDict = [self wrapObject:dict withKey:key];
        [array addObject:currentDict];
    }else{
        id prevObj = self.dict[key];
        if (prevObj) {
            /*
             There is a prev value for the same key. That means we need to wrap that object in a collection.
             1. Remove the object from dictionary,
             2. Wrap it with its key
             3. Add the prev and current value to array
             4. Save the array back to dict
             */
            [self.dict removeObjectForKey:key];
            NSDictionary *prevDict = [self wrapObject:prevObj withKey:key];
            NSDictionary *currentDict = [self wrapObject:dict withKey:key];
            self.dict[kArray] = [@[prevDict,currentDict] mutableCopy];

        }else{
            //Simply add object to dict
            self.dict[key] = dict;
        }
    }
}

/*Wraps Object with a key for the serializer to generate txf tag*/
- (NSDictionary *)wrapObject:(id)obj withKey:(NSString *)key{
    if (!key ||!obj) {
        return @{};
    }
    return @{key:obj};
}

编辑2:

具有1000多个条目的 TXF文件示例。

A sample TXF file with more than 1000 entries.

推荐答案

您是否考虑过使用拉式阅读&递归处理?这将消除将整个文件读入内存并消除管理一些自己的堆栈以跟踪您正在解析的深度。

Have you considered using pull-style reads & recursive processing? That would eliminate reading the whole file into memory and also eliminate managing some own stack to keep track how deep you're parsing.

下面的Swift示例。该示例适用于您的示例txf,但不适用于Dropbox版本;你的一些成员跨越多行。如果这是一个要求,它可以很容易地实现到开关/案例$部分。但是,我也没有看到你自己的代码处理它。此外,该示例尚未遵循正确的Swift错误处理(解析方法需要额外的 NSError 参数)

Below an example in Swift. The example works with your sample "txf", but not with the dropbox version; some of your "members" span over multiple lines. If this is a requirement, it can easily be implemented into switch/case "$" section. However, I don't see your own code handling this either. Also, the example doesn't follow the correct Swift error handling yet (the parse method would need an additional NSError parameter)

import Foundation

extension String
{
    public func indexOfCharacter(char: Character) -> Int? {
        if let idx = find(self, char) {
            return distance(self.startIndex, idx)
        }
        return nil
    }

    func substringToIndex(index:Int) -> String {
        return self.substringToIndex(advance(self.startIndex, index))
    }
    func substringFromIndex(index:Int) -> String {
        return self.substringFromIndex(advance(self.startIndex, index))
    }
}


func parse(aStreamReader:StreamReader, parentTagName:String) -> Dictionary<String,AnyObject> {
    var dict = Dictionary<String,AnyObject>()

    while let line = aStreamReader.nextLine() {

        let firstChar = first(line)
        let theRest = dropFirst(line)

        switch firstChar! {
        case "$":
            if let idx = theRest.indexOfCharacter("=") {
                let key = theRest.substringToIndex(idx)
                let value = theRest.substringFromIndex(idx+1)

                dict[key] = value
            } else {
                println("no = sign")
            }
        case "#":
            let subDict = parse(aStreamReader,theRest)

            var list = dict[theRest] as? [Dictionary<String,AnyObject>]
            if list == nil {
                dict[theRest] = [subDict]
            } else {
                list!.append(subDict)
            }
        case "/":
            if theRest != parentTagName {
                println("mismatch... [\(theRest)] != [\(parentTagName)]")
            } else {
                return dict
            }
        default:
            println("mismatch... [\(line)]")
        }
    }

    println("shouldn't be here...")
    return dict
}


var data : Dictionary<String,AnyObject>?

if let aStreamReader = StreamReader(path: "/Users/taoufik/Desktop/QuickParser/QuickParser/file.txf") {

    if var line = aStreamReader.nextLine() {
        let tagName = line.substringFromIndex(advance(line.startIndex, 1))
        data = parse(aStreamReader, tagName)
    }

    aStreamReader.close()
}

println(JSON(data!))

StreamReader 是从 https://stackoverflow.com/a借来的/ 24648951/95976

  • see full code https://github.com/tofi9/QuickParser
  • pull-style line-by-line read in objective-c: How to read data from NSFileHandle line by line?

我在C ++ 11中重写了上述内容并让它在2012年的不到0.05秒(发布模式)下运行MBA I5使用dropbox上的更新文件。我怀疑 NSDictionary NSArray 必须有一些惩罚。下面的代码可以编译成一个objective-c项目(文件需要有扩展名.mm):

I rewrote the above in C++11 and got it to run in less than 0.05 seconds (release mode) on a 2012 MBA I5 using the updated file on dropbox. I suspect NSDictionary and NSArray must have some penalty. The code below can be compiled into an objective-c project (file needs have extension .mm):

#include <iostream>
#include <sstream>
#include <string>
#include <fstream>
#include <map>
#include <vector>

using namespace std;


class benchmark {

private:
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::milliseconds milliseconds;

    clock::time_point start;

public:
    benchmark(bool startCounting = true) {
        if(startCounting)
            start = clock::now();
    }

    void reset() {
        start = clock::now();
    }

    double elapsed() {
        milliseconds ms = std::chrono::duration_cast<milliseconds>(clock::now() - start);
        double elapsed_secs = ms.count() / 1000.0;
        return elapsed_secs;
    }
};

struct obj {
    map<string,string> properties;
    map<string,vector<obj>> subObjects;
};

obj parse(ifstream& stream, string& parentTagName) {
    obj obj;
    string line;
    while (getline(stream, line))
    {
        auto firstChar = line[0];
        auto rest = line.substr(1);

        switch (firstChar) {
            case '$': {
                auto idx = rest.find_first_of('=');

                if (idx == -1) {
                    ostringstream o;
                    o << "no = sign: " << line;
                    throw o.str();
                }
                auto key = rest.substr(0,idx);
                auto value = rest.substr(idx+1);
                obj.properties[key] = value;
                break;
            }
            case '#': {
                auto subObj = parse(stream, rest);
                obj.subObjects[rest].push_back(subObj);
                break;
            }
            case '/':
                if(rest != parentTagName) {
                    ostringstream o;
                    o << "mismatch end of object " << rest << " != " << parentTagName;
                    throw o.str();
                } else {
                    return obj;
                }
                break;
            default:
                ostringstream o;
                o << "mismatch line " << line;
                throw o.str();
                break;
        }

    }

    throw "I don't know why I'm here. Probably because the file is missing an end of object marker";
}


void visualise(obj& obj, int indent = 0) {
    for(auto& property : obj.properties) {
        cout << string(indent, '\t') << property.first << " = " << property.second << endl;
    }

    for(auto& subObjects : obj.subObjects) {
        for(auto& subObject : subObjects.second) {
            cout << string(indent, '\t') << subObjects.first << ": " << endl;
            visualise(subObject, indent + 1);
        }
    }
}

int main(int argc, const char * argv[]) {
    try {
        obj result;

        benchmark b;
        ifstream stream("/Users/taoufik/Desktop/QuickParser/QuickParser/Members.txf");
        string line;
        if (getline(stream, line))
        {
            string tagName = line.substr(1);
            result = parse(stream, tagName);
        }

        cout << "elapsed " << b.elapsed() <<  " ms" << endl;

        visualise(result);

    }catch(string s) {
        cout << "error " << s;
    }

    return 0;
}



编辑3



请参阅完整代码C ++的链接: https://github.com/tofi9/TxfParser

这篇关于优化字符串解析的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆