Why is this program faster in Python than Objective-C?

Question

I got interested in this small example of a Python algorithm for looping through a large word list. I am writing a few "tools" that will allow me to slice an Objective-C string or array in a similar fashion to Python.

Specifically, this elegant solution caught my attention because it executes very quickly and uses a string slice as a key element of the algorithm. Try to solve this without a slice!

I have reproduced my local version below, using the Moby word list. You can use /usr/share/dict/words if you do not feel like downloading Moby. The source is just a large dictionary-like list of unique words.

#!/usr/bin/env python

count=0
words = set(line.strip() for line in   
           open("/Users/andrew/Downloads/Moby/mwords/354984si.ngl"))
for w in words:
    even, odd = w[::2], w[1::2]
    if even in words and odd in words:
        count+=1

print count      

This script will a) be interpreted by Python; b) read the 4.1 MB, 354,983-word Moby dictionary file; c) strip the lines; d) place the lines into a set; and e) find every word whose even-indexed characters and odd-indexed characters each also form a word in the set. This executes in about 0.73 seconds on a MacBook Pro.
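As a minimal sketch of steps c) through e), the following uses a tiny hard-coded word set in place of the Moby file. The sample words and the function name `count_split_words` are illustrative assumptions, not part of the original script, and the sketch is written for Python 3, whereas the script above uses the Python 2 `print` statement.

```python
def count_split_words(words):
    """Count words whose even- and odd-indexed characters both form words."""
    count = 0
    for w in words:
        # w[::2] takes characters 0, 2, 4, ...; w[1::2] takes 1, 3, 5, ...
        even, odd = w[::2], w[1::2]
        if even in words and odd in words:
            count += 1
    return count


# Hypothetical sample data standing in for the Moby word list.
sample = {"schooled", "shoe", "cold", "python"}

print("schooled"[::2], "schooled"[1::2])  # shoe cold
print(count_split_words(sample))          # 1
```

Only "schooled" qualifies here, because its two slices ("shoe" and "cold") are both in the set.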

I tried to rewrite the same program in Objective-C. I am a beginner at this language, so go easy please, but do point out the errors.

#import <Foundation/Foundation.h>

NSString *sliceString(NSString *inString, NSUInteger start, NSUInteger stop, 
        NSUInteger step){
    NSUInteger strLength = [inString length];

    if(stop > strLength) {
        stop = strLength;
    }

    if(start > strLength) {
        start = strLength;
    }

    NSUInteger capacity = (stop-start)/step;
    NSMutableString *rtr=[NSMutableString stringWithCapacity:capacity];    

    for(NSUInteger i=start; i < stop; i+=step){
        [rtr appendFormat:@"%c",[inString characterAtIndex:i]];
    }
    return rtr;
}

NSSet * getDictWords(NSString *path){

    NSError *error = nil;
    NSString *words = [[NSString alloc] initWithContentsOfFile:path
                         encoding:NSUTF8StringEncoding error:&error];
    NSCharacterSet *sep=[NSCharacterSet newlineCharacterSet];
    NSPredicate *noEmptyStrings = 
                     [NSPredicate predicateWithFormat:@"SELF != ''"];

    if (words == nil) {
        // deal with error ...
    }
    // ...

    NSArray *temp=[words componentsSeparatedByCharactersInSet:sep];
    NSArray *lines = 
        [temp filteredArrayUsingPredicate:noEmptyStrings];

    NSSet *rtr=[NSSet setWithArray:lines];

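    // %lu expects unsigned long; the stray trailing 'l' in "%lul" was printed literally
    NSLog(@"lines: %lu, word set: %lu",
          (unsigned long)[lines count], (unsigned long)[rtr count]);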
    [words release];

    return rtr;    
}

int main (int argc, const char * argv[])
{
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];

    int count=0;

    NSSet *dict = 
       getDictWords(@"/Users/andrew/Downloads/Moby/mwords/354984si.ngl");

    NSLog(@"Start");

    for(NSString *element in dict){
        NSString *odd_char=sliceString(element, 1,[element length], 2);
        NSString *even_char=sliceString(element, 0, [element length], 2);
        if([dict member:even_char] && [dict member:odd_char]){
            count++;
        }

    }    
    NSLog(@"count=%i",count);

    [pool drain];
    return 0;
}

The Objective-C version produces the same result (13,341 words), but takes almost 3 seconds to do it. I must be doing something atrociously wrong for a compiled language to be more than 3X slower than a scripted language, but I'll be darned if I can see why.

The basic algorithm is the same: read the lines, strip them, and put them in a set.

My guess is that the slow part is the processing of the NSString elements, but I do not know an alternative.

Edit

I edited the Python to be this:

#!/usr/bin/env python
import codecs
count=0
words = set(line.strip() for line in 
     codecs.open("/Users/andrew/Downloads/Moby/mwords/354984si.ngl",
          encoding='utf-8'))
for w in words:
    if w[::2] in words and w[1::2] in words:
        count+=1

print count 

This makes the Python version decode the file as UTF-8, putting it on the same footing as the Objective-C version, which reads the file into an NSString with UTF-8 encoding. It slowed the Python down to 1.9 seconds.

I also switched the slice test to a short-circuit form, as suggested, in both the Python and Objective-C versions. Now they are close to the same speed. I also tried using C arrays rather than NSStrings, and this was much faster, but not as easy. You also lose UTF-8 support doing that.

Python is really cool...

Edit 2

I found a bottleneck, and removing it sped things up considerably. Instead of using [rtr appendFormat:@"%c",[inString characterAtIndex:i]]; to append a character to the return string, I used this:

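unichar buf[1];   // one-character scratch buffer (declaration not shown in the original snippet)

for(NSUInteger i=start; i < stop; i+=step){
    // Copy the UTF-16 code unit directly instead of formatting it with "%c"
    buf[0]=[inString characterAtIndex:i];
    [rtr appendString:[NSString stringWithCharacters:buf length:1]];
}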

Now I can finally claim that the Objective-C version is faster than the Python version, but not by much.

Recommended answer

Keep in mind that the Python version has been written to move a lot of the heavy lifting down into highly optimised C code when executed on CPython (especially the file input buffering, the string slicing, and the hash table lookups that check whether even and odd are in words).
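To get a rough feel for this point, the sketch below compares the built-in slice with a hand-written loop that does the same job at the Python level. The helper `slice_by_loop` is a hypothetical illustration (it only handles non-negative bounds and a positive step), written for Python 3. On CPython the built-in slice is usually the faster of the two, but the actual timings depend on the machine and interpreter, so none are quoted here.

```python
import timeit


def slice_by_loop(s, start, stop, step):
    """Pure-Python equivalent of s[start:stop:step] for non-negative bounds, step > 0."""
    chars = []
    for i in range(start, min(stop, len(s)), step):
        chars.append(s[i])
    return "".join(chars)


word = "schooled"

# Both approaches must agree before their timings are worth comparing.
assert slice_by_loop(word, 0, len(word), 2) == word[::2]
assert slice_by_loop(word, 1, len(word), 2) == word[1::2]

loop_time = timeit.timeit(lambda: slice_by_loop(word, 0, len(word), 2), number=100_000)
builtin_time = timeit.timeit(lambda: word[::2], number=100_000)
print(f"loop: {loop_time:.3f}s  built-in slice: {builtin_time:.3f}s")
```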

That said, you seem to be decoding the file as UTF-8 in your Objective-C code, but leaving the file in binary in your Python code. Using Unicode NSString in the Objective-C version but 8-bit byte strings in the Python version isn't really a fair comparison. I would expect the performance of the Python version to drop noticeably if you used codecs.open() to open the file with a declared encoding of "utf-8".
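The difference between the two representations can be sketched in Python 3, where byte strings and Unicode strings are distinct types (in Python 2, a plain `open()` yields byte strings and `codecs.open()` yields Unicode strings). The in-memory sample file and the helper `load_words` are illustrative assumptions.

```python
import io

# Hypothetical four-word file, encoded as UTF-8 bytes.
RAW = "shoe\ncold\nschooled\nna\u00efve\n".encode("utf-8")


def load_words(stream):
    """Strip each line and collect the results in a set."""
    return {line.strip() for line in stream}


# Binary mode: every element is a bytes object, no decoding is done.
byte_words = load_words(io.BytesIO(RAW))

# Text mode: each line is decoded from UTF-8 into a Unicode string.
text_words = load_words(io.TextIOWrapper(io.BytesIO(RAW), encoding="utf-8"))

print(b"schooled"[::2] in byte_words)  # True
print("schooled"[::2] in text_words)   # True

# A non-ASCII word has a different length in each representation, so a
# byte slice and a character slice do not pick the same elements.
print(len("na\u00efve".encode("utf-8")), len("na\u00efve"))  # 6 5
```

For pure ASCII words the two versions find the same matches; the decoded version simply does extra work up front, which is the cost the answer is pointing at.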

You're also making a full second pass to strip the empty lines in your Objective-C code, while no such step is present in the Python code.
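If blank lines do need to be dropped, the filter can be folded into the same pass that builds the set instead of running over an intermediate array afterwards. A minimal Python 3 sketch, assuming a hypothetical helper `load_nonempty_words` and an in-memory sample in place of the real file:

```python
import io


def load_nonempty_words(stream):
    """Build the word set in one pass, dropping blank lines as they are read."""
    return {word for word in (line.strip() for line in stream) if word}


# Hypothetical input with blank and whitespace-only lines mixed in.
lines = io.StringIO("shoe\n\ncold\n   \nschooled\n")
print(sorted(load_nonempty_words(lines)))  # ['cold', 'schooled', 'shoe']
```

The inner generator strips each line lazily, so no intermediate list of all lines is ever materialised.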
