Best way to convert whole file to lowercase in C

Question

I was wondering whether there's a really good (performant) way to convert a whole file to lowercase in C. I read each character with fgetc, convert it to lowercase, and write it to a temporary file with fputc. At the end I remove the original and rename the temp file to the original's name. But I think there must be a better solution.

Recommended answer

If you're processing big files (big as in, say, multi-megabytes) and this operation is absolutely speed-critical, then it might make sense to go beyond what you've inquired about. One thing to consider in particular is that a character-by-character operation will perform less well than using SIMD instructions.

For example, with SSE2 you could code a tolower_parallel along these lines (pseudocode):

for (cur_parallel_word = begin_of_block;
     cur_parallel_word < end_of_block;
     cur_parallel_word += parallel_word_width) {
    /*
     * in SSE2, parallel compares are either about 'greater' or 'equal'
     * so '>=' and '<=' have to be constructed. This would use 'PCMPGTB'.
     * The 'ALL' macro is supposed to replicate into all parallel bytes.
     */
    mask1 = parallel_compare_greater_than(*cur_parallel_word, ALL('A' - 1));
    mask2 = parallel_compare_greater_than(ALL('Z'), *cur_parallel_word);
    /*
     * vector op - and all bytes in two vectors, 'PAND'
     */
    mask = mask1 & mask2;
    /*
     * vector op - add a vector of bytes. Would use 'PADDB'.
     */
    new = parallel_add(*cur_parallel_word, ALL('a' - 'A'));
    /*
     * vector op - zero bytes in the original vector that will be replaced
     */
    *cur_parallel_word &= ~mask;           // that'd become 'PANDN'
    /*
     * vector op - extract characters from new that replace old, then or in.
     */
    *cur_parallel_word |= (new & mask);    // PAND / POR
}

I.e. you'd use parallel comparisons to check which bytes are uppercase, and then mask both the original value and the lowercased version (the lowercased one with the mask, the original with its inverse) before you OR them together to form the result.
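
For what it's worth, here is a rough sketch of how that pseudocode might look with real SSE2 intrinsics. The function name ascii_tolower_sse2 is made up for illustration, and it takes a small shortcut: instead of blending old and new bytes with PANDN / POR, it adds 0x20 only in the lanes where the mask is set, which gives the same result. It assumes plain ASCII case rules (bytes >= 0x80 are left alone) and falls back to the scalar loop when SSE2 is not available at compile time.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Lowercase ASCII 'A'..'Z' in buf[0..len); all other bytes are untouched.
 * Hypothetical helper, not taken from any library. */
static void ascii_tolower_sse2(unsigned char *buf, size_t len)
{
    size_t i = 0;

#if defined(__SSE2__)
    const __m128i below_A  = _mm_set1_epi8('A' - 1);
    const __m128i above_Z  = _mm_set1_epi8('Z' + 1);
    const __m128i case_bit = _mm_set1_epi8('a' - 'A');   /* 0x20 */

    /* 16 bytes per iteration; unaligned loads/stores keep it simple. */
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        /* PCMPGTB is a signed compare, so bytes >= 0x80 count as
         * negative and fail the first test. */
        __m128i ge_A     = _mm_cmpgt_epi8(v, below_A);
        __m128i le_Z     = _mm_cmpgt_epi8(above_Z, v);
        __m128i is_upper = _mm_and_si128(ge_A, le_Z);     /* PAND */
        /* PADDB: add 0x20 where the mask is all ones, 0 elsewhere. */
        v = _mm_add_epi8(v, _mm_and_si128(is_upper, case_bit));
        _mm_storeu_si128((__m128i *)(buf + i), v);
    }
#endif

    /* Scalar tail (or the whole buffer when SSE2 is unavailable). */
    for (; i < len; i++) {
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] = (unsigned char)(buf[i] + ('a' - 'A'));
    }
}
```

You would call it on each block you read, e.g. ascii_tolower_sse2(block, bytes_read), before writing the block back out. Whether it actually beats a plain loop depends on your compiler, which may well auto-vectorize the scalar version, so measure before committing to it.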

If you use mmap'ed file access, this could even be performed in place, saving on the bounce buffer and on many function and/or system calls.
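
A minimal sketch of the in-place mmap idea, assuming a POSIX system and a regular file that fits in the address space. The name lowercase_file_in_place is hypothetical, and the inner loop is the plain scalar version to keep the example short; a vectorized loop could be dropped in its place.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Lowercase ASCII letters of the file at `path` in place.
 * Returns 0 on success, -1 on error. Hypothetical helper. */
static int lowercase_file_in_place(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {           /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    for (size_t i = 0; i < len; i++) {
        if (p[i] >= 'A' && p[i] <= 'Z')
            p[i] = (unsigned char)(p[i] + ('a' - 'A'));
    }

    int rc = msync(p, len, MS_SYNC); /* flush the modified pages to the file */
    if (munmap(p, len) != 0)
        rc = -1;
    return rc;
}
```

One thing to keep in mind: unlike the temp-file-and-rename approach from the question, this modifies the original directly, so a crash halfway through leaves a partly converted file.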

There is a lot to optimize when your starting point is a character-by-character 'fgetc' / 'fputc' loop; even shell utilities are highly likely to perform better than that.
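
Even without SIMD or mmap, simply moving from one call per character to one call per block is likely to help. A small sketch, assuming a 64 KiB chunk size (an arbitrary choice, not a tuned value) and a made-up function name lowercase_stream:

```c
#include <ctype.h>
#include <stdio.h>

enum { CHUNK_SIZE = 64 * 1024 };     /* arbitrary block size for illustration */

/* Copy `in` to `out`, lowercasing every byte with tolower().
 * Returns 0 on success, -1 on a read or write error.
 * Uses a static buffer, so it is not reentrant. */
static int lowercase_stream(FILE *in, FILE *out)
{
    static unsigned char buf[CHUNK_SIZE];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (size_t i = 0; i < n; i++)
            buf[i] = (unsigned char)tolower(buf[i]);
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

This slots straight into the workflow from the question: open the original for reading and the temp file for writing, call lowercase_stream(in, out), then do the remove/rename as before. Note that tolower() follows the current locale; in the default "C" locale it only touches 'A'..'Z'.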

But I agree that if your need is very special-purpose (i.e. something as clear-cut as ASCII input to be converted to lowercase), then a handcrafted loop as above, using vector instruction sets (like SSE intrinsics/assembly, or ARM NEON, or PPC AltiVec), is likely to give a significant speedup over existing general-purpose utilities.
