How can I trim empty/whitespace lines?

Question

I have to process badly mismanaged text with creative indentation. I want to remove the empty (or whitespace-only) lines at the beginning and end of my text without touching anything else: if the first or last actual line begins or ends with whitespace, that whitespace stays.

For example:

<lines, empty or with whitespaces ...>
<text, maybe preceded by whitespace>
<lines with or without text...>
<text, maybe followed by whitespace>
<lines, empty or with whitespaces ...>

turns into

<text, maybe preceded by whitespace>
<lines with or without text...>
<text, maybe followed by whitespace>

preserving the spaces at the beginning and the end of the actual text lines (the text might also be entirely whitespace).

A regex replacing (\A\s*(\r\n|\Z)|\r\n\s*\Z) with nothing does exactly what I want, but a regex is kind of overkill, and I fear it might cost me some time when processing texts with a lot of lines but not much to trim.

On the other hand, an explicit algorithm is easy to write (read until a non-whitespace character or the end while remembering the last line feed, truncate there, then do the same backwards), but it feels like I'm missing something obvious.

How should I do this?

Recommended answer

As you can see from this discussion, trimming whitespace requires a lot of work in C++. This should definitely be included in the standard library.

Anyway, I've checked how to do it as simply as possible, but nothing comes close to the compactness of a regex. Speed is a different story.

Below you can find three versions of a program that does the required task: with a regex, with std functions, and with just a couple of indexes. The last one could be made even faster, because you can avoid copying altogether, but I kept the copy for a fair comparison:

#include <string>
#include <sstream>
#include <chrono>
#include <iostream>
#include <regex>
#include <exception>

struct perf {
    std::chrono::steady_clock::time_point start_;
    perf() : start_(std::chrono::steady_clock::now()) {}
    double elapsed() const {
        auto stop = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed_seconds = stop - start_;
        return elapsed_seconds.count();
    }
};

std::string Generate(size_t line_len, size_t empty, size_t nonempty) {
    std::string es(line_len, ' ');
    es += '\n';
    for (size_t i = 0; i < empty; ++i) {
        es += es;
    }

    std::string nes(line_len - 1, ' ');
    nes += "a\n"; // non-empty line: spaces followed by one visible character
    for (size_t i = 0; i < nonempty; ++i) {
        nes += nes;
    }

    return es + nes + es;
}


int main()
{
    std::string test;
    //test = "  \n\t\n  \n  \tTEST\n\tTEST\n\t\t\n  TEST\t\n   \t\n \n  ";
    std::cout << "Generating...";
    std::cout.flush();
    test = Generate(1000, 8, 10);
    std::cout << " done." << std::endl;

    std::cout << "Test 1...";
    std::cout.flush();
    perf p1;
    std::string out1;
    std::regex re(R"(^\s*\n|\n\s*$)");
    try {
        out1 = std::regex_replace(test, re, "");
    }
    catch (std::exception& e) {
        std::cout << e.what() << std::endl;
    }
    std::cout << " done. Elapsed time: " << p1.elapsed() << "s" << std::endl;

    std::cout << "Test 2...";
    std::cout.flush();
    perf p2;
    std::stringstream is(test);
    std::string line;
    while (std::getline(is, line) && line.find_first_not_of(" \t\n\v\f\r") == std::string::npos);
    std::string out2 = line;
    size_t end = out2.size();
    while (std::getline(is, line)) {
        out2 += '\n';
        out2 += line;
        if (line.find_first_not_of(" \t\n\v\f\r") != std::string::npos) {
            end = out2.size();
        }
    }
    out2.resize(end);
    std::cout << " done. Elapsed time: " << p2.elapsed() << "s" << std::endl;

    if (out1 == out2) {
        std::cout << "out1 == out2\n";
    }
    else {
        std::cout << "out1 != out2\n";
    }

    std::cout << "Test 3...";
    std::cout.flush();
    perf p3;
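    // Lookup table indexed by byte value. Despite the name, 1 marks a
    // NON-whitespace byte; 0 marks whitespace (\t \n \v \f \r at 9-13, space at 32).
    static bool whitespace_table[] = {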
        1,1,1,1,1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    };
    size_t sfl = 0; // Start of first line
    for (size_t i = 0, end = test.size(); i < end; ++i) {
        if (test[i] == '\n') {
            sfl = i + 1;
        }
        else if (whitespace_table[(unsigned char)test[i]]) {
            break;
        }
    }
    size_t ell = test.size(); // End of last line
    for (size_t i = test.size(); i-- > 0;) {
        if (test[i] == '\n') {
            ell = i;
        }
        else if (whitespace_table[(unsigned char)test[i]]) {
            break;
        }
    }
    std::string out3 = test.substr(sfl, ell - sfl);
    std::cout << " done. Elapsed time: " << p3.elapsed() << "s" << std::endl;

    if (out1 == out3) {
        std::cout << "out1 == out3\n";
    }
    else {
        std::cout << "out1 != out3\n";
    }

    return 0;
}

Running it on C++ Shell gives these timings:

Generating... done.
Test 1... done. Elapsed time: 4.2288s
Test 2... done. Elapsed time: 0.0077323s
out1 == out2
Test 3... done. Elapsed time: 0.000695783s
out1 == out3
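
The remark above about avoiding the copy could be sketched roughly as follows. This is only an illustration, assuming C++17 is available (for std::string_view). The helper name trim_blank_lines is made up here and is not part of the program above. It treats the same six characters as Test 2 as whitespace and, as an assumption borrowed from the regex in the question, returns an empty view when the text is entirely whitespace.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not from the original program): returns a view of
// `text` without its leading and trailing whitespace-only lines.
// No characters are copied; the result points into `text`.
std::string_view trim_blank_lines(std::string_view text) {
    constexpr std::string_view ws = " \t\n\v\f\r";

    const std::size_t first = text.find_first_not_of(ws);
    if (first == std::string_view::npos) {
        // Assumption: an all-whitespace text trims to nothing.
        return std::string_view{};
    }
    const std::size_t last = text.find_last_not_of(ws);

    // The first kept line starts right after the last '\n' that precedes
    // the first visible character (or at 0 if there is none).
    const std::size_t nl_before = text.rfind('\n', first);
    const std::size_t begin =
        (nl_before == std::string_view::npos) ? 0 : nl_before + 1;

    // The last kept line ends at the first '\n' that follows the last
    // visible character (or at the end of the text if there is none).
    const std::size_t nl_after = text.find('\n', last);
    const std::size_t end =
        (nl_after == std::string_view::npos) ? text.size() : nl_after;

    return text.substr(begin, end - begin);
}
```

The returned view is only valid while the original string is alive and unmodified. With \r\n line endings, the '\r' just before the cut stays in the result, which is also how the index-based Test 3 behaves.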

If performance is important, it's better to test with the real files.
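
One possible way to time a trimming routine on a real file is sketched below. Both helper names, read_file and time_trim, are hypothetical and chosen only for this illustration. The file is opened in binary mode on the assumption that \r\n sequences should reach the trimming code unchanged, which matters for the regex in the question.

```cpp
#include <chrono>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads the whole file at `path` into a string.
// Binary mode keeps "\r\n" as two characters on every platform.
std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>{});
}

// Hypothetical helper: runs trim(text) once and returns the elapsed
// wall-clock time in seconds. `trim` can be any callable taking the text.
template <class Trim>
double time_trim(const std::string& text, Trim trim) {
    const auto start = std::chrono::steady_clock::now();
    const auto result = trim(text);
    const auto stop = std::chrono::steady_clock::now();
    (void)result; // the result itself is not needed for the timing
    return std::chrono::duration<double>(stop - start).count();
}
```

A single run is noisy, so repeating the call several times and comparing the fastest or median time gives a more reliable picture.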

As a side note, this regex doesn't work on MSVC, because I couldn't find a way to prevent ^ and $ from matching at the start and end of every line, that is, to disable multiline mode. If you run it there, it throws an exception saying regex_error(error_complexity): The complexity of an attempted match against a regular expression exceeded a pre-set level. I think I'll ask how to cope with this!
