Perl Regular Expression for extracting multi-line LaTeX chapter name

Views: 211 Published: 2020/4/29 3:59:45 Tags: regex perl latex


Question

I am having a hard time figuring out how to perform a regex substitution to clean up some text in a LaTeX file. The LaTeX file looks like

\chapter{\texorpdfstring{{II} {The Chapter 
Title}}{II The Chapter Title}}

Annoyingly, this is a multi-line chapter declaration, and the line break can occur virtually anywhere. I can't use the common <> idiom to read the file line by line and apply a straightforward regular expression.

Instead, I am trying:

#!/usr/bin/perl -i.old     # In-place edit, backup as '.old'
use strict;
use warnings;

use Path::Tiny;

my $filename = shift or die "Usage: $0 FILENAME";
my $content = path($filename)->slurp_utf8;

$content =~ s|\\chapter\{.*\{[IVXLCDM]*\s*(.*)\}\}|\\chapter{$1}|gms;
path($filename)->spew_utf8($content);

However, the regular expression is far too greedy: it begins a match at the first \chapter declaration and ends it at the last \chapter declaration. All I want is to

remove \texorpdfstring,
remove the Roman numeral, and
remove the repeated copy of the chapter title,

so that the input

\chapter{\texorpdfstring{{I} {The First 
Chapter}}{I The First Chapter}}

It was the best of times.

\chapter{\texorpdfstring{{II} {The Second 
Chapter}}{II The Second Chapter}}

It was the worst of times.

produces

\chapter{The First Chapter}

It was the best of times.

\chapter{The Second Chapter}

It was the worst of times.

What should I do now?

I changed the demonstration text.

If I understood @zdim correctly, he wrote the substitution without escaping the braces {} to make it easier to validate. Fair enough. I tried @zdim's solution, but it output:

\chapter{The First
Chapter}

It was the worst of times.

Recommended answer

If the braces {...} can only appear as shown:

s/\\chapter{\\texorpdfstring{{ .*? }\s*{ (.*?) }}\s*{.*?}}/\\chapter{$1}/gsx;

or

s/(\\chapter){\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/${1}{$2}/gs;

where ${1} (for $1) is needed for syntax, as $1{... would be interpreted as an element of the hash %1.
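To see this form at work, here is a minimal sketch that applies it to the sample text from the question, held in a string instead of being read from a file. The variable names are illustrative. The literal braces in the pattern are escaped here because recent Perl versions warn about, or in some contexts reject, an unescaped { in a pattern.

```perl
use strict;
use warnings;

# Sample input from the question: each title is split across two lines.
my $content = <<'END';
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}

It was the best of times.

\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}

It was the worst of times.
END

# /s lets . match a newline, so a title may span lines.
# Each non-greedy .*? stops at the first place the rest of the pattern can
# match, so one match covers one \chapter declaration, not several.
(my $cleaned = $content)
    =~ s/(\\chapter)\{\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/${1}{$2}/gs;

print $cleaned;
```

Both declarations are rewritten and the paragraphs between them are left alone. The line break inside each title is kept as it was in the input.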

or, rather

s/\\chapter\K{\s*\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/{$1}/gs

where \K, a form of lookbehind, keeps everything matched before it out of the match, so that text stays in place and is not replaced. I still leave the { to be retyped, for a possibly clearer replacement part.
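As a small illustration of what \K does, the sketch below runs the \K form and the capture-and-reinsert form on the same one-line declaration. The variable names are made up for the example, the braces are escaped to stay safe on recent Perl versions, and \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $line = '\chapter{\texorpdfstring{{II} {The Second Chapter}}{II The Second Chapter}}';

# With \K, the text matched so far (\chapter) is excluded from the match,
# so it is not replaced and does not have to be captured or retyped.
(my $with_k = $line)
    =~ s/\\chapter\K\{\s*\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/{$1}/gs;

# The same substitution written with an explicit capture of \chapter.
(my $with_capture = $line)
    =~ s/(\\chapter)\{\s*\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/${1}{$2}/gs;

print "$with_k\n";         # \chapter{The Second Chapter}
print "$with_capture\n";   # \chapter{The Second Chapter}
```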

Please sprinkle this with \s* wherever there may be spaces.
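One way this might look is sketched below, with \s* allowed between all the pieces and the /x modifier used to lay the pattern out readably. The last step, which squeezes the line break inside the title into a single space, is an addition that is not part of the answer above; it uses /e and the /r modifier (Perl 5.14 or later). The names $chapter_re and clean_chapters are hypothetical.

```perl
use strict;
use warnings;

# /x permits layout and comments inside the pattern; /s lets . match newlines.
my $chapter_re = qr/
    \\chapter \s* \{ \s*
        \\texorpdfstring \s*
        \{ \s* \{ .*? \} \s* \{ (.*?) \} \s* \}   # numeral group, then the captured title group
        \s* \{ .*? \} \s*                         # plain-text variant of the title
    \}
/xs;

# Hypothetical helper: rewrite every matching \chapter declaration in $text.
sub clean_chapters {
    my ($text) = @_;
    # /e evaluates the replacement as code. The inner s///r returns a copy of
    # the title with each run of whitespace replaced by one space.
    $text =~ s/$chapter_re/'\chapter{' . ($1 =~ s{\s+}{ }gr) . '}'/ge;
    return $text;
}

print clean_chapters("\\chapter{\\texorpdfstring{{I} {The First \nChapter}}{I The First Chapter}}\n");
```

Since the pattern still relies on non-greedy matching, it assumes the braces appear only in the arrangement shown in the question.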

Also note Path::Tiny's edit_utf8

path($filename)->edit_utf8( sub { s/.../.../gs } );  # regex as above

which applies the anonymous sub to the whole slurped file, as opposed to edit_lines, which applies it line by line.

If the braced expressions can be nested more freely (say with {\em ... } and such), a far more systematic approach is needed. See for example Text::Balanced, and search for "nested delimiters".
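As a rough sketch of that direction, the code below uses extract_bracketed from Text::Balanced (which ships with Perl) to peel off one balanced {...} group at a time, so a title that itself contains braces is handled. The helpers next_group and chapter_title are hypothetical names, and the code assumes the declaration has the same overall shape as in the question.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Return the contents of the balanced {...} group at the start of $text
# (leading whitespace is skipped by default), plus the text after the group.
# Returns an empty list if no balanced group starts there.
sub next_group {
    my ($text) = @_;
    my ($group, $rest) = extract_bracketed($text, '{}');
    return unless defined $group;
    return (substr($group, 1, -1), $rest);    # drop the outer braces
}

# Hypothetical helper: extract the title from a single \chapter declaration.
sub chapter_title {
    my ($decl) = @_;
    $decl =~ s/\A\s*\\chapter//           or return;
    my ($arg)        = next_group($decl)  or return;   # \texorpdfstring{..}{..}
    $arg =~ s/\A\s*\\texorpdfstring//     or return;
    my ($tex)        = next_group($arg)   or return;   # {numeral} {title}
    my ($num, $rest) = next_group($tex)   or return;   # numeral is discarded
    my ($title)      = next_group($rest)  or return;
    $title =~ s/\s+/ /g;                               # join a title split across lines
    return $title;
}

my $decl = "\\chapter{\\texorpdfstring{{II} {The \\emph{Second}\nChapter}}{II The Second Chapter}}";
print chapter_title($decl), "\n";
```

Unlike the non-greedy patterns, this does not stop at the first closing brace it sees, so the nested \emph{...} stays inside the title.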

Some regex resources

Perl documentation

perlretut, a tutorial

perlrequick, a quick start

perlre, the full syntax description

perlreref, a quick reference (its See Also section is useful on its own)

Stack Overflow

Regex info: an entry portal with resources

Reference: What does this regex mean? A gargantuan list of FAQs with links to SO posts

Learning Regular expressions: an overview with a long list of resources at the end

Regular-Expressions.info
