Perl Regular Expression for extracting multi-line LaTeX chapter name
Question
I am having a hard time figuring out how to perform a regex substitution to clean up some text in a LaTeX file. The LaTeX file looks like
\chapter{\texorpdfstring{{II} {The Chapter
Title}}{II The Chapter Title}}
Annoyingly, this is a multi-line chapter declaration, and the newline can occur virtually anywhere. I can't use the common <> idioms to just read the file line by line and apply a straightforward regular expression.
Instead, I am trying:
#!/usr/bin/perl -i.old # In-place edit, backup as '.old'
use strict;
use warnings;
use Path::Tiny;
my $filename = shift or die "Usage: $0 FILENAME";
my $content = path($filename)->slurp_utf8;
$content =~ s|\\chapter\{.*\{[IVXLCDM]*\s*(.*)\}\}|\\chapter{$1}|gms;
path($filename)->spew_utf8($content);
However, the regular expression is far too greedy: it begins a match at the first \chapter declaration and ends it at the last \chapter declaration. All I want is to
- remove \texorpdfstring
- remove the Roman numerals
- remove the multiple occurrences of the chapter title
so that the input
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}
It was the best of times.
\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}
It was the worst of times.
yields
\chapter{The First Chapter}
It was the best of times.
\chapter{The Second Chapter}
It was the worst of times.
How should I go about this?
Edit: I changed the demonstration text.
If I understood @zdim correctly, he wrote down the substitution without escaping the braces {}, to make it easier to validate. Fair enough. I tried @zdim's solution but it output:
\chapter{The First
Chapter}
It was the worst of times.
Recommended answer
If the braces {...} can only appear as shown:
s/\\chapter{\\texorpdfstring{{ .*? }\s*{ (.*?) }}\s*{.*?}}/\\chapter{$1}/gsx;
or
s/(\\chapter){\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/${1}{$2}/gs;
where ${1} (for $1) is needed for syntax, as $1{... would be interpreted as an element of the hash %1.
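To try the second substitution on the question's sample, a minimal self-contained sketch might look like the following. The heredoc input is copied from the question. The braces are escaped here, which is my choice rather than the answer's: unescaped literal left braces in a pattern are deprecated or rejected in some contexts by newer perls.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input taken from the question.
my $content = <<'END';
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}
It was the best of times.
\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}
It was the worst of times.
END

# ${1} rather than $1, so that "$1{" is not parsed as an element of the hash %1.
# /s lets . match newlines; the non-greedy .*? stops at the first closing brace.
$content =~ s/(\\chapter)\{\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/${1}{$2}/gs;

print $content;
```

Each \chapter line is reduced to \chapter{...} with only the title inside, but a newline that sat inside the captured title stays in the result.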
Or, rather
s/\\chapter\K{\s*\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/{$1}/gs
where the \K form of lookbehind drops everything matched before it from the match, so \chapter is left untouched and need not be repeated in the replacement. I still leave { to be retyped, for a possibly clearer replacement part.
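The \K variant can be exercised the same way. The sketch below also collapses the newline inside the captured title, since the desired output in the question shows each title on one line. That step, the helper name collapse, and the escaped braces are my additions and not part of the answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input taken from the question.
my $content = <<'END';
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}
It was the best of times.
\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}
It was the worst of times.
END

# \K keeps "\chapter" out of the replaced text; /s lets . match newlines.
my $re = qr/\\chapter\K\{\s*\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/s;

# /e evaluates the replacement as code, so the title can be cleaned up first.
$content =~ s/$re/ '{' . collapse($1) . '}' /ge;

print $content;

# Hypothetical helper: replace each run of whitespace (newlines included)
# with a single space, working on a copy of the argument.
sub collapse {
    my ($title) = @_;
    $title =~ s{\s+}{ }g;
    return $title;
}
```

On the sample this gives \chapter{The First Chapter} and \chapter{The Second Chapter}, with the body lines unchanged.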
Please sprinkle this with \s* where there may be spaces.
Also note Path::Tiny::edit_utf8
path($filename)->edit_utf8( sub { s/.../.../gs } ); # regex as above
which applies the anonymous sub to the slurped file as a whole, as opposed to edit_lines, which processes it line by line.
If the braced expressions can be nested more freely (say with {\em ... } and such), a far more systematic approach is needed. See for example Text::Balanced and search for "nested delimiters."
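As a rough illustration of that more systematic route, the sketch below uses extract_bracketed from Text::Balanced to peel off balanced {...} groups one level at a time. The helper groups and the sample string (with an added {\em ...}) are hypothetical, and the code assumes the text handed to it starts right after \chapter. Locating each \chapter in a whole file and splicing the result back is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: return the consecutive top-level {...} groups found at
# the start of $s, each with its outer pair of braces removed.
sub groups {
    my ($s) = @_;
    my @groups;
    while (1) {
        # In list context the input is left unmodified; we get back the
        # extracted group (with delimiters) and the remaining text.
        my ($group, $rest) = extract_bracketed($s, '{}');
        last unless defined $group && length $group;
        push @groups, substr($group, 1, -1);
        $s = $rest;
    }
    return @groups;
}

# Text following "\chapter", with a nested {\em ...} inside the title.
my $after = '{\texorpdfstring{{II} {The {\em Second}' . "\n"
          . 'Chapter}}{II The Second Chapter}}';

my ($chapter_arg) = groups($after);            # \texorpdfstring{...}{...}
$chapter_arg =~ s/^\s*\\texorpdfstring//;      # drop the macro name
my ($tex_arg, $pdf_arg) = groups($chapter_arg);
my ($numeral, $title)   = groups($tex_arg);    # "II" and the title proper
$title =~ s/\s+/ /g;                           # join the title onto one line

print "\\chapter{$title}\n";
```

Because the braces are matched by counting nesting instead of by a non-greedy .*?, the inner {\em Second} group survives intact in the extracted title.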
Some regex resources
Perl documentation
- perlretut, a tutorial
- perlrequick, a quick-start introduction
- perlre, the full syntax description
- perlreref, a quick reference (its See Also section is useful on its own)
Stackoverflow
- Regex info: An entry portal with resources
- Reference: What does this regex mean? A gargantuan list of FAQs with links to SO posts
- Learning Regular expressions: An overview with a long list of resources at the end