Perl Regular Expression for extracting multi-line LaTeX chapter name

Question

I am having a hard time figuring out how to perform a regex substitution to clean up some text in a LaTeX file. The LaTeX file looks like

\chapter{\texorpdfstring{{II} {The Chapter 
Title}}{II The Chapter Title}}

Annoyingly, this is a multi-line chapter declaration, and the newline can occur virtually anywhere. I can't use the common <> idiom to read the file line by line and apply a straightforward regular expression.

Instead, I am trying:

#!/usr/bin/perl -i.old     # In-place edit, backup as '.old'
use strict;
use warnings;

use Path::Tiny;

my $filename = shift or die "Usage: $0 FILENAME";
my $content = path($filename)->slurp_utf8;

$content =~ s|\\chapter\{.*\{[IVXLCDM]*\s*(.*)\}\}|\\chapter{$1}|gms;
path($filename)->spew_utf8($content);

However, the regular expression is far too greedy, and begins a match at the first \chapter declaration and ends it at the last chapter declaration. All I want is to

  1. Delete the \texorpdfstring.
  2. Delete the Roman numeral.
  3. Delete the repeated occurrences of the chapter title.

so that the input

\chapter{\texorpdfstring{{I} {The First 
Chapter}}{I The First Chapter}}

It was the best of times.

\chapter{\texorpdfstring{{II} {The Second 
Chapter}}{II The Second Chapter}}

It was the worst of times.

produces

\chapter{The First Chapter}

It was the best of times.

\chapter{The Second Chapter}

It was the worst of times.

How should I go about this?

I have changed the sample text.

If I understood @zdim correctly, he wrote down the substitution without escaping the braces {}, to make it easier to validate. Fair enough. I tried @zdim's solution, but it output:

\chapter{The First
Chapter}

It was the worst of times.

Answer

If the {...} groups can only appear as shown:

s/\\chapter{\\texorpdfstring{{ .*? }\s*{ (.*?) }}\s*{.*?}}/\\chapter{$1}/gsx;

s/(\\chapter){\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/${1}{$2}/gs;

where ${1} (rather than $1) is needed syntactically, as $1{... would be interpreted as an element of the hash %1.
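To see the ${1} point in isolation, here is a minimal sketch. The sample line is a shortened, made-up variant of the question's input, and the braces in the pattern are escaped, which the substitution above does not do.

```perl
use strict;
use warnings;

# Shortened sample input (an assumption, not the asker's real file).
my $line = '\chapter{\texorpdfstring{{II} {Title}}{II Title}}';

# ${1} ends the variable name explicitly, so the "{" that follows is
# literal replacement text rather than the start of a hash subscript.
(my $fixed = $line) =~ s/(\\chapter)\{\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/${1}{$2}/s;

print "$fixed\n";
```

This should print \chapter{Title}.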

Or, rather

s/\\chapter\K{\s*\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/{$1}/gs

where \K, a form of lookbehind, drops everything matched before it from the match, so only the text after it is replaced. I still leave the { to be retyped, for a possibly clearer replacement part.
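A small illustration of \K on its own, using a made-up \section* line rather than the question's input (\K needs Perl 5.10 or later):

```perl
use strict;
use warnings;

# Hypothetical input, chosen only to show what \K keeps.
my $line = '\section*{Intro}';

# "\section" must match, but \K keeps it out of the replaced text,
# so only the "*" is substituted (here, with nothing).
(my $plain = $line) =~ s/\\section\K\*//;

print "$plain\n";
```

Here the output should be \section{Intro}: the command name is required for the match but is left untouched.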

Please sprinkle this with \s* where there may be spaces.

Also note Path::Tiny's edit_utf8

path($filename)->edit_utf8( sub { s/.../.../gs } );  # regex as above

which applies the anonymous sub to the slurped file, as opposed to edit_lines.
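As a sketch of how the pieces above might be combined (this is not the answer's own code), the version below changes three things:

  • It escapes every literal brace, since some Perl releases warn about or reject an unescaped { in a pattern.
  • It allows optional whitespace between the groups.
  • It uses [^{}]* instead of .*? so that no group can run past its own closing brace.

It also joins a title that was broken across lines, which the expected output in the question calls for. The names squeeze and clean_chapters are made up for illustration, and the pattern assumes there are no braces nested inside the numeral or title groups.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of whitespace (including newlines)
# into single spaces and trim both ends.
sub squeeze {
    my ($s) = @_;
    $s =~ s/\s+/ /g;
    $s =~ s/^ //;
    $s =~ s/ $//;
    return $s;
}

# Hypothetical helper: rewrite each
#   \chapter{\texorpdfstring{{NUM} {TITLE}}{PLAIN}}
# as \chapter{TITLE}.
sub clean_chapters {
    my ($text) = @_;
    $text =~ s/
        \\chapter \s* \{ \s*
        \\texorpdfstring \s* \{ \s*
            \{ [^{}]* \} \s*          # numeral group, dropped
            \{ ( [^{}]* ) \} \s*      # title group, captured
        \} \s*
        \{ [^{}]* \} \s*              # plain-text copy of the title, dropped
        \}
    /'\\chapter{' . squeeze($1) . '}'/gex;
    return $text;
}

my $sample = <<'END';
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}

It was the best of times.
END

print clean_chapters($sample);
```

With Path::Tiny, the same helper could presumably be applied in place with path($filename)->edit_utf8( sub { $_ = clean_chapters($_) } );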

If the braced expressions can be nested more freely (say with {\em ... } and such), a far more systematic approach is needed. See for example Text::Balanced and search for "nested delimiters."
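For the nested case, a rough sketch of what a Text::Balanced approach could look like. It assumes extract_bracketed is imported from the core Text::Balanced module. The helper name chapter_argument and the sample string are invented for illustration, and the code only pulls out the balanced argument of the first \chapter, leaving the actual rewriting aside.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: return the balanced {...} group that follows the
# first \chapter in the text, or nothing if there is no such group.
sub chapter_argument {
    my ($text) = @_;
    return unless $text =~ /\\chapter\s*(\{.*)/s;
    my $rest = $1;    # copy: everything from the opening "{" onwards
    # In list context the first element is the extracted group,
    # delimiters included; nested braces are counted.
    my ($group) = extract_bracketed($rest, '{}');
    return $group;
}

my $sample = '\chapter{\texorpdfstring{{II} {The {\em Chapter} Title}}'
           . '{II The Chapter Title}} It was the worst of times.';

print chapter_argument($sample), "\n";
```

The extracted group can then be taken apart one level at a time in the same way, rather than relying on a single pattern to count braces.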

Some regex resources

Perl documentation

  • perlrequick, a quick start
  • perlre, the full syntax description
  • perlreref, a quick reference (its See Also section is useful on its own)

Stack Overflow

  • Regex info: an entry portal with resources
  • Reference: What does this regex mean? A gargantuan list of FAQs with links to SO posts
  • Learning Regular expressions: an overview with a long list of resources at the end

Regular-Expressions.info
