How to fix broken lines with numbers in the middle in an OCR table of contents?


Problem description

There are some broken lines in an OCR table of contents, which may or may not have a number after \t and before \n.

Input:

    9.1 The Euclidean Group in Two-Dimensional  152
    Space E2
CHAPTER 10: THE LORENTZ AND POINCARÉ GROUPS,    
    AND SPACE-TIME SYMMETRIES   173

If a number is sandwiched between two letters (152 in the example), it is the page number of the previous section and should be deleted. If it is followed by another number (the number of the next section), it is the correct page number (173 here) and should be kept. Here is the desired output:

    9.1 The Euclidean Group in Two-Dimensional Space E2
CHAPTER 10: THE LORENTZ AND POINCARÉ GROUPS, AND SPACE-TIME SYMMETRIES  173

My attempt:

([a-zA-Z])(\t[0-9]*\n\t)((?![P])[A-Z])

But Notepad++ keeps saying it can't find the text, even though the pattern works fine in https://www.regextester.com. How can I fix these lines?

Recommended answer

You can use

(\S)\t[0-9]*\R\t+

and replace with $1 (the Group 1 value placeholder). Add a space after $1 if the joined parts should stay separated by a space, as in the desired output.

Details

  • (\S) - Group 1: any non-whitespace character
  • \t - a tab
  • [0-9]* - zero or more digits
  • \R - a line break sequence
  • \t+ - one or more tabs (or \h+ for one or more horizontal whitespace characters)
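
For checking the pattern outside Notepad++, here is a minimal Python sketch of the same replacement. It rests on a few assumptions that are not in the original answer: Python's re module has no \R, so the line break is spelled out as (?:\r\n|\n|\r); the replacement adds a space after Group 1 so that the output matches the desired output in the question; and the function name join_broken_lines and the sample string (with real tab characters) are made up for illustration.

```python
import re

# Python's re module does not support \R, so the line break sequence is
# written out explicitly. Otherwise this mirrors (\S)\t[0-9]*\R\t+.
BROKEN_LINE = re.compile(r"(\S)\t[0-9]*(?:\r\n|\n|\r)\t+")


def join_broken_lines(toc: str) -> str:
    """Join a wrapped TOC entry with its continuation line.

    Keeps the character before the tab (Group 1), drops the optional
    digits and the line break, and puts one space where the break was.
    """
    return BROKEN_LINE.sub(r"\1 ", toc)


# Sample input from the question, with tabs written as \t.
sample = (
    "\t9.1 The Euclidean Group in Two-Dimensional\t152\n"
    "\tSpace E2\n"
    "CHAPTER 10: THE LORENTZ AND POINCARÉ GROUPS,\t\n"
    "\tAND SPACE-TIME SYMMETRIES\t173\n"
)

print(join_broken_lines(sample))
```

Because the line break alternative also covers \r\n, the same call works on text with Windows line endings, which is the case \R handles in the Notepad++ pattern.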

