Programmatically converting/parsing LaTeX code to plain text


Question

I have a couple of code projects in C++/Python in which LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks. However, we also have some plain text outputs, such as an HTML version of the documentation (I already have code to write minimal markup for that) and a non-TeX-enabled plot renderer.

For these I would like to eliminate the TeX markup that is necessary for e.g. representing physical units. This includes non-breaking (thin) spaces, \text, \mathrm etc. It would also be nice to parse down things like \frac{#1}{#2} into #1/#2 for the plain text output (and use MathJax for the HTML). Due to the system that we've got at the moment, I need to be able to do this from Python, i.e. ideally I'm looking for a Python package, but a non-Python executable which I can call from Python and catch the output string would also be fine.
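
For the simple, un-nested cases described above, a minimal regex sketch might look like the following. The function name `strip_tex` and the list of wrapper macros are illustrative assumptions, not part of any existing package, and the approach only works while macro arguments contain no further braces.

```python
import re

# Hypothetical patterns; the macro list is an assumption and would need extending.
_WRAPPERS = re.compile(r"\\(?:text|mathrm|textrm|mathit)\{([^{}]*)\}")
_FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")
_SPACES = re.compile(r"\\[,;:]|~")  # thin/medium/thick spaces and the non-breaking tie


def strip_tex(s):
    """Reduce a small subset of TeX markup to plain text."""
    prev = None
    while prev != s:  # repeat until stable, so innermost groups are resolved first
        prev = s
        s = _WRAPPERS.sub(r"\1", s)   # \mathrm{GeV} -> GeV
        s = _FRAC.sub(r"\1/\2", s)    # \frac{a}{b}  -> a/b
    s = _SPACES.sub(" ", s)
    s = s.replace("$", "")            # drop inline math delimiters
    return re.sub(r" {2,}", " ", s).strip()


print(strip_tex(r"$10\,\mathrm{GeV}$"))
print(strip_tex(r"\frac{\text{events}}{\mathrm{GeV}}"))
```

Anything this sketch does not recognise (for example `\sigma`) is left in the output unchanged, which may or may not be acceptable for a plain-text label.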

I'm aware of the similar question on the TeX StackExchange site, but there weren't any really programmatic solutions there. I've looked at detex, plasTeX and pytex, which all seem a bit dead and don't really do what I need: programmatic conversion of a TeX string to a representative plain text string.

I could try writing a basic TeX parser using e.g. pyparsing, but a) that might be pitfall-laden and help would be appreciated and b) surely someone has tried that before, or knows of a way to hook into TeX itself to get a better result?

Update: Thanks for all the answers... it does indeed seem to be a bit of an awkward request! I can make do with less than general parsing of LaTeX, but the reason for considering a parser rather than a load of regexes in a loop is that I want to be able to handle nested macros and multi-arg macros nicely, and get the brace matching to work properly. Then I can e.g. reduce txt-irrelevant macros like \text and \mathrm first, and handle txt-relevant ones like \frac last... maybe even with appropriate parentheses! Well, I can dream... for now regexes are not doing such a terrible job.
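
To illustrate what brace matching buys over regexes, here is a rough sketch of a tiny recursive converter. It is not pyparsing and not a general TeX parser; `RULES`, `read_group` and `to_plain` are hypothetical names, and escaped braces such as `\{` are not handled.

```python
def _paren(s):
    """Parenthesise anything that is not a single alphanumeric token."""
    return s if s.isalnum() else f"({s})"


# Hypothetical rule table: macro name -> (number of brace arguments, formatter).
RULES = {
    "text": (1, lambda a: a),
    "mathrm": (1, lambda a: a),
    "frac": (2, lambda n, d: f"{_paren(n)}/{_paren(d)}"),
}


def read_group(s, i):
    """Return (contents, index after closing brace) for the group opening at s[i]."""
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "{":
            depth += 1
        elif s[j] == "}":
            depth -= 1
            if depth == 0:
                return s[i + 1:j], j + 1
    raise ValueError("unbalanced braces")


def to_plain(s):
    out, i = [], 0
    while i < len(s):
        c = s[i]
        if c == "{":  # bare group: convert its contents
            body, i = read_group(s, i)
            out.append(to_plain(body))
        elif c == "\\":
            j = i + 1
            while j < len(s) and s[j].isalpha():
                j += 1
            name = s[i + 1:j]
            if name in RULES:
                nargs, fmt = RULES[name]
                args = []
                for _ in range(nargs):
                    if j >= len(s) or s[j] != "{":
                        raise ValueError("\\" + name + " is missing a brace argument")
                    body, j = read_group(s, j)
                    args.append(to_plain(body))  # arguments are converted first
                out.append(fmt(*args))
                i = j
            elif not name and s[i + 1:i + 2] in (",", ";", ":"):
                out.append(" ")  # spacing commands become a plain space
                i += 2
            else:  # unknown macro or control symbol: keep verbatim
                j = max(j, i + 2)
                out.append(s[i:j])
                i = j
        elif c == "~":
            out.append(" ")
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(to_plain(r"\frac{a+b}{\text{c}}"))
print(to_plain(r"10\,\mathrm{Ge\text{V}}"))
```

Because each argument is converted before its macro's formatter runs, wrappers like `\text` nested inside `\frac` disappear first, and `\frac` can then decide whether its numerator or denominator needs parentheses.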

Recommended answer

I understand this is an old post, but since it comes up often in latex-python-parsing searches (as evidenced by "Extract only body text from arXiv articles formatted as .tex"), I'm leaving this here for folks down the line. Here's a LaTeX parser in Python that supports search over and modification of the parse tree: https://github.com/alvinwan/texsoup. Taken from the README, here is sample text and how you can interact with it via TexSoup.

from TexSoup import TexSoup
# Raw string: otherwise Python would treat \b, \t and \\ in the LaTeX source as escape sequences.
soup = TexSoup(r"""
\begin{document}

\section{Hello \textit{world}.}

\subsection{Watermelon}

(n.) A sacred fruit. Also known as:

\begin{itemize}
\item red lemon
\item life
\end{itemize}

Here is the prevalence of each synonym.

\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}

\end{document}
""")

Here's how to navigate the parse tree.

>>> soup.section  # grabs the first `section`
\section{Hello \textit{world}.}
>>> soup.section.name
'section'
>>> soup.section.string
'Hello \\textit{world}.'
>>> soup.section.parent.name
'document'
>>> soup.tabular
\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}
>>> soup.tabular.args[0]
'c c'
>>> soup.item
\item red lemon
>>> list(soup.find_all('item'))
[\item red lemon, \item life]

Disclaimer: I wrote this lib, but it was for similar reasons. Regarding the post by Little Bobby Tales (regarding def), TexSoup doesn't handle definitions.
