How do I programmatically inspect an HTML document?
Problem description
I have a database full of small HTML documents, and I need to programmatically insert several of them into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason: honouring <b> tags is a must, while CSS such as <span style="blah"> is a nice-to-have).
Both iText and Aspose work (roughly) along these lines:
Document document = new Document( Size.A4, Aspect.PORTRAIT );
document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );
Therefore (I think) I need some kind of HTML parser that I can inspect for strings and styles to insert into my document.
Can anybody suggest a good library or a sensible approach to this problem? The platform is Java.
Recommended answer
HTMLparser is a good HTML parser.
I have used this to parse HTML on one of my projects.
You can write your own filters to parse the HTML for what you want, so the <br> tag shouldn't be difficult to parse out.
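To make the "parse, then insert with the right style" idea concrete, here is a minimal sketch. It does not use HTMLparser (its filter API is not shown above). As a stand-in it uses the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`), which reports tags and text through callbacks. The class name `HtmlRuns` and the `Run` type are hypothetical names made up for this illustration. Only `<b>`, `<strong>` and `<br>` are handled; CSS in `style` attributes is out of scope here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlRuns {

    /** Hypothetical value type: a piece of text plus whether it was inside a bold tag. */
    public static final class Run {
        public final String text;
        public final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    /** Splits an HTML fragment into text runs, remembering whether each run is bold. */
    public static List<Run> parse(String html) {
        final List<Run> runs = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Counts open <b>/<strong> tags so nested bold tags are handled.
            private int boldDepth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) {
                    boldDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.B || tag == HTML.Tag.STRONG) && boldDepth > 0) {
                    boldDepth--;
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Empty elements such as <br> arrive here; record them as a line break.
                if (tag == HTML.Tag.BR) {
                    runs.add(new Run("\n", boldDepth > 0));
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                runs.add(new Run(new String(data), boldDepth > 0));
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return runs;
    }

    public static void main(String[] args) {
        for (Run run : parse("some string <b>A bold string</b>")) {
            System.out.println((run.bold ? "[bold] " : "[plain] ") + run.text);
        }
    }
}
```

Each `Run` can then be replayed against the document API from the question: call the equivalent of `document.setBold( run.bold )` followed by `document.insert( run.text )`. The same callback structure should carry over to other parsers, with other tags tracked in the same way as `boldDepth`.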