How do I programmatically inspect an HTML document
I have a database full of small HTML documents and I need to programmatically insert several of them into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason: honouring <b> tags is a must, CSS like <span style="blah"> is a nice-to-have).
Both iText and Aspose work (roughly) along these lines:
Document document = new Document( Size.A4, Aspect.PORTRAIT );
document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );
Therefore (I think) I need some kind of HTML parser which I can inspect for strings and styles to insert into my document.
Can anybody suggest a good library or a sensible approach to this problem? The platform is Java.
HTMLparser is a good HTML parser.
I have used it to parse HTML on one of my projects.
You can write your own filters to parse the HTML for what you want, so the <br> tag shouldn't be difficult to parse out.
You can parse out CSS using the CssSelectorNodeFilter.
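To give a rough idea of the "walk the HTML and collect strings plus styles" approach, here is a minimal sketch. Note that it does not use HTMLparser: to stay dependency-free it swaps in the HTML parser that ships with the JDK (javax.swing.text.html), which is old and lenient but fine for small fragments. The names HtmlRuns, Run and parse are made up for this illustration, and only <b>/<strong> are tracked; anything else (including CSS) is an exercise left open.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: splits an HTML fragment into text runs with a bold flag.
public class HtmlRuns {

    /** A piece of text plus the bold state that was in effect when it was read. */
    public static final class Run {
        public final String text;
        public final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    /** Parses the fragment with the JDK's built-in parser and returns the runs in document order. */
    public static List<Run> parse(String html) {
        final List<Run> runs = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Depth counter so that nested <b><strong>...</strong></b> is handled.
            private int boldDepth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) {
                    boldDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.B || tag == HTML.Tag.STRONG) && boldDepth > 0) {
                    boldDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Record the text together with the style state at this point.
                runs.add(new Run(new String(data), boldDepth > 0));
            }
        };

        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return runs;
    }
}
```

With the runs in hand, the loop on the document side would presumably be something like calling document.setBold( run.bold ) followed by document.insert( run.text ) for each run, using whatever the real iText or Aspose calls turn out to be. For the <span style="blah"> case, the attrs argument of handleStartTag should expose the raw style string via attrs.getAttribute(HTML.Attribute.STYLE), but you would still have to interpret the CSS yourself.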