How do I programmatically inspect an HTML document?

Problem description

I have a database full of small HTML documents and I need to programmatically insert several into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason: honouring <b> tags is a must, while CSS like <span style="blah"> is a nice-to-have).

Both iText and Aspose work (roughly) along these lines:

Document document = new Document( Size.A4, Aspect.PORTRAIT );

document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );

Therefore (I think) I need some kind of HTML parser which I can inspect for strings and styles to insert into my document.

Can anybody suggest a good library or a sensible approach to this problem? The platform is Java.

Solution

HTMLparser is a good HTML parser.

I have used this to parse HTML on one of my projects.

You can write your own filters to parse the HTML for what you want, so the <br> tag shouldn't be difficult to parse out.

You can parse out CSS using the CssSelectorNodeFilter.
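To illustrate the general idea of walking the HTML and tracking the current style, here is a minimal sketch. Note that it does not use HTMLparser: to stay dependency-free it swaps in the HTML parser that ships with the JDK (javax.swing.text.html). The class name HtmlRuns, the Run type and the extractRuns method are hypothetical names invented for this example, and only <b> and <strong> are tracked. Inline CSS such as <span style="..."> would need extra handling of the tag attributes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlRuns {

    /** One piece of text plus the bold state in effect when it was read (hypothetical type). */
    public static final class Run {
        public final String text;
        public final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    /** Parses an HTML fragment and returns its text split into bold and non-bold runs. */
    public static List<Run> extractRuns(String html) throws IOException {
        final List<Run> runs = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Depth counter so that nested <b>/<strong> tags are handled correctly.
            private int boldDepth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) {
                    boldDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.B || tag == HTML.Tag.STRONG) && boldDepth > 0) {
                    boldDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Record the text together with the style that is active right now.
                runs.add(new Run(new String(data), boldDepth > 0));
            }
        };

        // The third argument tells the parser to ignore any charset declared in the HTML.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return runs;
    }

    public static void main(String[] args) throws IOException {
        for (Run run : extractRuns("<p>some string <b>A bold string</b></p>")) {
            System.out.println((run.bold ? "[bold] " : "[plain] ") + run.text);
        }
    }
}
```

Each Run could then be fed to the document API from the question, for example by calling document.setBold( run.bold ) before document.insert( run.text ). The same pattern should carry over to an HTMLparser-based solution, with a filter or visitor taking the place of the callback.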
