How to extract web page textual content in Java?


Question

I am looking for a way to extract text from a web page (initially HTML) using the JDK or another library. Please help.

Thanks

Recommended answer

Use an HTML parser if at all possible; there are many available for Java.
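As a rough illustration of the parser route using only the JDK, the sketch below feeds HTML to the Swing HTML parser (`javax.swing.text.html.parser.ParserDelegator`) and collects the text nodes it reports. The class name `HtmlTextExtractor`, the method `extractText`, and the sample markup are made up for this example. The Swing parser targets an old HTML dialect, so treat this as a starting point; a dedicated third-party parser is likely to cope better with modern pages.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for this example only.
class HtmlTextExtractor {

    /** Returns the text nodes found by the JDK's HTML parser, separated by spaces. */
    static String extractText(Reader html) throws IOException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // 'true' tells the parser to ignore charset declarations inside the document,
        // since the Reader has already decoded the bytes.
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample markup, assumed for illustration.
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

To process a live page, the same method can be given a `Reader` over the response body (for instance an `InputStreamReader` wrapping `URL.openStream()`), with the character encoding chosen to match the page.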

Or you can use regex, as many people do. This is generally not advisable, however, unless you're doing very simple processing.
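For the very simple cases where regex may be acceptable, a minimal sketch could look like the following. The class name `RegexTagStripper` and the method `stripTags` are hypothetical. The pattern removes anything that looks like a tag and collapses whitespace, and nothing more: it does not decode entities such as `&amp;`, it keeps the contents of `<script>` and `<style>` elements, and it misfires when an attribute value contains `>`.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class RegexTagStripper {

    // Matches anything from '<' up to the next '>'; a crude stand-in for "a tag".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Replaces tag-like spans with spaces, then collapses runs of whitespace. */
    static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints "Hello world"
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

Those limitations are the reason a parser is the safer default for anything beyond trivial input.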

  • Java HTML Parsing
  • Which Html Parser is best?
  • Any good Java HTML parsers?
  • recommendations for a java HTML parser/editor
  • What HTML parsing libraries do you recommend in Java

Text extraction:

  • Text Extraction from HTML Java
  • Text extraction with java html parsers

Tag stripping:

  • Stripping HTML tags in Java
  • How to strip HTML attributes except "src" and "alt" in JAVA
  • Removing HTML from a Java String
