How to Extract Content From HTML


Question

I have HTML as a string and I want to extract just the "post_title" value from it. This is the HTML string:

<div class="hidden" id="inline_49">
<div class="post_title">Single parenting</div>
<div class="post_name">single-parenting</div>
<div class="post_author">90307285</div>
<div class="comment_status">open</div>
<div class="ping_status">open</div>
<div class="_status">publish</div>
<div class="jj">20</div>
<div class="mm">07</div>
<div class="aa">2015</div>
<div class="hh">00</div>
<div class="mn">52</div>
<div class="ss">33</div>

It contains the post title "Single parenting", which is what I want to extract. This is what I am using:

Elements link = doc.select("div[class=post_title]");
String title = link.text();

But this gives a blank string. I also tried:

Elements link = doc.select("div[id=inline_49]").select("div[class=post_title]");
String title = link.text();

This also gives a blank string. Which selector do I need to use to extract the title?

Recommended Answer

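You must include the login cookies in your request. The selector itself looks fine; edit.php is a WordPress admin page, so a request sent without a logged-in session most likely returns a page that does not contain the post_title element at all. Check this Java code: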

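// Imports needed by this snippet
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

try {
    String url = "https://ssblecturate.wordpress.com/wp-login.php";

    // Log in first; the response carries the session cookies.
    Connection.Response response = Jsoup.connect(url)
            .data("log", "your_login_here")    // your WordPress login
            .data("pwd", "your_password_here") // your WordPress password
            .data("rememberme", "forever")
            .data("wp-submit", "Log In")
            .method(Connection.Method.POST)
            .followRedirects(true)
            .execute();

    // Reuse the login cookies when requesting the admin page.
    Document document = Jsoup.connect("https://ssblecturate.wordpress.com/wp-admin/edit.php")
            .cookies(response.cookies())
            .get();

    // first() returns null when nothing matches, so check before calling text().
    Element titleElement = document.select("div[class=post_title]").first();
    if (titleElement != null) {
        System.out.println(titleElement.text());
    }
} catch (IOException e) {
    e.printStackTrace();
}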
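
To confirm that the selector is not the problem, one option is to parse the HTML string from the question locally, without any network request. The sketch below assumes jsoup is on the classpath; the class name PostTitleExtractor and the method extractTitle are made up for illustration, and the HTML is a shortened copy of the snippet from the question with the closing tag of the outer div added.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PostTitleExtractor {

    // Hypothetical helper: returns the text of the first div.post_title,
    // or null when the document contains no such element.
    static String extractTitle(String html) {
        Document doc = Jsoup.parse(html);
        Element title = doc.select("div.post_title").first();
        return title == null ? null : title.text();
    }

    public static void main(String[] args) {
        // Shortened version of the HTML from the question.
        String html = "<div class=\"hidden\" id=\"inline_49\">"
                + "<div class=\"post_title\">Single parenting</div>"
                + "<div class=\"post_name\">single-parenting</div>"
                + "</div>";

        // The element is present, so its text is found.
        System.out.println(extractTitle(html));

        // A page without the element (for example a login page) yields null.
        System.out.println(extractTitle("<p>Please log in</p>"));
    }
}
```

If the first call prints the title while the live request still gives a blank string, the fetched page probably does not contain the element, which is consistent with the missing login cookies described above.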
