How to get the comments in an HTML page while scraping?


Question

Here's the issue. I'm trying to scrape this Facebook About page for the birthday date. When I view the page source in the browser, the birthday date appears inside an HTML comment within a div with class="hidden_elem".

Possibly because of this, when I fetch the page with a GET request (using selenium, scrapy, or requests), all I get is an empty div with class="hidden_elem". The comment is nowhere to be seen, let alone available to parse for information.

So how do I get this text? If possible, please also show how to extract the birthday date.

There might be some JavaScript on the Facebook page that causes this by design. How can I get around it?

Here is the URL from which I'm trying to get the birthday date: https://www.facebook.com/profile.php?id=100004456147835&sk=about

In the browser's page source it looks like this:

<div class="hidden_elem"><code id="u_0_2g"><!-- <ul class="uiList _54nz _4kg _4kt" data-pnref="about"><li><div class="_5aj7"><div class="_4bl9"><div class="_54n- _2pi3"><div id="u_0_2e"></div></div></div><div class="_4bl7"><div class="_4ms4" id="u_0_2a"><div class="clearfix _ikh _5c0g" data-pnref="overview" id="u_0_2f"><div class="_4bl7"><ul class="uiList _1pi3 _4kg _6-h _703 _4ks"><li class="_3pw9 _2pi4"><div class="clearfix _4bbo" role="button" tabindex="0"><div class="_5rsw _3-91 _8o lfloat _ohe"><i class="_5rsx img sp_yw06AF9sktb sx_344683"></i></div><div class="_42ef"><div class="_6a"><div class="_6a _6b" style="height:36px"></div><div class="_6a _6b"><span class="_50f8 _2iem">No workplaces to show</span></div></div></div></div></li><li id="u_0_2b"><div class="clearfix _5y02" data-overviewsection="education" role="button" tabindex="0"><a class="_5uat _3-91 _8o lfloat _ohe" tabindex="-1" aria-hidden="true" href="https://www.facebook.com/pages/Cambridge-Institute-of-technolagy/133870693705509" data-hovercard="/ajax/hovercard/page.php?id=133870693705509" data-hovercard-prefer-more-content-show="1"><img class="_s0 _4ooo _54ru img" src="https://scontent.fblr6-1.fna.fbcdn.net/v/t1.0-1/c9.0.32.32/p32x32/580846_10149999285985791_1565762244_n.png?oh=d4ccc6a667e53f20db9cf60c0742f989&amp;oe=5B1420C5" alt="" aria-label="Cambridge Institute of technolagy" role="img" /></a><div class="_42ef"><div class="_6a _5u5j _6b"><div class="_c24 _50f4">Studies at <a class="profileLink" href="https://www.facebook.com/pages/Cambridge-Institute-of-technolagy/133870693705509" data-hovercard="/ajax/hovercard/page.php?id=133870693705509" data-hovercard-prefer-more-content-show="1">Cambridge Institute of technolagy</a></div><div><div><div class="_50f8 _2ieq"><div class="fsm fwn fcg">Past: <a class="profileLink" href="https://www.facebook.com/deekshaintegrated/" data-hovercard="/ajax/hovercard/page.php?id=176180289071224" data-hovercard-prefer-more-content-show="1">Deeksha Integrated</a> and 
<a class="profileLink" href="https://www.facebook.com/pages/chethana-vidya-mandiratumkur/378826618888908" data-hovercard="/ajax/hovercard/page.php?id=378826618888908" data-hovercard-prefer-more-content-show="1">chethana vidya mandira,tumkur</a></div></div></div></div></div></div></div></li><li id="u_0_2c"><div class="clearfix _5y02" data-overviewsection="places" role="button" tabindex="0"><a class="_5uat _3-91 _8o lfloat _ohe" tabindex="-1" aria-hidden="true" href="https://www.facebook.com/pages/Bangalore-India/106377336067638" data-hovercard="/ajax/hovercard/page.php?id=106377336067638" data-hovercard-prefer-more-content-show="1"><img class="_s0 _4ooo _54ru img" src="https://external.fblr6-1.fna.fbcdn.net/safe_image.php?d=AQCKH3kcP1-A2NPe&amp;w=32&amp;h=32&amp;url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F8%2F80%2FBangaloreMontage.png&amp;cfs=1&amp;fallback=hub_city&amp;f&amp;_nc_hash=AQDbJ1ytdhSz3E8E" alt="" aria-label="Bangalore, India" role="img" /></a><div class="_42ef"><div class="_6a _5u5j _6b"><div class="_c24 _50f4">Lives in <a class="profileLink" href="https://www.facebook.com/pages/Bangalore-India/106377336067638" data-hovercard="/ajax/hovercard/page.php?id=106377336067638" data-hovercard-prefer-more-content-show="1">Bangalore, India</a></div><div><div><div class="_50f8 _2ieq"><div class="fsm fwn fcg"><span id="u_0_2d">From <span class="fwb"><a class="profileLink" href="https://www.facebook.com/pages/Tumkur/106525352717093" data-hovercard="/ajax/hovercard/page.php?id=106525352717093" data-hovercard-prefer-more-content-show="1">Tumkur</a></span></span></div></div></div></div></div></div></div></li><li class="_3pw9 _2pi4"><div class="clearfix _4bbo" role="button" tabindex="0"><div class="_5rsw _3-91 _8o lfloat _ohe"><i class="_5rsx img sp_yw06AF9sktb sx_585866"></i></div><div class="_42ef"><div class="_6a"><div class="_6a _6b" style="height:36px"></div><div class="_6a _6b"><span class="_50f8 _2iem">No relationship info to 
show</span></div></div></div></div></li></ul></div><div class="_4bl9 _zu9"><ul class="uiList _5yql _4kg" data-overviewsection="contact_basic" role="button" tabindex="0"><li class="_4tnv _2pif"><div class="clearfix _ikh"><div class="_4bl7"><div class="_pvf _5pmc"><i class="img sp_yw06AF9sktb sx_e0cf75"></i></div></div><div class="_4bl9 _2pis _2dbl"><span class="_c24 _2ieq"><div><span class="accessible_elem">Birthday</span></div><div>April 28, 1998</div></span></div></div></li></ul></div></div></div></div></div></li></ul> --></code></div>

When I get the page source from my script, only <div class="hidden_elem"> </div> comes back.

Recommended answer

You need to scroll down the page with:

String s = "window.scrollBy(0,document.body.scrollHeight || document.documentElement.scrollHeight)";
ScriptResult sr = page.executeJavaScript(s);
LOG.info("Result= " + sr.getJavaScriptResult());

After that, you will be able to get the list of "hidden_elem" objects:

// Select every div whose class attribute contains "hidden_elem".
String xpathHiddenElem = "//div[contains(@class, 'hidden_elem')]";
List<Object> responseHiddenElem = page.getByXPath(xpathHiddenElem);
LOG.info("responseHiddenElem: {}", responseHiddenElem);
if (responseHiddenElem != null && responseHiddenElem.size() > 0) {
    for (Object element : responseHiddenElem) {
        HtmlDivision elementCasted = (HtmlDivision) element;
        // Log each property under its own label so the output is distinguishable.
        LOG.info("textContent: {}", elementCasted.getTextContent());
        LOG.info("asText: {}", elementCasted.asText());
        LOG.info("tagName: {}", elementCasted.getTagName());
        LOG.info("index: {}", elementCasted.getIndex());
    }
}
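
Because the birthday sits inside an HTML comment, the text content of the div may not include it, since comment nodes are normally excluded from an element's text. Once the raw markup of the "hidden_elem" div is available as a string, one possible way to recover the date is to pull out the comment bodies with a regular expression and search them. The following is a minimal sketch using only the Java standard library. The class name HiddenCommentParser, its method names, and the shortened sample markup are hypothetical and not part of the original answer, and the birthday pattern assumes the exact "Birthday" label layout shown in the question, which Facebook may change at any time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts HTML comment bodies from raw markup
// and looks for the birthday text inside them.
class HiddenCommentParser {
    // Matches one HTML comment; DOTALL lets the body span line breaks.
    private static final Pattern COMMENT =
        Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    // Assumes the layout from the question: a "Birthday" label span,
    // then the next <div> holding the date text.
    private static final Pattern BIRTHDAY =
        Pattern.compile("Birthday</span>\\s*</div>\\s*<div>([^<]+)</div>");

    // Returns the trimmed body of every HTML comment found in the markup.
    static List<String> extractComments(String html) {
        List<String> comments = new ArrayList<>();
        Matcher m = COMMENT.matcher(html);
        while (m.find()) {
            comments.add(m.group(1).trim());
        }
        return comments;
    }

    // Returns the first birthday string found inside any comment, if present.
    static Optional<String> findBirthday(String html) {
        for (String comment : extractComments(html)) {
            Matcher m = BIRTHDAY.matcher(comment);
            if (m.find()) {
                return Optional.of(m.group(1).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question.
        String html = "<div class=\"hidden_elem\"><code id=\"u_0_2g\"><!-- <ul><li>"
            + "<span class=\"_c24 _2ieq\"><div>"
            + "<span class=\"accessible_elem\">Birthday</span></div>"
            + "<div>April 28, 1998</div></span></li></ul> --></code></div>";
        System.out.println(findBirthday(html).orElse("not found"));
    }
}
```

A regular expression is used here only to keep the sketch dependency-free. For anything beyond a quick extraction, re-parsing the comment body with a real HTML parser is likely more robust than matching on tag layout.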
