Python BeautifulSoup: Text from the html (web) page not shown while soup.find_all(..)


Question

I am learning data scraping with IndiaBix.com and am trying to fetch all the questions along with their options and answers. I can get the questions and the options, but not the answers. The answer markup looks like this:

<div class="div-spacer">
                        <p><span class="ib-green"><b>Answer:</b></span> Option <b class="jq-hdnakqb">A</b></p> 
                        <p><span class="ib-green"><b>Explanation:</b></span></p> 
                        <p> No answer description available for this question. <b><a href="discussion-553">Let us discuss</a></b>. </p> 
                    </div>

In the code

<b class="jq-hdnakqb">A</b>

the text 'A' is not fetched by the parser.

The IndiaBix page link is as follows: Click here

In the browser's Inspect Element view the text 'A' is visible, but BeautifulSoup does not return it.

Kindly help me with this. I am new to Python.

Recommended answer

This is a collaborative effort. I used alecxe's BeautifulSoup gist to fetch and pretty-print the questions, and then I did the de-obfuscation necessary to get the answers:

import requests
from bs4 import BeautifulSoup

url = "http://www.indiabix.com/computer-science/operating-systems-concepts/013001"
data = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
}).content
soup = BeautifulSoup(data, "html.parser")

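# The answers sit in a 40-character string stored in a hidden input
answers_string = soup.find("input", {"id": "hdnAjaxImageCacheKey"})["value"]
# Reverse the string, take characters 17..21 and uppercase them: one letter per question
answers = answers_string[::-1][17:22].upper()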

# iterate over questions
for num, question_block in enumerate(soup.select(".bix-div-container")):
    question = question_block.select(".bix-td-qtxt")[0].get_text(strip=True)
    print(question + "\n")

    # iterate over answers
    for answer_block in question_block.select(".bix-tbl-options tr"):
        number, answer = answer_block.select(".bix-td-option")

        print(number.get_text(), answer.get_text())

    print("\nANSWER: " + answers[num])
    print("----")

The site does some funky evaling (found in this script) and reads the answers from a 40-character string in a hidden input:

/* Load Images Indirectly For Better User Experience */
try{eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};while(c--){if(k[c]){p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c])}}return p}('0 5=l.o.h.2(\'\').8().7(\'\').e("m"+"n"+"q");g(5>-1){0 d=$(\'4\'+\'#\'+\'3\'+\'p\'+\'k\'+\'f\'+\'j\').r().6(C).2(\'\').8().7(\'\').6(B).s().2(\'\');0 c=$(\'4\'+\'.\'+\'a\'+\'-\'+\'3\'+\'D\');0 9=$(\'b\'+\'.\'+\'a\'+\'-\'+\'3\'+\'A\'+\'z\');u.t(d,w(i,v){c[i].x=v;9[i].y=v})}',40,40,'var||split|hdn|input|intPos|substr|join|reverse|arrImageViews|jq||arrImagePorts|arrImageCount|indexOf|Cache|if|href||Key|Image|window|xi|baid|location|Ajax|ni|val|toUpperCase|each|jQuery||function|value|innerHTML|qb|ak|17|18|akq'.split('|')))}catch(err){}

Note the silly attempt at misleading the curious visitor via comments :D

/* Load Images Indirectly For Better User Experience */

When eval'ed, simplified, and cut down to size, it looks like this:

var arrImageCount=$('input'+'#'+'hdn'+'Ajax'+'Image'+'Cache'+'Key').val().substr(18).split('').reverse().join('').substr(17).toUpperCase().split('');

var arrImagePorts=$('input'+'.'+'jq'+'-'+'hdn'+'akq');

var arrImageViews=$('b'+'.'+'jq'+'-'+'hdn'+'ak'+'qb');

jQuery.each(arrImageCount,function(i,v){arrImagePorts[i].value=v;arrImageViews[i].innerHTML=v})
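What this script means for a parser can be reproduced offline. The sketch below is only an illustration: the HTML is a hypothetical stand-in for what the server sends (the answer element empty, the letters tucked into the hidden input), the 40-character value is made up, and the standard library's html.parser is swapped in for BeautifulSoup so the example has no third-party dependency.

```python
from html.parser import HTMLParser

# Made-up 40-character value; on the real page this comes from the server.
SAMPLE_VALUE = "x" * 18 + "edcba" + "y" * 17

# Hypothetical stand-in for the served HTML, before any JavaScript has run.
SERVED_HTML = (
    '<input type="hidden" id="hdnAjaxImageCacheKey" value="' + SAMPLE_VALUE + '">\n'
    '<p>Answer: Option <b class="jq-hdnakqb"></b></p>\n'
)


class AnswerProbe(HTMLParser):
    """Collects the hidden input's value and the text inside b.jq-hdnakqb."""

    def __init__(self):
        super().__init__()
        self.hidden_value = None
        self.answer_text = ""
        self._in_answer = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("id") == "hdnAjaxImageCacheKey":
            self.hidden_value = attrs.get("value")
        elif tag == "b" and attrs.get("class") == "jq-hdnakqb":
            self._in_answer = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_answer = False

    def handle_data(self, data):
        if self._in_answer:
            self.answer_text += data


probe = AnswerProbe()
probe.feed(SERVED_HTML)
print(repr(probe.answer_text))  # nothing for the parser to read here
print(probe.hidden_value)       # the encoded answers are here instead
```

A parser only ever sees the served markup, so the letter that Inspect Element shows is not there to be read; the hidden input is the place to look.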

Tip: if you're afraid of evaling random JS (you should be), replace eval with print (or console.log in a browser console or Node.js) so the unpacked source is printed instead of executed.
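For readers who would rather not run the script at all, the packer's token substitution can also be reproduced in Python. This is a rough sketch, assuming the p,a,c,k,e,d layout seen in the script above; the function names encode_index and unpack are hypothetical, and the packed fragment and key list are copied from that script.

```python
import re

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def encode_index(c, a):
    """Mirror the packer's e(c): the token that stands for word number c."""
    prefix = "" if c < a else encode_index(c // a, a)
    c = c % a
    # Values above 35 map to 'A', 'B', ...; the rest are base-36 digits.
    return prefix + (chr(c + 29) if c > 35 else DIGITS[c])


def unpack(p, a, c, k):
    """Replace every token in p with its word from k, without evaluating anything."""
    for i in range(c - 1, -1, -1):
        if k[i]:  # an empty entry means the token stands for itself
            pattern = r"\b" + re.escape(encode_index(i, a)) + r"\b"
            p = re.sub(pattern, lambda m, word=k[i]: word, p)
    return p


# Key list copied from the packed script.
KEYS = (
    "var||split|hdn|input|intPos|substr|join|reverse|arrImageViews|jq||"
    "arrImagePorts|arrImageCount|indexOf|Cache|if|href||Key|Image|window|xi|"
    "baid|location|Ajax|ni|val|toUpperCase|each|jQuery||function|value|"
    "innerHTML|qb|ak|17|18|akq"
).split("|")

# One statement taken from the packed payload.
PACKED = "0 d=$('4'+'#'+'3'+'p'+'k'+'f'+'j').r().6(C).2('').8().7('').6(B).s().2('');"

unpacked = unpack(PACKED, 40, 40, KEYS)
print(unpacked)
```

The result is plain text, so it can be read and simplified by hand without the page's code ever being executed.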

Anyway, the code is pretty simple. It does the following:


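  • Gets the string from the hidden input field

  • Reverses it

  • Extracts the 5 characters at indices 17 through 21 of the reversed string (the slice 17:22) and uppercases them

  • Splits them into an array

  • Adds the array contents, i.e. the answers, to the 5 questions on the page using jQuery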

This is easily mimicked in Python like so:

answers_string = soup.find("input", {"id": "hdnAjaxImageCacheKey"})["value"]
answers = answers_string[::-1][17:22].upper()
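The slice can be checked against the jQuery chain one call at a time. The sketch below uses a made-up 40-character sample rather than a value from the site, and the function names are hypothetical.

```python
def answers_like_js(value):
    """Follow the simplified jQuery chain one call at a time."""
    tail = value[18:]              # .substr(18)
    reversed_tail = tail[::-1]     # .split('').reverse().join('')
    letters = reversed_tail[17:]   # .substr(17)
    return list(letters.upper())   # .toUpperCase().split('')


def answers_shortcut(value):
    """The one-line Python version used in the answer."""
    return list(value[::-1][17:22].upper())


# Made-up sample: the answer letters sit, reversed, at indices 18..22.
sample = "x" * 18 + "edcba" + "y" * 17

print(answers_like_js(sample))
print(answers_shortcut(sample))
```

The two agree when the hidden value is exactly 40 characters long, as described above. For any other length they can diverge, so checking len(answers_string) == 40 before slicing is a cheap safeguard.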
