Python BeautifulSoup: Text from the HTML (web) page not shown while using soup.find_all(..)

Views: 510. Published: 2016/8/5 19:05:41. Tags: python, beautifulsoup


Question

I am studying data scraping on IndiaBix.com and trying to fetch all the questions along with their options and answers. I managed to get the questions and the options, but I am unable to fetch the answers. The answer markup looks like this:

<div class="div-spacer">
    <p><span class="ib-green"><b>Answer:</b></span> Option <b class="jq-hdnakqb">A</b></p>
    <p><span class="ib-green"><b>Explanation:</b></span></p>
    <p> No answer description available for this question. <b><a href="discussion-553">Let us discuss</a></b>. </p>
</div>

In this line of the markup:

<b class="jq-hdnakqb">A</b>

the text 'A' is not fetched by the parser.

The IndiaBix page link is as follows: Click here

In the browser's Inspect Element view the text 'A' is visible, but BeautifulSoup does not return that text when parsing the page.

Kindly help me with this. I am new to Python.

Recommended answer

This is a collaborative effort. I used alecxe's BeautifulSoup gist to fetch and pretty-print the questions, and then did the de-obfuscation necessary to get the answers:

import requests
from bs4 import BeautifulSoup

url = "http://www.indiabix.com/computer-science/operating-systems-concepts/013001"
data = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
}).content
soup = BeautifulSoup(data, "html.parser")

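# The hidden input holds a 40-character string that encodes the answers
answers_string = soup.find_all("input", {"id": "hdnAjaxImageCacheKey"})[0]["value"]
# Reverse the string and take characters 17..21: one answer letter per question
answers = answers_string[::-1][17:22].upper()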

# iterate over questions
for num, question_block in enumerate(soup.select(".bix-div-container")):
    question = question_block.select(".bix-td-qtxt")[0].get_text(strip=True)
    print(question + "\n")

    # iterate over answers
    for answer_block in question_block.select(".bix-tbl-options tr"):
        number, answer = answer_block.select(".bix-td-option")

        print(number.get_text(), answer.get_text())

    print("\nANSWER: " + answers[num])
    print("----")

The site does some funky eval'ing (found in this script) and fetches the answers from a 40-character string in a hidden input:

/* Load Images Indirectly For Better User Experience */
try{eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};while(c--){if(k[c]){p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c])}}return p}('0 5=l.o.h.2(\'\').8().7(\'\').e("m"+"n"+"q");g(5>-1){0 d=$(\'4\'+\'#\'+\'3\'+\'p\'+\'k\'+\'f\'+\'j\').r().6(C).2(\'\').8().7(\'\').6(B).s().2(\'\');0 c=$(\'4\'+\'.\'+\'a\'+\'-\'+\'3\'+\'D\');0 9=$(\'b\'+\'.\'+\'a\'+\'-\'+\'3\'+\'A\'+\'z\');u.t(d,w(i,v){c[i].x=v;9[i].y=v})}',40,40,'var||split|hdn|input|intPos|substr|join|reverse|arrImageViews|jq||arrImagePorts|arrImageCount|indexOf|Cache|if|href||Key|Image|window|xi|baid|location|Ajax|ni|val|toUpperCase|each|jQuery||function|value|innerHTML|qb|ak|17|18|akq'.split('|')))}catch(err){}

Note the silly attempt at misleading the curious visitor via comments :D

/* Load Images Indirectly For Better User Experience */

When eval'ed, simplified, and cut down to size, it looks like this:

var arrImageCount=$('input'+'#'+'hdn'+'Ajax'+'Image'+'Cache'+'Key').val().substr(18).split('').reverse().join('').substr(17).toUpperCase().split('');

var arrImagePorts=$('input'+'.'+'jq'+'-'+'hdn'+'akq');

var arrImageViews=$('b'+'.'+'jq'+'-'+'hdn'+'ak'+'qb');

jQuery.each(arrImageCount,function(i,v){arrImagePorts[i].value=v;arrImageViews[i].innerHTML=v})
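Since the script only writes the letters into the page when it runs in a browser, a parser that reads the raw response has nothing to find inside the element. The sketch below illustrates this with the standard-library html.parser on made-up markup. The empty <b> element, the hidden input value, and the StaticPageProbe class are all assumptions for illustration, not the real page content:

```python
from html.parser import HTMLParser

# Made-up 40-character value; the real page carries its own string.
SAMPLE_VALUE = "x" * 18 + "abcde" + "y" * 17

# Assumed shape of the markup as served, before any JavaScript has run.
SAMPLE_HTML = (
    '<input type="hidden" id="hdnAjaxImageCacheKey" value="' + SAMPLE_VALUE + '">'
    '<p>Option <b class="jq-hdnakqb"></b></p>'
)


class StaticPageProbe(HTMLParser):
    """Collects the hidden input value and any text inside the answer element."""

    def __init__(self):
        super().__init__()
        self.hidden_value = None
        self.answer_text = ""
        self._in_answer = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("id") == "hdnAjaxImageCacheKey":
            self.hidden_value = attrs.get("value")
        elif tag == "b" and attrs.get("class") == "jq-hdnakqb":
            self._in_answer = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_answer = False

    def handle_data(self, data):
        if self._in_answer:
            self.answer_text += data


probe = StaticPageProbe()
probe.feed(SAMPLE_HTML)
print(repr(probe.answer_text))   # empty: nothing has filled the element in
print(len(probe.hidden_value))   # the encoded string is still available
```

Under these assumptions the answer letter is absent from the parsed text, while the encoded string in the hidden input can still be read and decoded.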

Tip: if you're afraid of eval'ing random JS (you should be), replace eval with a print call (console.log in a browser console, for example).
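Another way to avoid running the script at all is to port its small unpacking function to Python. The following is a rough sketch of that idea, not a general-purpose unpacker: encode_token and unpack are names made up here, the keyword list is copied from the packed script above, and the payload is a single statement excerpted from it with the JavaScript quote escapes removed.

```python
import re

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def encode_token(n, base):
    """Port of the packer's e(c): the token that stands for keyword number n."""
    prefix = "" if n < base else encode_token(n // base, base)
    remainder = n % base
    # Values above 35 map to 'A', 'B', ... as String.fromCharCode(c + 29) does
    last = chr(remainder + 29) if remainder > 35 else DIGITS[remainder]
    return prefix + last


def unpack(payload, base, count, keywords):
    """Replace every whole-word token with its keyword, highest index first."""
    for index in range(count - 1, -1, -1):
        word = keywords[index]
        if word:  # an empty entry means the token is left unchanged
            pattern = r"\b" + re.escape(encode_token(index, base)) + r"\b"
            payload = re.sub(pattern, lambda match, w=word: w, payload)
    return payload


keywords = (
    "var||split|hdn|input|intPos|substr|join|reverse|arrImageViews|jq||"
    "arrImagePorts|arrImageCount|indexOf|Cache|if|href||Key|Image|window|xi|"
    "baid|location|Ajax|ni|val|toUpperCase|each|jQuery||function|value|"
    "innerHTML|qb|ak|17|18|akq"
).split("|")

# One statement from the packed payload, with \' unescaped to '
packed_excerpt = (
    "0 d=$('4'+'#'+'3'+'p'+'k'+'f'+'j')"
    ".r().6(C).2('').8().7('').6(B).s().2('');"
)

unpacked = unpack(packed_excerpt, 40, 40, keywords)
print(unpacked)
```

The printed statement should match the arrImageCount line of the simplified script, which makes it possible to read the logic without executing any of the site's JavaScript.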

Anyway, the code is pretty simple. It does the following:

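gets the string from the hidden input field

reverses it

extracts the 5 characters from index 17 up to (but not including) 22

splits them into an array

adds the array contents, i.e. the answers to the questions, to the 5 questions on the page using jQuery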

This is easily mimicked in Python like so:

answers_string = soup.find_all("input", {"id": "hdnAjaxImageCacheKey"})[0]["value"]
answers = answers_string[::-1][17:22].upper()
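As a quick sanity check, the JavaScript chain and the Python slice can be compared on a sample string. This is only a sketch: the 40-character value below is invented, and decode_like_js and decode_short are names made up for this example.

```python
def decode_like_js(value):
    """Follow the simplified JavaScript chain one call at a time."""
    tail = value[18:]                  # .substr(18)
    reversed_tail = tail[::-1]         # .split('').reverse().join('')
    letters = reversed_tail[17:]       # .substr(17)
    return list(letters.upper())       # .toUpperCase().split('')


def decode_short(value):
    """The one-line Python version: reverse, then slice indices 17..21."""
    return list(value[::-1][17:22].upper())


# Invented 40-character value: the five answer letters sit at indices 18..22
sample = "x" * 18 + "abcde" + "y" * 17

print(decode_like_js(sample))  # ['E', 'D', 'C', 'B', 'A']
print(decode_short(sample))    # ['E', 'D', 'C', 'B', 'A']
```

The two functions agree because the string is exactly 40 characters long; for a value of a different length the fixed slice [17:22] would no longer line up with the JavaScript chain.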
