Python BeautifulSoup: Text from the HTML (web) page not shown with soup.find_all(..)
Question
I am learning data scraping using IndiaBix.com and am trying to fetch all the questions along with their options and answers. I was able to get the questions and the options, but not the answers. The answer markup looks like this:
<div class="div-spacer">
<p><span class="ib-green"><b>Answer:</b></span> Option <b class="jq-hdnakqb">A</b></p>
<p><span class="ib-green"><b>Explanation:</b></span></p>
<p> No answer description available for this question. <b><a href="discussion-553">Let us discuss</a></b>. </p>
</div>
In the line
<b class="jq-hdnakqb">A</b>
the text 'A' is not fetched by the parser.
The IndiaBix page link is as follows: Click here
In the browser's Inspect Element view the text 'A' is visible, but BeautifulSoup does not return it.
Kindly help me with this. I am new to Python.
Recommended answer
This is a collaborative effort. I used alecxe's BeautifulSoup gist to fetch and pretty-print the questions, and then I did the de-obfuscation necessary to get the answers:
import requests
from bs4 import BeautifulSoup

url = "http://www.indiabix.com/computer-science/operating-systems-concepts/013001"
data = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
}).content
soup = BeautifulSoup(data, "html.parser")

# the answers are hidden inside the value of this input field
answers_string = soup.find_all("input", {"id": "hdnAjaxImageCacheKey"})[0]["value"]
answers = answers_string[::-1][17:22].upper()

# iterate over questions
for num, question_block in enumerate(soup.select(".bix-div-container")):
    question = question_block.select(".bix-td-qtxt")[0].get_text(strip=True)
    print(question + "\n")

    # iterate over the options of this question
    for answer_block in question_block.select(".bix-tbl-options tr"):
        number, answer = answer_block.select(".bix-td-option")
        print(number.get_text(), answer.get_text())

    print("\nANSWER: " + answers[num])
    print("----")
The site does some funky eval-ing (found in this script) and fills in the answers from a 40-character string in a hidden input:
/* Load Images Indirectly For Better User Experience */
try{eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};while(c--){if(k[c]){p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c])}}return p}('0 5=l.o.h.2(\'\').8().7(\'\').e("m"+"n"+"q");g(5>-1){0 d=$(\'4\'+\'#\'+\'3\'+\'p\'+\'k\'+\'f\'+\'j\').r().6(C).2(\'\').8().7(\'\').6(B).s().2(\'\');0 c=$(\'4\'+\'.\'+\'a\'+\'-\'+\'3\'+\'D\');0 9=$(\'b\'+\'.\'+\'a\'+\'-\'+\'3\'+\'A\'+\'z\');u.t(d,w(i,v){c[i].x=v;9[i].y=v})}',40,40,'var||split|hdn|input|intPos|substr|join|reverse|arrImageViews|jq||arrImagePorts|arrImageCount|indexOf|Cache|if|href||Key|Image|window|xi|baid|location|Ajax|ni|val|toUpperCase|each|jQuery||function|value|innerHTML|qb|ak|17|18|akq'.split('|')))}catch(err){}
Note the silly attempt at misleading the curious visitor via comments :D
/* Load Images Indirectly For Better User Experience */
When eval'ed, simplified, and cut down to size, it looks like this:
var arrImageCount=$('input'+'#'+'hdn'+'Ajax'+'Image'+'Cache'+'Key').val().substr(18).split('').reverse().join('').substr(17).toUpperCase().split('');
var arrImagePorts=$('input'+'.'+'jq'+'-'+'hdn'+'akq');
var arrImageViews=$('b'+'.'+'jq'+'-'+'hdn'+'ak'+'qb');
jQuery.each(arrImageCount,function(i,v){arrImagePorts[i].value=v;arrImageViews[i].innerHTML=v})
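The last line of that script is what writes each letter into the b.jq-hdnakqb elements, which would explain why a parser that never runs JavaScript finds them empty. Here is a minimal sketch of that difference using only the standard library. RAW_HTML is a hypothetical stand-in for what the server sends (the empty <b> is an assumption), and AnswerTextCollector and answer_texts are illustrative names, not part of any library:

```python
from html.parser import HTMLParser

# Assumed server-sent markup: the <b> is empty until the page's script fills it.
RAW_HTML = '<p><span class="ib-green"><b>Answer:</b></span> Option <b class="jq-hdnakqb"></b></p>'
# Markup as seen in the browser inspector, after the script has run.
RENDERED_HTML = '<p><span class="ib-green"><b>Answer:</b></span> Option <b class="jq-hdnakqb">A</b></p>'


class AnswerTextCollector(HTMLParser):
    """Collects the text found inside <b class="jq-hdnakqb"> elements."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "b" and "jq-hdnakqb" in classes:
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "b":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data


def answer_texts(html):
    """Return the text of every answer placeholder in the given markup."""
    parser = AnswerTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


print(answer_texts(RAW_HTML))       # the placeholder exists but holds no text
print(answer_texts(RENDERED_HTML))  # the letter is only there after the JS ran
```

If the placeholder comes back empty for the real page source as well, the letter has to be recovered from the hidden input instead.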
Tip: if you're afraid of eval-ing random JS (you should be), replace eval with print.
Anyway, the code is pretty simple. It does the following:
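- Gets the string from the hidden input field
- Reverses it
- Extracts the 5 characters at indices 17 up to (not including) 22 and uppercases them
- Splits them into an array
- Adds the array contents, i.e. the answers, to the 5 questions on the page using jQuery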
This is easily mimicked in Python like so:
answers_string = soup.find_all("input", {"id": "hdnAjaxImageCacheKey"})[0]["value"]
answers = answers_string[::-1][17:22].upper()
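The one-line slice is a shortcut for the chain of calls in the simplified script (substr(18), reverse, substr(17), toUpperCase, split). The sketch below follows that chain step by step and compares it with the slice. The sample value is made up for illustration, not taken from the site, and js_substr, answers_js_style and answers_slice_style are hypothetical helper names:

```python
def js_substr(text, start):
    """Mimic JavaScript's one-argument substr for a non-negative start."""
    return text[start:]


def answers_js_style(value):
    """Follow the simplified script one call at a time."""
    tail = js_substr(value, 18)           # .substr(18)
    reversed_tail = tail[::-1]            # .split('').reverse().join('')
    letters = js_substr(reversed_tail, 17).upper()  # .substr(17).toUpperCase()
    return list(letters)                  # .split('')


def answers_slice_style(value):
    """The shortcut used in the answer: reverse, then take indices 17..21."""
    return list(value[::-1][17:22].upper())


# Hypothetical 40-character value standing in for the hidden input's content.
sample = "0123456789abcdefghijklmnopqrstuvwxyzABCD"
assert len(sample) == 40

print(answers_js_style(sample))
print(answers_slice_style(sample))
```

Both functions return the same five letters only because the value is exactly 40 characters long. If the site ever changed the length of that string, the two would diverge and the slice indices would need to be recomputed.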