Requests.get showing different HTML than Chrome's Developer Tool


Question

I am working on a web scraping tool in Python (specifically in a Jupyter notebook) that scrapes a few real estate pages and saves data such as price, address, etc.

It works just fine for one of the pages I picked, but when I try to scrape this page: sreality.cz (sorry, the page is in Czech, but the actual content is not that important here) using requests.get(), I get this result:

<!doctype html>
<html lang="{{ html.lang }}" ng-app="sreality" ng-controller="MainCtrl">
<head>
	<meta charset="utf-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">

	<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
	<title ng:bind-template="{{metaSeo.title}}">Sreality.cz • reality a nemovitosti z celé ČR</title>
	<meta name="description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
	<meta property="og:title"       content="Sreality.cz • reality a nemovitosti z celé ČR">
	<meta property="og:type"        content="website">
	<meta property="og:image"       content="https://www.sreality.cz/img/sreality-logo-og.png">
	<meta property="og:description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
	<meta property="og:url"         content="https://www.sreality.cz/">

	<meta ng-if="metaStatus.value" name="szn:status" content="{{metaStatus.value}}">

	<meta http-equiv="imagetoolbar" content="no">

	<link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico">
	<link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3">
	<link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3">
	<link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3">
	<link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3">
	<link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3">
	<link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3">
	<link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3">
	<link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3">
	<link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3">
	<link rel="icon" type="image/png" sizes="192x192"  href="/img/icons/android-chrome-192x192.png">
	<link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png">
	<link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png">
	<link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png">
	<link rel="manifest" href="/img/icons/android-chrome-manifest.json">
	<meta name="msapplication-TileColor" content="#2b5797">
	<meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png">
	<meta name="msapplication-config" content="/img/icons/browserconfig.xml" />

	<link rel="alternate" type="application/rss+xml" ng-href="{{ rss.url }}" ng-if="rss.url">
	<link ng-repeat="lang in metaSeo.languages" rel="alternate" hreflang="{{lang.code}}" ng-href="{{lang.url}}">

	<link rel="stylesheet" href="/css/all.css?2e96626">

	<!-- Begin Inspectlet Embed Code -->
	<script type="text/javascript" id="inspectletjs">
	window.__insp = window.__insp || [];
	__insp.push(['wid', 821249485]);
	__insp.push(["virtualPage"]);
	(function() {
	function ldinsp(){if(typeof window.__inspld != "undefined") return; window.__inspld = 1; var insp = document.createElement('script'); insp.type = 'text/javascript'; insp.async = true; insp.id = "inspsync"; insp.src = ('https:' == document.location.protocol ? 'https' : 'http') + '://cdn.inspectlet.com/inspectlet.js'; var x = document.getElementsByTagName('script')[0]; x.parentNode.insertBefore(insp, x); };
	setTimeout(ldinsp, 500); document.readyState != "complete" ? (window.attachEvent ? window.attachEvent('onload', ldinsp) : window.addEventListener('load', ldinsp, false)) : ldinsp();
	})();
	</script>
	<!-- End Inspectlet Embed Code -->

	<!--[if lte IE 8]>
		<script>
			document.createElement('popover');
			document.createElement('mortgage');
			document.createElement('vendor');
			document.createElement('hp-signpost');
			document.createElement('category-switcher');
			document.createElement('feedback');
			document.createElement('bottom');
			document.createElement('panorama');
			document.createElement('panorama-prev');
			document.createElement('sphere-viewer');
			document.createElement('sphere-viewer-prev');
			document.createElement('save-filter');
		</script>
    <![endif]-->

	<!-- Statistiky -->
	<script src="https://h.imedia.cz/js/dot-small.js" type="text/javascript"></script>
	<script type="text/javascript">
		(function() {
			try {
				// Při přesměrování na hashbang URL (IE8-9) ztrácíme referrer,
				// který je potřeba pro správné počítání statistik.
				if (window.sessionStorage) { // někdo může mít DOM storage zakázaný
					var l = document.createElement('a');
					l.href = document.referrer;
					var referrerHostname = l.hostname;

					if (window.location.hostname != referrerHostname) {
						window.sessionStorage.setItem('referrer', l.href);
					}
				}

				// Starý android (< 4.0) v kombinaci s angularem špatně pracuje s hashem v URL.
				// Považuje ho za součást query případně path.
				// Na takových zařízech se budeme tvářit, že žádný hash nebyl.
				if (parseInt((/android (\d+)/.exec(window.navigator.userAgent.toLowerCase()) || [])[1], 10) < 4) {
					var hrefWithoutHashbang = window.location.href.replace('/#!', '');
					var hashIndex = hrefWithoutHashbang.indexOf('#');
					if (hashIndex != -1) {
						window.location.replace(hrefWithoutHashbang.substring(0, hashIndex));
					}
				}
			} catch (e) {}
		})();
	</script>

	<!-- API mapy.cz -->
	<script type="text/javascript" src="https://api4.mapy.cz/loader.js"></script>
	<script type="text/javascript">Loader.load(null, {poi: true, pano: true})</script>

	<!-- Login reklama -->
	<script src="https://i.imedia.cz/js/im3.js" type="text/javascript"></script>

	<script src="https://1.im.cz/software/promo/promo-sbrowser.js"></script>

	<!-- Rozkopírování SID cookie -->
	<script src="https://h.imedia.cz/js/sid.js"></script>

	<!-- Login -->
	<script src="https://login.szn.cz/js/api/login.js"></script>
	<script>
		login.cfg({
			serviceId: "sreality"
		});
	</script>

	<!-- KONFIGURACE -->
	<script src="/js/conf/config.js?2e96626"></script>

	<script src="/js/advert.js"></script>
	<script src="/js/all.js?2e96626"></script>

	<script type="text/javascript">
		if (window.DOT) {
			var dotCfg = {
				service: 'sreality'
			};
			if (window.SrealityABTest && window.SrealityABTest.getVariant()) {
				dotCfg.abtest = window.SrealityABTest.getVariant();
			}
			DOT.cfg(dotCfg);
		}
	</script>

	<noscript>
		<meta http-equiv="refresh" content="0;url=?_escaped_fragment_="/>
	</noscript>
	<meta name="fragment" content="!" ng-if="metaSeo.showMetaFragment" />

</head>
<!--[if IE 8]>    <body class="ie8"> <![endif]-->
<!--[if IE 9]>    <body class="notie8 ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<body class="notie8 notie9 lang-{{html.lang}}">
<!--<![endif]-->
	<div loading-line></div>

	<div page-layout>
		<div ng-view></div>
	</div>
</body>
</html>

This is different from what I see when I look at the page in Chrome's developer tools. Part of that code is below (the whole thing doesn't fit here, and uploadtext isn't working for some reason):

<!DOCTYPE html>
<html lang="cs" ng-app="sreality" ng-controller="MainCtrl" class="ng-scope"><head><style type="text/css">@charset "UTF-8";[ng\:cloak],[ng-cloak],[data-ng-cloak],[x-ng-cloak],.ng-cloak,.x-ng-cloak,.ng-hide{display:none !important;}ng\:form{display:block;}.ng-animate-block-transitions{transition:0s all!important;-webkit-transition:0s all!important;}.ng-hide-add-active,.ng-hide-remove{display:block!important;}</style>
	<meta charset="utf-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">

	<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
	<title ng:bind-template="Byty na prodej Brno-město, posledních 30 dní • Sreality.cz" class="ng-binding">Byty na prodej Brno-město, posledních 30 dní • Sreality.cz</title>
	<meta name="description" content="284 realit v nabídce prodej bytů Brno-město s požadavky: posledních 30 dní. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.">
	<meta property="og:title" content="Byty na prodej Brno-město, posledních 30 dní">
	<meta property="og:type" content="website">
	<meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png">
	<meta property="og:description" content="284 realit v nabídce prodej bytů Brno-město s požadavky: posledních 30 dní. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.">
	<meta property="og:url" content="https://www.sreality.cz/hledani/prodej/byty/brno?stari=mesic">

	<!-- ngIf: metaStatus.value --><meta ng-if="metaStatus.value" name="szn:status" content="200" class="ng-scope"><!-- end ngIf: metaStatus.value -->

	<meta http-equiv="imagetoolbar" content="no">

	<link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico">
	<link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3">
	<link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3">
	<link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3">
	<link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3">
	<link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3">
	<link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3">
	<link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3">
	<link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3">
	<link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3">
	<link rel="icon" type="image/png" sizes="192x192" href="/img/icons/android-chrome-192x192.png">
	<link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png">
	<link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png">
	<link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png">
	<link rel="manifest" href="/img/icons/android-chrome-manifest.json">
	<meta name="msapplication-TileColor" content="#2b5797">
	<meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png">
	<meta name="msapplication-config" content="/img/icons/browserconfig.xml">
<!-- ngIf: rss.url --><link rel="alternate" type="application/rss+xml" ng-href="/api/cs/v2/estates/rss?category_main_cb=1&amp;locality_district_id=72&amp;suggested_regionId=-1&amp;suggested_districtId=-1&amp;estate_age=31&amp;locality_region_id=14&amp;category_type_cb=1" ng-if="rss.url" class="ng-scope" href="/api/cs/v2/estates/rss?category_main_cb=1&amp;locality_district_id=72&amp;suggested_regionId=-1&amp;suggested_districtId=-1&amp;estate_age=31&amp;locality_region_id=14&amp;category_type_cb=1"><!-- end ngIf: rss.url -->

I can see from the first HTML, the one that requests.get downloads, that the page runs some scripts, which probably cause the HTML to differ.
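A quick way to check this suspicion is to look for AngularJS template markers such as {{ html.lang }} that are still present in the downloaded text. The sketch below is only an illustration: the function name unrendered_placeholders and the regular expression are assumptions made for this example, and the two sample strings are short fragments assembled from the two HTML dumps quoted above.

```python
import re

# Matches simple AngularJS interpolation markers such as {{ html.lang }}.
# This pattern is an assumption for illustration; it does not cover
# every expression Angular accepts.
PLACEHOLDER = re.compile(r"\{\{\s*[\w.]+\s*\}\}")


def unrendered_placeholders(html):
    """Return the template markers still present in the HTML, in order."""
    return PLACEHOLDER.findall(html)


# Fragment assembled from the raw requests.get response.
raw = (
    '<html lang="{{ html.lang }}" ng-app="sreality">'
    '<body class="notie8 notie9 lang-{{html.lang}}">'
    "<div ng-view></div></body></html>"
)
# The same opening tag as shown by Chrome's developer tools.
rendered = '<html lang="cs" ng-app="sreality" class="ng-scope">'

print(unrendered_placeholders(raw))       # ['{{ html.lang }}', '{{html.lang}}']
print(unrendered_placeholders(rendered))  # []
```

If the list is non-empty, the response is most likely the unrendered template, and the listings only appear after the browser runs the page's scripts.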

I already tried using urllib, but the resulting HTML document was still the same.

Is there a way to download the HTML I see when I open the page in Chrome's developer tools, so I can scrape it?

Recommended answer

If the data on that page is what you are ultimately after, you can get it very easily using Selenium in combination with BeautifulSoup. The following script prints the links to all the apartments.

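from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium drives a real browser, so the page's scripts run before the HTML is read.
driver = webdriver.Chrome()
try:
    driver.get("https://www.sreality.cz/hledani/prodej/byty/brno?stari=mesic")
    soup = BeautifulSoup(driver.page_source, "html.parser")
finally:
    # Close the browser even if loading or parsing fails.
    driver.quit()

# Each listing sits in a .text-wrap element; its .title anchor holds a relative link.
for item in soup.select(".text-wrap"):
    anchor = item.select_one(".title")
    if anchor is not None and anchor.get("href"):
        print("https://www.sreality.cz" + anchor.get("href"))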
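Because the listings are filled in by JavaScript, reading the page immediately after driver.get() can sometimes happen before they exist. The variant below is a sketch, not a tested recipe for this site. It assumes Selenium 4 and waits explicitly until at least one listing title is present. The helper names to_absolute and fetch_listing_links, the 15-second timeout, the combined selector ".text-wrap .title" (built from the two selectors used above), and the sample path in the demo line are all assumptions made for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.sreality.cz"
SEARCH_URL = BASE_URL + "/hledani/prodej/byty/brno?stari=mesic"


def to_absolute(hrefs):
    """Turn hrefs into absolute URLs, skipping missing values."""
    # urljoin leaves already-absolute URLs unchanged.
    return [urljoin(BASE_URL, href) for href in hrefs if href]


def fetch_listing_links(timeout=15):
    """Load the search page in Chrome and return the listing links."""
    # Imported here so to_absolute() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    selector = ".text-wrap .title"  # assumed from the selectors used above
    driver = webdriver.Chrome()
    try:
        driver.get(SEARCH_URL)
        # Block until at least one listing title is in the DOM; raises
        # TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        anchors = driver.find_elements(By.CSS_SELECTOR, selector)
        return to_absolute(a.get_attribute("href") for a in anchors)
    finally:
        driver.quit()


# Demo of the pure helper with a made-up path; no browser is needed.
print(to_absolute(["/detail/example", None]))
```

Calling fetch_listing_links() opens Chrome, so it needs a working Chrome installation and driver. The wait replaces guessing a fixed sleep duration with a condition on the content you actually need.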
