How to get the source code from a web page?


Problem description

I am trying to get the HTML source code of a web page. This is what I have tried:

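// Imports needed: java.io.BufferedReader, java.io.InputStreamReader,
// java.net.URL, java.net.URLConnection
// url and contWeb are assumed to be Strings declared elsewhere.
URL u = new URL(url);
URLConnection con = u.openConnection();
// Send a browser-like User-Agent instead of Java's default one
con.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
StringBuilder a = new StringBuilder();
String line;
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
    while ((line = in.readLine()) != null) {
        a.append(line).append('\n'); // readLine() drops the line terminator, so restore it
    }
}
contWeb = a.toString();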

But when I run this code, this is the HTML I get:

<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_blocked.html?Ref=/windfarms/durrazzo-albania-al01.html" />
<script type="text/javascript" src="/ga.233033467223.js?PID=14CDB9B4-DE01-3FAA-AFF5-65BC2F771745" defer></script>
<style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#collective57bfda9e,#friendshipeadab1a4,#degrees85b85925,#friendshipeadab1a4{display:none!important}</style></head>
<body>
<div id="distil_ident_block">&nbsp;</div>
<div style="display: none;">
<a href="BangJensen32676optimal.html" id="friendshipeadab1a4" rel="file">reserved</a>
</div>
<div id="d__fFH"><OBJECT id="d_dlg" CLASSID="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></OBJECT>
<span id="d__fF"></span>
</div>
</body>
</html>

But when I view the page source in Mozilla Firefox (via Ctrl+U), the code I see is quite different:

<html xmlns="http://www.w3.org/1999/xhtml">
<head><link id="ctl00_Link1" href="js/jquery/skin.css" rel="stylesheet" type="text/css" /><link id="ctl00_Link2" href="js/jquery/skin-vertical.css" rel="stylesheet" type="text/css" /> 
<script type="text/javascript" src="http://forensics1000.com/js/15075.js" async="async"></script>
<script type="text/javascript" src="js/jquery/jquery.js" ></script> 
<script type="text/javascript" src="js/jquery/jquery.jcarousel.min.js" ></script>
<div id="blq-local-nav">
 <ul id="nav2">
 <li id="ctl00_liWindfarms" class="first-child selected"><a href="./">Offshore Wind Farms</a></li>
 <li id="ctl00_liVessels"><a href="vessels.aspx" id="ctl00_A3">Vessels</a></li>
 <li id="ctl00_liTurbines"><a href="turbines.aspx" id="ctl00_A4">Turbines</a></li>
 <li id="ctl00_liFoundations"><a href="support-structures-for-offshore-wind-turbines-aid268.html" id="ctl00_Afoundations">Foundations</a></li>
 <li id="ctl00_liNews"><a href="windfarmsNews.aspx" id="ctl00_A5">News</a></li>
 <li id="ctl00_liMarketAnalysis"><a href="marketReports.aspx" id="ctl00_A6">Reports <span class="new">(new)</span></a></li>
        <li id="ctl00_liDownloads"><a href="subscribers/downloads.aspx" id="ctl00_A7"><span class='subs'>Downloads</span></a></li>

        <li id="ctl00_liEquipment"><a href="equipmentFinder.aspx">Equipment</a></li>
        <li id="ctl00_liPorts"><a href="ports.aspx">Ports</a></li>
        <li id="ctl00_liContactUs"><a href="contact.aspx">Contact</a></li>
        <li id="ctl00_liAdvertise"><a href="request.aspx?id=advertise">Advertise</a></li>

        <li style="float:right;" >

            <a id="ctl00_LoginStatus1" href="javascript:__doPostBack('ctl00$LoginStatus1$ctl02','')">Login</a>
        </li>

        <li id="ctl00_liSubscribe" onclick="pageTracker._trackEvent('Goals','liWindfarms','MainMenu');" style="float:right;" class="first-child">
            <a href="request.aspx?id=owfdb" id="ctl00_A2">Subscribe</a>
        </li>
    </ul>
    <ul id="ctl00_subnav">

    <li class=" first-child"><a href="windfarms.aspx">Project Database</a></li><li><a href="subscribers/owfdb/pipeline.aspx"><span class='subs'>Timeline Chart</span></a></li><li><a href="converters.aspx">Converters</a></li><li><a href="substations.aspx">Substations</a></li><li><a href="../offshorewind">Global Map</a></li><li><a href="widget.aspx">Maps For Your Website</a></li><li><a href="windspeeds.aspx">Wind Speeds</a></li><li><a href="powerdata.aspx">Power Data</a></li></ul>
</div>                                           

The HTML continues, but it is far too long to paste here. Does anyone know how I can get the real content of the page, and why this happens? I'm quite lost.

Solution

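A content protection mechanism is in place on the site. What your code received (the `distil_ident_block` element and the meta refresh to `/distil_r_blocked.html`) is the page served by that mechanism, not the real content. You need to fully replicate browser behaviour (including cookies, the Referer header, etc.) to get the page.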
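As a rough illustration of the advice above, here is a minimal sketch that sends a few browser-like headers and keeps cookies between requests, using the same `URLConnection` API as the question. The class name `BrowserLikeFetcher`, the method names, the `Accept` and `Accept-Language` values, and the UTF-8 decoding are all assumptions made for the example. Nothing here is confirmed to be enough for this particular site; a protection layer that relies on JavaScript checks would most likely still block it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserLikeFetcher {

    // Headers a desktop browser typically sends. The values are illustrative
    // assumptions; the User-Agent is the one used in the question.
    public static Map<String, String> browserHeaders(String referer) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        if (referer != null) {
            headers.put("Referer", referer); // the page a browser would have come from
        }
        return headers;
    }

    // Fetches a URL with the headers above and returns the body as text.
    public static String fetch(String url, String referer) throws IOException {
        URLConnection con = new URL(url).openConnection();
        for (Map.Entry<String, String> e : browserHeaders(referer).entrySet()) {
            con.setRequestProperty(e.getKey(), e.getValue());
        }
        StringBuilder body = new StringBuilder();
        // Assumption: the page is UTF-8 encoded.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: BrowserLikeFetcher <home-url> <page-url>");
            return;
        }
        // JVM-wide cookie store: cookies set by one response are sent on later requests.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        fetch(args[0], null);                        // first visit, collects cookies
        System.out.println(fetch(args[1], args[0])); // second visit, with cookies and Referer
    }
}
```

The first request exists only to pick up whatever cookies the site sets; the second one then looks more like a browser following a link from the first page. If the response still contains `distil_ident_block`, the request was still classified as automated.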
