How to get the body content from a PHP-generated HTML page?

Views: 461 · Published: 2018/6/21 13:33:37 · Tags: java php html

Question

I am trying to get the content of an HTML page, using this code:
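
    // Requires java.io.ByteArrayOutputStream, java.io.InputStream,
    // java.net.URL and java.net.URLConnection.
    String malSearch = "http://myanimelist.net/anime.php?letter=" + firstLetter;
    URL url = new URL(malSearch);
    URLConnection con = url.openConnection();
    InputStream in = con.getInputStream();
    // Note: getContentEncoding() returns the Content-Encoding header (e.g. "gzip"),
    // not the character set of the body.
    String encoding = con.getContentEncoding();
    encoding = encoding == null ? "UTF-8" : encoding;
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    int len = 0;
    // Read the whole response into memory.
    while ((len = in.read(buf)) != -1) {
        baos.write(buf, 0, len);
    }
    String body = new String(baos.toByteArray(), encoding);
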
It works fine, but it doesn't give me what I really want. It gives me this:

    <html>
    <head>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    <meta name="format-detection" content="telephone=no">
    <meta name="viewport" content="initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    </head>
    <body style="margin:0px">
    <iframe src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=6-122029399-0 0NNN RT(1404149034204 2) q(0 -1 -1 -1) r(0 -1) B12(4,315,0) U1&incident_id=124001330081285077-564449081699338326&edet=12&cinfo=4ee46646c753833e04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 124001330081285077-564449081699338326</iframe>
    </body>
    </html>

when it should give me the whole page (approximately 800 lines).

I think it's due to the fact this is a website using PHP, but I'm not really sure. Can someone tell me how I could get the whole HTML content?

Here's the page I'm trying to get the content from: http://myanimelist.net/anime.php?letter=A

Solution

This site uses a service called Incapsula. The website admins configured Incapsula to prevent bots from accessing its content.

I suggest you contact the admins and ask to be whitelisted. Trying to bypass the system will likely get you banned and blacklisted.
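
To tell this situation apart from a genuine page in code, one option is to check the downloaded body for the markers visible in the blocked response quoted in the question. The sketch below is only an illustration: the class name `IncapsulaCheck`, the method `isBlockPage`, and the choice of marker strings are assumptions based on that one response, not documented Incapsula behaviour. It only detects the block page; it does not bypass it.

```java
import java.util.Locale;

/** Heuristic check for an Incapsula block page (hypothetical helper). */
final class IncapsulaCheck {

    // Markers taken from the blocked response in the question;
    // other block pages may look different (assumption).
    private static final String RESOURCE_MARKER = "_incapsula_resource";
    private static final String INCIDENT_MARKER = "incapsula incident id";

    private IncapsulaCheck() {}

    /** Returns true if the body looks like the block page rather than real content. */
    static boolean isBlockPage(String body) {
        if (body == null) {
            return false;
        }
        // Compare case-insensitively so header casing differences do not matter.
        String lower = body.toLowerCase(Locale.ROOT);
        return lower.contains(RESOURCE_MARKER) || lower.contains(INCIDENT_MARKER);
    }

    public static void main(String[] args) {
        String blocked = "<html><body><iframe src=\"/_Incapsula_Resource?CWUDNSAI=9\">"
                + "Request unsuccessful. Incapsula incident ID: 124001330081285077-564449081699338326"
                + "</iframe></body></html>";
        String normal = "<html><body><h1>Anime list</h1></body></html>";

        System.out.println(isBlockPage(blocked)); // true
        System.out.println(isBlockPage(normal));  // false
    }
}
```

Calling `IncapsulaCheck.isBlockPage(body)` right after the download lets the program report a clear "blocked by the site" error instead of trying to parse a page that is not there.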
