How can I make LWP::UserAgent look like another browser?

Question

This is my first post on SO, so be gentle. I'm not even sure if this belongs here, but here goes.

I want to access some information on one of my personal accounts. The website is poorly written and requires me to manually enter each date I want information for, which is a real pain. I have been looking for an excuse to learn more Perl, so I thought this would be a great opportunity. My plan was to write a Perl script that would log in to my account and query the information for me. However, I got stuck pretty quickly.

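use LWP::UserAgent;
use HTTP::Request::Common qw(GET);   # provides GET
use URI::URL;                        # provides url()
my $ua = LWP::UserAgent->new;
my $url = url 'https://account.web.site';
my $res = $ua->request(GET $url);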

The resulting web page basically says that my web browser is not supported. I tried a number of different values for

$ua->agent("");

but nothing seems to work. Googling around suggests this method, but it also says that Perl is used for malicious purposes on web sites. Do web sites block this method? Is what I am trying to do even possible? Is there a different language that would be more appropriate? Is what I'm trying to do even legal, or even a good idea? Maybe I should just abandon my efforts.

Note that to prevent giving away any private information, the code I wrote here is not the exact code I am using. I hope that was pretty obvious, though.

In Firefox, I disabled JavaScript and CSS. I logged in just fine, without the "Incompatible browser" error, so it doesn't seem to be a JavaScript issue.

Answer

Getting a different webpage with scraping

We have to make one assumption: the web server will return the same output if given the same input. With this assumption, we inescapably come to the conclusion that we're not giving it the same input. There are two browsers, or HTTP clients, in this scenario: the one that is giving you the result you want (e.g., Firefox, IE, Chrome, or Safari), and the one that is not (e.g., LWP, wget, or cURL).

Before continuing, first make sure the User-Agent strings are the same. You can do this by browsing to whatsmyuseragent.com and setting the User-Agent header of the other client to whatever that website returns. You can also use Firefox's Web Developer Toolbar to disable CSS, JavaScript, Java, and meta-redirects; this will help you track down the problem by ruling out the really simple stuff.
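As a minimal sketch of that first step in LWP, the snippet below sets the agent string and checks what would be sent, without contacting any server. The browser string is a made-up placeholder (substitute whatever your own browser reports), and the URL is just the stand-in from the question.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical browser string; replace it with the one your browser reports.
my $browser_ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0';

my $ua = LWP::UserAgent->new;
print "default: ", $ua->agent, "\n";   # LWP's own libwww-perl identifier

$ua->agent($browser_ua);

# prepare_request() applies the agent's default headers to a request
# object without sending it, so we can inspect the result offline.
my $req = $ua->prepare_request(
    HTTP::Request->new(GET => 'https://account.web.site/')
);
print "sent as: ", $req->header('User-Agent'), "\n";
```

If the site still rejects the script after this, the User-Agent alone was not what it was checking, and the rest of the request needs the same treatment.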

Now, with Firefox you can use Firebug to analyze the request that is sent. You can do this under the Net tab in Firebug. Other browsers should have tools that do what Firebug does for Firefox; however, if you don't know the tool in question, you can still use tshark or wireshark as described below. It is important to note that tshark and wireshark will always be more accurate because they work at a lower level, which, at least in my experience, leaves less room for error. For example, you'll see things like meta-redirects the browser is doing, which Firebug can sometimes lose track of.

After you understand the first web request that works, do your best to make the second web request match it. By this I mean setting the request headers and other request elements properly. If this still doesn't work, you have to know what the second browser is doing to see what is wrong.

In order to troubleshoot this, we must have a total understanding of the requests from both browsers. The second browser is usually trickier: these are often libraries and non-interactive command-line browsers that lack the ability to check the request. Even if they have the ability to dump the request, you might still opt to simply capture it anyway. To do this I suggest the wireshark and tshark suite. You should be warned right away that these operate below the browser, so by default you'll see the actual network (IP) packets and data-link frames. You can filter out what you need specifically with a command like this.

sudo tshark -i <interface> -f tcp -Y "http.request" -V |
perl -ne'print if /^Hypertext/../^Frame/'

This will capture all of the TCP packets, display-filter only the http.requests, then use perl to keep only the application-layer HTTP content. (Older tshark releases took the display filter with -R rather than -Y.) You might want to add to the display filter to only grab a single web server too: -Y "http.request and http.host == ''"
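If you'd like a quick look from inside LWP before reaching for a packet capture, a request_send handler can print each request as the user agent hands it off. This is only a sketch: the agent string is a placeholder, and the handler here returns a canned response so the example runs without touching the network. In real use you would end the handler with a bare return so the request goes out normally.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/5.0 (hypothetical browser string)');

my @seen;   # text of every request LWP was about to send

$ua->add_handler(request_send => sub {
    my ($request, $agent, $handler) = @_;
    push @seen, $request->as_string;
    print $request->as_string;
    # Returning a response object stops LWP from contacting the server.
    # Replace this with a bare "return;" to let the real request proceed.
    return HTTP::Response->new(200, 'OK (not sent)');
});

my $res = $ua->get('https://account.web.site/');
print "status: ", $res->status_line, "\n";
```

Keep in mind this shows the request as LWP::UserAgent built it. The protocol layer may still add headers of its own afterwards, which is one more reason the tshark capture is the more trustworthy view.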

You're going to want to check everything to see whether the two requests line up: cookies, GET URL, User-Agent, and so on. Make sure the site doesn't do something goofy.
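Comparing the two captures by eye gets tedious, so here is a rough sketch of doing it with two hashes. All header values are invented for illustration, header_diff is a hypothetical helper rather than part of any library, and it assumes the header names are spelled identically in both sets.

```perl
use strict;
use warnings;

# Hypothetical header sets, copied by hand from the two captures.
my %browser = (
    'User-Agent'      => 'Mozilla/5.0 (hypothetical browser string)',
    'Accept'          => 'text/html,application/xhtml+xml',
    'Accept-Language' => 'en-US,en;q=0.5',
    'Cookie'          => 'session=abc123',
);
my %script = (
    'User-Agent' => 'libwww-perl/6.00',
    'Accept'     => 'text/html,application/xhtml+xml',
);

# List the headers from the first set that are absent from, or have a
# different value in, the second set.
sub header_diff {
    my ($want, $have) = @_;
    my (@missing, @different);
    for my $name (sort keys %$want) {
        if (!exists $have->{$name}) {
            push @missing, $name;
        }
        elsif ($have->{$name} ne $want->{$name}) {
            push @different, $name;
        }
    }
    return (\@missing, \@different);
}

my ($missing, $different) = header_diff(\%browser, \%script);
print "missing:   @$missing\n";
print "different: @$different\n";
```

Whatever shows up as missing or different is the list of things to fix in the script's request.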

Updated Jan 23 2010: Based on the new information, I would suggest setting Accept, Accept-Language, Accept-Charset, and Accept-Encoding. You can do that through $ua->default_headers(). If you need a lot more functionality out of your user agent, you can always subclass it. I took this approach for my GData API; you can find my example of a UserAgent subclass on GitHub.
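A sketch of the default_headers() suggestion might look like the following. The header values are placeholders, not the ones this particular site expects, so copy the real ones from your browser's request. Nothing is sent over the network here.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;

# default_headers() returns the HTTP::Headers object applied to every
# request this agent makes. The values below are hypothetical.
$ua->default_headers->header(
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' => 'en-US,en;q=0.5',
    'Accept-Charset'  => 'utf-8',
    'Accept-Encoding' => 'gzip',
);

# Build a request without sending it and show the headers it would carry.
my $req = $ua->prepare_request(
    HTTP::Request->new(GET => 'https://account.web.site/')
);
print $req->headers->as_string;
```

One thing to watch: once you advertise gzip in Accept-Encoding, the server may compress the body, so read the result with $res->decoded_content rather than $res->content.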
