Using C# HttpClient to log in to a website and scrape information from another page


Question

I am trying to use C# and the Chrome Web Inspector to log in to http://www.morningstar.com and retrieve some information from the page http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US.

I do not quite understand the mental process needed to interpret the information from the Web Inspector in order to simulate a login, keep the session alive, and navigate to the next page to collect information.

Can someone explain this or point me to a resource?

For now, I only have some code that gets the content of the home page and the login page:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public class Morningstar
{
    // Returns Task rather than void so callers can await it and observe exceptions.
    public static async Task Run()
    {
        var url = "http://www.morningstar.com/";

        // Let the handler negotiate and decompress gzip/deflate responses,
        // rather than assuming every response body is gzip-compressed.
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
        var httpClient = new HttpClient(handler);

        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");

        // Fetch the home page.
        var response = await httpClient.GetAsync(new Uri(url));
        response.EnsureSuccessStatusCode();
        var homeHtml = await response.Content.ReadAsStringAsync();
        //Console.WriteLine(homeHtml);

        // Fetch the login page.
        var loginURL = "https://members.morningstar.com/memberservice/login.aspx";
        response = await httpClient.GetAsync(new Uri(loginURL));
        response.EnsureSuccessStatusCode();
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}

Edit: In the end, on Muhammed's advice, I used the following code:

        ScrapingBrowser browser = new ScrapingBrowser();

        //set UseDefaultCookiesParser as false if a website returns invalid cookies format
        //browser.UseDefaultCookiesParser = false;

        WebPage homePage = browser.NavigateToPage(new Uri("https://members.morningstar.com/memberservice/login.aspx"));

        PageWebForm form = homePage.FindFormById("memberLoginForm");
        form["email_textbox"] = "example@example.com";
        form["pwd_textbox"] = "password";
        form["go_button.x"] = "57";
        form["go_button.y"] = "22";
        form.Method = HttpVerb.Post;
        WebPage resultsPage = form.Submit();


Recommended answer

You should simulate the website's login process. The easiest way to do this is to inspect the site's traffic with an HTTP debugger (for example, Fiddler).

Here is the site's login request:

POST https://members.morningstar.com/memberservice/login.aspx?CustId=&CType=&CName=&RememberMe=true&CookieTime= HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Referer: https://members.morningstar.com/memberservice/login.aspx
** omitted **
Cookie: cookies=true; TestCookieExist=Exist; fp=001140581745182496; __utma=172984700.91600904.1405817457.1405817457.1405817457.1; __utmb=172984700.8.10.1405817457; __utmz=172984700.1405817457.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=172984700; ASP.NET_SessionId=b5bpepm3pftgoz55to3ql4me

email_textbox=test@email.com&pwd_textbox=password&remember=on&email_textbox2=&go_button.x=36&go_button.y=16&__LASTFOCUS=&__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=omitted&__EVENTVALIDATION=omitted
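
The body of the captured request is an ordinary URL-encoded form. As a minimal sketch, assuming only a few of the fields for brevity, `FormUrlEncodedContent` from `System.Net.Http` can produce this encoding from key/value pairs. The class name `FormBodySketch` and the method `BuildBodyAsync` are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public static class FormBodySketch
{
    // Builds a URL-encoded body from a subset of the fields seen in the
    // captured request. The hidden ASP.NET fields are left out here.
    public static Task<string> BuildBodyAsync(string email, string password)
    {
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("email_textbox", email),
            new KeyValuePair<string, string>("pwd_textbox", password),
            new KeyValuePair<string, string>("go_button.x", "36"),
        };

        // FormUrlEncodedContent percent-encodes reserved characters and sets
        // the Content-Type to application/x-www-form-urlencoded.
        var content = new FormUrlEncodedContent(fields);
        return content.ReadAsStringAsync();
    }
}
```

Reserved characters are escaped on the wire, so the `@` in the email address is sent as `%40` even though the capture displays it unescaped.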

When you inspect this request, you'll see some cookies and form fields such as "__VIEWSTATE". You'll need the actual values of these fields to log in. You can use the following steps:


  1. Make a GET request to the login page and scrape the cookies and the hidden fields "__LASTFOCUS", "__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE" and "__EVENTVALIDATION".
  2. Create a new POST request to the same page, reusing the CookieContainer from the previous request. Build the POST body from the scraped fields, the username and the password, and send it with the MIME type application/x-www-form-urlencoded.
  3. If the login succeeds, reuse the same cookies for further requests to stay logged in.
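
The three steps can be sketched with HttpClient roughly as follows. This is a minimal sketch under stated assumptions, not a verified login for this site: the `LoginSketch` class, its `ExtractHiddenFields` helper and the regular expression are hypothetical illustrations, the regex assumes that `type`, `name` and `value` appear in that order inside each hidden input, and the form field names and the `go_button` coordinates are copied from the captured request.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class LoginSketch
{
    // Hypothetical helper: collects name/value pairs from hidden inputs.
    // Assumes the attribute order type, name, value within each tag.
    public static Dictionary<string, string> ExtractHiddenFields(string html)
    {
        var fields = new Dictionary<string, string>();
        var pattern = new Regex(
            "<input[^>]*type=\"hidden\"[^>]*name=\"([^\"]*)\"[^>]*value=\"([^\"]*)\"",
            RegexOptions.IgnoreCase);
        foreach (Match m in pattern.Matches(html))
        {
            fields[m.Groups[1].Value] = WebUtility.HtmlDecode(m.Groups[2].Value);
        }
        return fields;
    }

    public static async Task<string> LoginAndFetchAsync(string email, string password)
    {
        var loginUrl = "https://members.morningstar.com/memberservice/login.aspx";
        var dataUrl = "http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US";

        // One handler and one CookieContainer for the whole session, so
        // cookies set by earlier responses are sent on later requests.
        var handler = new HttpClientHandler
        {
            CookieContainer = new CookieContainer(),
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
        using (var client = new HttpClient(handler))
        {
            // Step 1: GET the login page and scrape the hidden fields.
            var loginPage = await client.GetStringAsync(loginUrl);
            var form = ExtractHiddenFields(loginPage);

            // Step 2: add the credentials and POST the form back to the same page.
            form["email_textbox"] = email;
            form["pwd_textbox"] = password;
            form["go_button.x"] = "36";
            form["go_button.y"] = "16";
            var response = await client.PostAsync(loginUrl, new FormUrlEncodedContent(form));
            response.EnsureSuccessStatusCode();

            // Step 3: reuse the same client (and cookies) for the data page.
            return await client.GetStringAsync(dataUrl);
        }
    }
}
```

A successful status code does not by itself prove that the login worked, since a login page may return 200 along with an error message, so it is worth checking the returned HTML for a signed-in marker before relying on the result.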

Note: You can use HtmlAgilityPack or ScrapySharp to scrape the HTML. ScrapySharp provides easy-to-use tools for posting forms and browsing websites.
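
As a rough illustration of the HtmlAgilityPack route, the sketch below collects the hidden form fields from a page's HTML with an XPath query. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `HiddenFieldParser` and its `Parse` method are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HiddenFieldParser
{
    // Returns the name/value pairs of all hidden inputs in the given HTML.
    public static Dictionary<string, string> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new Dictionary<string, string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
        if (nodes == null)
        {
            return fields;
        }

        foreach (var node in nodes)
        {
            var name = node.GetAttributeValue("name", "");
            if (name.Length > 0)
            {
                // Decode HTML entities so the raw value can be posted back.
                fields[name] = HtmlEntity.DeEntitize(node.GetAttributeValue("value", ""));
            }
        }
        return fields;
    }
}
```

The resulting dictionary can be extended with the username and password and passed to `FormUrlEncodedContent` for the login POST.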
