Using C# HttpClient to log in to a website and scrape information from another page

Views: 683 | Published: 2020/10/25 23:13:25 | Tags: c# web-scraping dotnet-httpclient web-inspector


Question

I am trying to use C# and Chrome Web Inspector to log in to http://www.morningstar.com and retrieve some information from the page http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US.

I do not quite understand the mental process for interpreting the information in Web Inspector in order to simulate a login, keep the session alive, and navigate to the next page to collect information.

Can someone explain this, or point me to a resource?

For now, I only have some code that gets the content of the home page and the login page:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public class Morningstar
{
    public static async Task Run()
    {
        var url = "http://www.morningstar.com/";

        // Let the handler negotiate and decompress gzip/deflate,
        // so every response can be read as plain text.
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
        var httpClient = new HttpClient(handler);

        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");

        // Fetch the home page.
        var response = await httpClient.GetAsync(new Uri(url));
        response.EnsureSuccessStatusCode();
        var homeHtml = await response.Content.ReadAsStringAsync();
        //Console.WriteLine(homeHtml);

        // Fetch the login page.
        var loginURL = "https://members.morningstar.com/memberservice/login.aspx";
        response = await httpClient.GetAsync(new Uri(loginURL));
        response.EnsureSuccessStatusCode();
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}

Edit: In the end, on Muhammed's advice, I used the following piece of code:

        ScrapingBrowser browser = new ScrapingBrowser();

        //set UseDefaultCookiesParser as false if a website returns invalid cookies format
        //browser.UseDefaultCookiesParser = false;

        WebPage homePage = browser.NavigateToPage(new Uri("https://members.morningstar.com/memberservice/login.aspx"));

        PageWebForm form = homePage.FindFormById("memberLoginForm");
        form["email_textbox"] = "example@example.com";
        form["pwd_textbox"] = "password";
        form["go_button.x"] = "57";
        form["go_button.y"] = "22";
        form.Method = HttpVerb.Post;
        WebPage resultsPage = form.Submit();

Recommended answer

You should simulate the login process of the website. The easiest way to do this is to inspect the site's traffic with an HTTP debugger (for example, Fiddler).

Here is the login request of the website:

POST https://members.morningstar.com/memberservice/login.aspx?CustId=&CType=&CName=&RememberMe=true&CookieTime= HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Referer: https://members.morningstar.com/memberservice/login.aspx
** omitted **
Cookie: cookies=true; TestCookieExist=Exist; fp=001140581745182496; __utma=172984700.91600904.1405817457.1405817457.1405817457.1; __utmb=172984700.8.10.1405817457; __utmz=172984700.1405817457.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=172984700; ASP.NET_SessionId=b5bpepm3pftgoz55to3ql4me

email_textbox=test@email.com&pwd_textbox=password&remember=on&email_textbox2=&go_button.x=36&go_button.y=16&__LASTFOCUS=&__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=omitted&__EVENTVALIDATION=omitted

When you inspect this, you'll see some cookies and form fields like "__VIEWSTATE". You'll need the actual values of these fields to log in. You can use the following steps:

1. Make a request to the login page and scrape fields like "__LASTFOCUS", "__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE" and "__EVENTVALIDATION", along with the cookies.
2. Create a new POST request to the same page, reusing the CookieContainer from the previous request; build the post string from the scraped fields, the username and the password. Post it with the MIME type application/x-www-form-urlencoded.
3. If the login succeeds, use the cookies for further requests to stay logged in.
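
The three steps can be sketched with HttpClient roughly as follows. This is a minimal sketch, not a tested login for this site: the class and method names (LoginSketch, ExtractHiddenFields, LoginAndFetchAsync) are made up for illustration, the form field names and button coordinates are copied from the captured request above, and the regular expression assumes the hidden inputs are rendered with type, name and value attributes in that order.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class LoginSketch
{
    // Assumption: hidden inputs look like
    // <input type="hidden" name="..." id="..." value="..." />
    // with the attributes in that order. An HTML parser is more robust.
    private static readonly Regex HiddenInput = new Regex(
        "<input[^>]*type=\"hidden\"[^>]*name=\"(?<name>[^\"]*)\"[^>]*value=\"(?<value>[^\"]*)\"",
        RegexOptions.IgnoreCase);

    // Step 1 (parsing half): collect hidden fields such as __VIEWSTATE.
    public static Dictionary<string, string> ExtractHiddenFields(string html)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match m in HiddenInput.Matches(html))
        {
            // Attribute values are HTML-encoded in the page; decode before re-posting.
            fields[m.Groups["name"].Value] = WebUtility.HtmlDecode(m.Groups["value"].Value);
        }
        return fields;
    }

    public static async Task<string> LoginAndFetchAsync(string email, string password)
    {
        var loginUrl = "https://members.morningstar.com/memberservice/login.aspx";
        var dataUrl = "http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US";

        // One CookieContainer shared by all requests keeps the session cookies.
        var handler = new HttpClientHandler
        {
            CookieContainer = new CookieContainer(),
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
        using (var client = new HttpClient(handler))
        {
            // Step 1: GET the login page; cookies go into the container.
            var loginHtml = await client.GetStringAsync(loginUrl);
            var form = ExtractHiddenFields(loginHtml);

            // Step 2: add the credentials and POST as application/x-www-form-urlencoded.
            form["email_textbox"] = email;
            form["pwd_textbox"] = password;
            form["go_button.x"] = "36";
            form["go_button.y"] = "16";
            var response = await client.PostAsync(loginUrl, new FormUrlEncodedContent(form));
            response.EnsureSuccessStatusCode();

            // Step 3: the same client sends the session cookies with later requests.
            return await client.GetStringAsync(dataUrl);
        }
    }
}
```

A successful status code on the POST does not by itself prove the login worked; checking the returned HTML for a logged-in marker, or inspecting handler.CookieContainer, is a reasonable extra check.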

Note: You can use HtmlAgilityPack or ScrapySharp to scrape the HTML. ScrapySharp provides easy-to-use tools for posting forms and browsing websites.
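
As an illustration of the HtmlAgilityPack route, the hidden form fields of a login page could be collected along these lines. This is a sketch that assumes the HtmlAgilityPack NuGet package is referenced; the class name HiddenFieldParser and its Parse method are hypothetical helpers, not part of the library.

```csharp
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class HiddenFieldParser
{
    // Returns name/value pairs for every <input type="hidden"> in the page.
    public static Dictionary<string, string> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new Dictionary<string, string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var inputs = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
        if (inputs == null)
        {
            return fields;
        }

        foreach (var input in inputs)
        {
            var name = input.GetAttributeValue("name", "");
            if (name.Length == 0)
            {
                continue; // an input without a name is never submitted
            }
            // Decode HTML entities so the value can be re-posted as-is.
            fields[name] = WebUtility.HtmlDecode(input.GetAttributeValue("value", ""));
        }
        return fields;
    }
}
```

The resulting dictionary can be extended with the username and password fields and passed to FormUrlEncodedContent for the login POST.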
