C++ Get HTML Source

Question

I would like to know how I can download a website's HTML source into a string, without using LibCurl. I have searched online for examples on using Wininet.

Below is the example code I used with WinINet. How would I do the same using Winsock?

#include "stdafx.h"
#include <windows.h>
#include <wininet.h>
#include <iostream>
#include <string>
#include <stdio.h>
#include <stdlib.h>
using namespace std;

#pragma comment ( lib, "Wininet.lib" )

int main()
{
    HINTERNET hInternet = InternetOpenA("InetURL/1.0", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);

    HINTERNET hConnection = InternetConnectA(hInternet, "google.com", 80, " ", " ", INTERNET_SERVICE_HTTP, 0, 0);

    HINTERNET hData = HttpOpenRequestA(hConnection, "GET", "/", NULL, NULL, NULL, INTERNET_FLAG_KEEP_CONNECTION, 0);

    char buf[2048];
    string lol;
    HttpSendRequestA(hData, NULL, 0, NULL, 0);

    DWORD bytesRead = 0;
    DWORD totalBytesRead = 0;
    // http://msdn.microsoft.com/en-us/library/aa385103(VS.85).aspx
    // To ensure all data is retrieved, an application must continue to call the
    // InternetReadFile function until the function returns TRUE and the
    // lpdwNumberOfBytesRead parameter equals zero. 
    while (InternetReadFile(hData, buf, 2000, &bytesRead) && bytesRead != 0)
    {
        buf[bytesRead] = 0;         // null-terminate so puts() can print this chunk

        puts(buf);                  // print it to the screen
        lol.append(buf, bytesRead); // append exactly the bytes read, even if they contain '\0'

        printf("%lu bytes read\n", bytesRead); // DWORD is an unsigned long

        totalBytesRead += bytesRead;
    }

    printf("\n\n END -- %lu bytes read\n", bytesRead);
    printf("\n\n END -- %lu TOTAL bytes read\n", totalBytesRead);

    InternetCloseHandle(hData);
    InternetCloseHandle(hConnection);
    InternetCloseHandle(hInternet);

    cout << "\nThe beginning." << endl << endl << endl;

    cout << lol << endl;


    system("PAUSE");
}

This Winsock example works for sites without additional paths. How would I grab the HTML of a page that has a path, such as www.website.com/page?

#include "stdafx.h"
#include <iostream>
#include <winsock2.h>
#include <string>
#include <fstream>
using namespace std;


string get_source()
{
    WSADATA WSAData;
    WSAStartup(MAKEWORD(2, 0), &WSAData);

    SOCKET sock;
    SOCKADDR_IN sin;

    char buffer[1024];

    ////////////////This is portion that is confusing me//////////////////////////////////////////////////
    string srequete = "GET /id/AeroNX/ HTTP/1.1\r\n";
    srequete += "Host: steamcommunity.com\r\n";
    srequete += "Connection: close\r\n";
    srequete += "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n";
    srequete += "Accept-Language: fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3\r\n";
    srequete += "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n";
    srequete += "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3\r\n";
    srequete += "Referer: http://pozzyx.net/\r\n";
    srequete += "\r\n";
    ///////////////////////////////////////////////////////////////////////////////////////////////////////

    size_t requete_taille = srequete.size() + 1;

    char crequete[5000];
    strncpy(crequete, srequete.c_str(), requete_taille);

    int i = 0;
    string source = "";

    sock = socket(AF_INET, SOCK_STREAM, 0);

    sin.sin_addr.s_addr = inet_addr("63.228.223.103"); // epguides.com //why wont it work for 72.233.89.200 (whatismyip.com)
    sin.sin_family = AF_INET;
    sin.sin_port = htons(80); // HTTP port

    connect(sock, (SOCKADDR *)&sin, sizeof(sin)); // connect to the web server
    send(sock, crequete, strlen(crequete), 0); // why do we send the string??


    do
    {
        i = recv(sock, buffer, sizeof(buffer), 0); // receive the next chunk into the buffer
        if (i > 0)
            source.append(buffer, i); // buffer is not null-terminated, so append exactly i bytes
    } while (i > 0); // 0 means the server closed the connection; a negative value means an error


    closesocket(sock); // close the socket
    WSACleanup();

    return source;
}

int main()
{
    ofstream fout;
    fout.open("Buffer.txt");
    fout << get_source(); // the string url doesnt matter
    fout.close();
    system("PAUSE");
}


Recommended answer

Okay, I see you just need help on one little bit of HTTP, not a breakdown on the whole thing. I'm going to leave my full description for future readers, though, after I give the short answer for you.

Short answer:

In the first line, where you say GET /foo/bar.html HTTP/1.1, the middle part (/foo/bar.html) is the path to the resource. So, for example, if you want to get http://www.myserver.com/foo/bar.html then you put /foo/bar.html there. If you wanted to get http://www.myserver.com/get/my/file.html then the first line of your request would be GET /get/my/file.html HTTP/1.1. The remaining lines of your request do not need to change to get a different resource (although you'll want to change Host: if you get something from a different server entirely, e.g., Host: www.myserver.com).
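As a rough illustration of the short answer, the request text could be assembled from a host name and a resource path as in the sketch below. `build_get_request` is a hypothetical helper name chosen for this example; it is not part of Winsock or of the asker's code, and it sends only the three header lines discussed in this answer.

```cpp
#include <string>

// Builds a minimal HTTP/1.1 GET request for `path` on `host`.
// Hypothetical helper for illustration only.
std::string build_get_request(const std::string& host, const std::string& path)
{
    std::string request = "GET " + path + " HTTP/1.1\r\n"; // the resource path goes here
    request += "Host: " + host + "\r\n";                    // DNS name of the web server
    request += "Connection: close\r\n";                     // ask the server to close when done
    request += "\r\n";                                      // blank line ends the header
    return request;
}
```

With the asker's socket code, the result of `build_get_request("www.myserver.com", "/foo/bar.html")` could then be passed to `send` using `request.c_str()` and `request.size()` cast to `int`.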

Full description of HTTP:

Are you trying to get it without using any libraries, just raw sockets? If so, you'll have to implement the HTTP protocol (client side of it anyway), but the good news is HTTP is really easy to learn and almost as easy to implement. :)

To send a request for a page, open a connection to port 80 on the web server. Then send it this:

GET <resource> HTTP/1.1\r\n
Host: <web_server_name>\r\n
Connection: close\r\n
\r\n

Note that I have explicitly written out the \r\n line breaks to show you where they go. There are two important things about them: 1) you must use \r\n and not just \n in the protocol, and 2) the end of the HTTP header must be marked by a double line break, \r\n\r\n. (For your request, there is no data section, so the end of the header is also the end of your entire request message.)

Replace <resource> with the path to the file you want to get, and <web_server_name> with the DNS name of the web server. For example, if you wanted to retrieve http://www.cc.gatech.edu/~davel/classes/cs3251/summer2011/test/hypertext.html then the <web_server_name> (Host field) is www.cc.gatech.edu and the <resource> is /~davel/classes/cs3251/summer2011/test/hypertext.html.
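The split between `<web_server_name>` and `<resource>` can be sketched as a small string function. This is a minimal example under simplifying assumptions: `split_url` is a hypothetical name, and it handles only plain `http://host/path` URLs (no port numbers, no `https`).

```cpp
#include <string>
#include <utility>

// Splits a plain "http://host/path" URL into (host, resource).
// Hypothetical helper: ports, https, and user info are not handled.
std::pair<std::string, std::string> split_url(const std::string& url)
{
    const std::string scheme = "http://";
    // Skip the scheme prefix if it is present.
    std::size_t start = (url.compare(0, scheme.size(), scheme) == 0) ? scheme.size() : 0;
    std::size_t slash = url.find('/', start);
    if (slash == std::string::npos)
        return { url.substr(start), "/" };  // no path given: request the root
    return { url.substr(start, slash - start), url.substr(slash) };
}
```

For the example URL above, `first` would hold `www.cc.gatech.edu` and `second` would hold the path starting at `/~davel`.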

The web server will send back an HTTP response message on the same socket. If all goes well, you will get a message back that looks something like this:

HTTP/1.1 200 OK\r\n
Date: Mon, 23 May 2005 22:38:34 GMT\r\n
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)\r\n
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT\r\n
ETag: "3f80f-1b6-3e1cb03b"\r\n
Content-Type: text/html; charset=UTF-8\r\n
Content-Length: 131\r\n
Connection: close\r\n
\r\n
<html>
<head>
  <title>An Example Page</title>
</head>
<body>
  Hello World, this is a very simple HTML document.
</body>
</html>

Note the double \r\n\r\n again, which denotes the end of the HTTP header. After that is the data section, which contains the HTML source of the page. I have omitted explicitly showing the line breaks for the data section because they are part of the data itself, not the HTTP protocol (so they don't have to be \r\n). Also note the Content-Length field. It tells you how many bytes long the data section is (the HTML source, in this case), so you can read the correct length from the socket. There is no \r\n at the end of the data section. (The data itself may or may not include a line break at the end. If it does, it will be included in the Content-Length bytes.)
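If the whole response has already been collected into a string (as the asker's `get_source` does), the header and the data section can be separated at the first blank line. The following is a sketch only; `split_response` is a hypothetical name, and it assumes the full response is already in memory.

```cpp
#include <string>

// Splits a raw HTTP response into header and body at the first "\r\n\r\n".
// Returns false if the blank line that ends the header has not been seen.
bool split_response(const std::string& raw, std::string& header, std::string& body)
{
    const std::string separator = "\r\n\r\n";
    std::size_t pos = raw.find(separator);
    if (pos == std::string::npos)
        return false;
    header = raw.substr(0, pos);               // status line and header fields
    body = raw.substr(pos + separator.size()); // the data section (HTML source)
    return true;
}
```

After a successful call, `body` should be Content-Length bytes long if the server sent that field and the whole response was received.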

The only mildly hard part is receiving and parsing the HTTP messages. I find the easiest way to receive HTTP is to read one line at a time from the socket, parsing each header field as you see it (you don't have to handle every field; you can probably ignore many of them). Once you get the blank line, you know the header is done. Then just read the correct number of bytes from the socket for your data payload, as specified by Content-Length. (It's probably a good idea to error check before reading the data section by verifying 1) that you got 200 OK in the first line of the response, since anything else indicates some kind of error, and 2) that you actually got a Content-Length field somewhere in the header.)
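The line-by-line parsing described above might look like the sketch below. It works on header text that has already been read into a string, not on a live socket; `ResponseInfo` and `parse_header` are hypothetical names introduced for this example, and only the status code and Content-Length are extracted.

```cpp
#include <cctype>
#include <cstdlib>
#include <sstream>
#include <string>

struct ResponseInfo {
    int status = 0;            // e.g. 200
    long content_length = -1;  // -1 means no Content-Length field was seen
};

// Parses the header text (everything before the blank line), one line at a time.
ResponseInfo parse_header(const std::string& header)
{
    ResponseInfo info;
    std::istringstream lines(header);
    std::string line;
    bool first = true;
    while (std::getline(lines, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();  // getline splits on '\n'; drop the trailing '\r'
        if (first) {
            // The status line looks like "HTTP/1.1 200 OK".
            std::size_t space = line.find(' ');
            if (space != std::string::npos)
                info.status = std::atoi(line.c_str() + space + 1);
            first = false;
            continue;
        }
        std::size_t colon = line.find(':');
        if (colon == std::string::npos)
            continue;  // not a "Name: value" line; ignore it
        std::string name = line.substr(0, colon);
        // Header field names are case-insensitive, so compare in lowercase.
        for (char& c : name)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (name == "content-length")
            info.content_length = std::atol(line.c_str() + colon + 1);
    }
    return info;
}
```

A caller would then check that `status` is 200 and that `content_length` is not -1 before reading the data section.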

Also, the Connection: close field in the request, which is echoed back in the response, says that the server can close the TCP connection after it has sent you the response. If you want to make many requests, you might use Connection: keep-alive instead, but it gets a little more complicated because you then have to pay attention to the Connection field in the response. Technically, the server is allowed to send back Connection: close and close the socket even if you requested keep-alive. So just going with Connection: close produces simpler code, and is perfectly adequate if you only want one page anyway.
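If you do experiment with keep-alive, checking the response's Connection field could be sketched as follows. `server_will_close` is a hypothetical helper name, and the function assumes it is given the header text only (status line plus fields, without the data section).

```cpp
#include <cctype>
#include <string>

// Returns true if the response header contains a Connection field whose
// value includes "close". An absent field is treated as "not closing".
bool server_will_close(const std::string& header)
{
    // Lowercase a copy so the search is case-insensitive.
    std::string lower;
    for (unsigned char c : header)
        lower += static_cast<char>(std::tolower(c));

    const std::string key = "\r\nconnection:";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos)
        return false;
    std::size_t value = pos + key.size();
    std::size_t end = lower.find("\r\n", value);
    std::string field = lower.substr(
        value, end == std::string::npos ? std::string::npos : end - value);
    return field.find("close") != std::string::npos;
}
```

When this returns true, the socket should not be reused for another request, regardless of what the request asked for.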

The Wikipedia page for HTTP is of some help, but lacks detail. (I did shamelessly rip my HTTP response example from there, though.) https://en.wikipedia.org/wiki/Http

If someone has a link for a better online reference for HTTP (that's easier to follow than reading the standards document), please feel free to add it / edit this post, or put it in a comment.
