How to get a webpage title without downloading all the page source
Question
I'm looking for a method that will allow me to get the title of a webpage and store it as a string.
However, all the solutions I have found so far involve downloading the full source code of the page, which isn't really practical for a large number of webpages.
The only approach I can see is to limit how much is read, either by downloading only a set number of characters or by stopping once the title tag is reached. However, won't this obviously still be quite large?
Thanks.
Recommended answer
As the <title> tag is in the HTML itself, there is no way to find "just the title" without downloading the file. You should be able to download a portion of the file until you've read in the <title> tag or the </head> tag and then stop, but you'll still need to download at least a portion of the file.
This can be accomplished with HttpWebRequest/HttpWebResponse by reading data from the response stream until we've read in either a <title></title> block or the </head> tag. I added the </head> tag check because, in valid HTML, the title block must appear within the head block, so with this check we will never parse the entire file (unless there is no head block, of course).
The following should be able to accomplish this task:
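using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// url is assumed to be a string, defined elsewhere, holding the page address
string title = "";
try {
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    // dispose the response as well as the stream so the connection is released
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream stream = response.GetResponseStream()) {
        // compiled regex to check for <title></title> block; Singleline lets
        // the title text span line breaks
        Regex titleCheck = new Regex(@"<title[^>]*>\s*(.+?)\s*</title>", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
        int bytesToRead = 8092;
        byte[] buffer = new byte[bytesToRead];
        // a stateful decoder keeps multi-byte characters intact when they are
        // split across two reads (this assumes the page is UTF-8)
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bytesToRead)];
        string contents = "";
        int length = 0;
        while ((length = stream.Read(buffer, 0, bytesToRead)) > 0) {
            // decode the bytes just read and add them to the rest of the
            // contents that have been downloaded so far
            int charCount = decoder.GetChars(buffer, 0, length, chars, 0);
            contents += new string(chars, 0, charCount);
            Match m = titleCheck.Match(contents);
            if (m.Success) {
                // we found a <title></title> match =]
                title = m.Groups[1].Value;
                break;
            } else if (contents.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) {
                // reached end of head-block; no title found =[
                break;
            }
        }
    }
} catch (Exception e) {
    Console.WriteLine(e);
}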
UPDATE: Updated the original source example to use a compiled Regex and a using statement for the Stream, for better efficiency and maintainability.
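To see the early-stop behaviour without touching the network, the same loop can be pulled into a helper that accepts any Stream. The sketch below is illustrative and not part of the original answer: ReadTitle, its chunkSize parameter, and the sample page are hypothetical names, and it assumes C# 9 or later (top-level statements) and UTF-8 content.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: reads the stream chunk by chunk and stops as soon as
// a <title>...</title> block or the closing </head> tag has been seen.
static string ReadTitle(Stream stream, int chunkSize = 1024)
{
    var titleCheck = new Regex(@"<title[^>]*>\s*(.+?)\s*</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Stateful decoder: a multi-byte character split across two chunks is
    // held back until its remaining bytes arrive.
    Decoder decoder = Encoding.UTF8.GetDecoder();
    byte[] buffer = new byte[chunkSize];
    char[] chars = new char[Encoding.UTF8.GetMaxCharCount(chunkSize)];
    var contents = new StringBuilder();
    int length;
    while ((length = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        int charCount = decoder.GetChars(buffer, 0, length, chars, 0);
        contents.Append(chars, 0, charCount);
        string soFar = contents.ToString();
        Match m = titleCheck.Match(soFar);
        if (m.Success)
        {
            return m.Groups[1].Value;
        }
        if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0)
        {
            break; // end of the head block reached without finding a title
        }
    }
    return "";
}

// Usage with an in-memory page: a short head followed by a large body.
byte[] page = Encoding.UTF8.GetBytes(
    "<html><head><title> Example page </title></head><body>"
    + new string('x', 100000) + "</body></html>");
using (var sample = new MemoryStream(page))
{
    string title = ReadTitle(sample);
    Console.WriteLine(title);
    // Position shows how many bytes were actually consumed from the stream.
    Console.WriteLine(sample.Position + " of " + page.Length + " bytes read");
}
```

With a real response, the stream returned by GetResponseStream() could be passed to the helper in the same way. How many bytes the server has already sent by the time reading stops depends on network buffering, so the saving is in what the client reads and parses rather than a guaranteed byte count on the wire.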