Get and Download Pictures with AngleSharp

Question

I started using AngleSharp for a project in which I need to get and download not only the HTML but also the images of a document. I know that the Document object has a property called Images, but apparently it doesn't return all of them: I ran a test on a YouTube page and got only one image (repeated several times). For example, I'd like to get the thumbnail of the current video, which seems to be inside a <meta> tag. To be more precise, the images are stored in tags like this one:

<meta content="https://i.ytimg.com/vi/hW-kDv1WcQM/hqdefault.jpg" property="og:image">

So I wonder whether there is a way to select all the nodes/URLs of every image in a page, no matter which tag is used. I don't think QuerySelectorAll works in this case, since as far as I can tell it selects only one type of node. You can try the sample code from GitHub to verify this (I only replaced the URL with the YouTube one, and changed the selector too):

// Set up the configuration to support document loading
var config = Configuration.Default.WithDefaultLoader();
// Address of the YouTube page to inspect
var address  = "https://www.youtube.com/watch?v=hW-kDv1WcQM&feature=youtu.be";
// Asynchronously get the document in a new context using the configuration
var document = await BrowsingContext.New(config).OpenAsync(address);
// This CSS selector matches only <img> elements
var cellSelector = "img";
// Perform the query to get all matching elements
var cells = document.QuerySelectorAll(cellSelector);
// <img> elements have no text content, so read the src attribute instead
var titles = cells.Select(m => m.GetAttribute("src"));

Oh, sure, you can also add this to check that the Images property doesn't return the video thumbnails:

var Images = document.Images.Select(sl=> sl.Source).Distinct().ToList();

Is there any other way to select nodes based on the URL content (for example, all URLs ending with ".jpg", ".png", etc.)?

Recommended answer

You can use LINQ to collect every attribute in the page whose value is an image URL, like so:

.....
var document = await BrowsingContext.New(config).OpenAsync(address);

// List the image file extensions to match here:
var fileExtensions = new string[] { ".jpg", ".png" };

// Find every attribute of every element
// whose value ends with one of the listed file extensions (ignoring case;
// StringComparison lives in the System namespace)
var result = from element in document.All
             from attribute in element.Attributes
             where fileExtensions.Any(e => attribute.Value.EndsWith(e, StringComparison.OrdinalIgnoreCase))
             select attribute;

foreach (var item in result)
{
    Console.WriteLine(item.Value);
}
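
The query above only prints the URLs; the question also asks about downloading them. What follows is a minimal sketch of one way that step might look, not something taken from AngleSharp itself. The class `ImageDownloader` and its methods `ToImageUri` and `DownloadAllAsync` are hypothetical names introduced here for illustration, and the extra extensions (".jpeg", ".gif") are an assumption. The helper resolves each attribute value against the page address, so relative paths work too, and checks the extension on the URL path only, so a value such as "photo.jpg?size=large" still matches.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper class; none of these names come from AngleSharp.
public static class ImageDownloader
{
    // Assumed list of extensions; adjust to taste.
    private static readonly string[] ImageExtensions = { ".jpg", ".jpeg", ".png", ".gif" };

    // Resolves a raw attribute value against the page address and returns an
    // absolute http(s) URI when its path ends with a known image extension.
    // Returns null for anything else.
    public static Uri ToImageUri(string pageAddress, string attributeValue)
    {
        if (string.IsNullOrWhiteSpace(attributeValue)) return null;
        if (!Uri.TryCreate(new Uri(pageAddress), attributeValue.Trim(), out var uri)) return null;
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) return null;

        // AbsolutePath excludes the query string, so "a.jpg?size=large" still matches.
        var extension = Path.GetExtension(uri.AbsolutePath);
        return ImageExtensions.Contains(extension, StringComparer.OrdinalIgnoreCase) ? uri : null;
    }

    // Downloads every distinct image URL found among the attribute values
    // into targetFolder, prefixing file names with a counter to avoid clashes.
    public static async Task DownloadAllAsync(
        string pageAddress, IEnumerable<string> attributeValues, string targetFolder)
    {
        Directory.CreateDirectory(targetFolder);

        var uris = attributeValues
            .Select(value => ToImageUri(pageAddress, value))
            .Where(uri => uri != null)
            .Distinct()
            .ToList();

        using (var client = new HttpClient())
        {
            var index = 0;
            foreach (var uri in uris)
            {
                var bytes = await client.GetByteArrayAsync(uri);
                var fileName = $"{index++}_{Path.GetFileName(uri.AbsolutePath)}";
                File.WriteAllBytes(Path.Combine(targetFolder, fileName), bytes);
            }
        }
    }
}
```

With the `address` and `result` variables from the code above, a call could look like `await ImageDownloader.DownloadAllAsync(address, result.Select(a => a.Value), "images");`, where "images" is an arbitrary folder name. Since the `content` value of the `og:image` meta tag from the question ends in ".jpg", it would be among the URLs picked up this way.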
