Rip images of complex HTML file using regex


Problem description

OK, the issue I'm having is that the images are not in an href, a comment, or a src attribute. This is a snippet of the code I want to get them from, but I also want to select them by size: all images that are 2048 x 2048, and the same for 4096 x 4096.

", "createdAt": "2018-07-30T13:33:21.373947"}, {"uid": "c442c352934545b183e16ce9aebd91cb", "width": 2048, "options": {"format": "R", "quality": 88}, "updatedAt": "2018-08-01T17:51:24.738232", "height": 2048, "size": 618478, "url": "https://media.sketchfab.com/urls/ea1adc30399045a2b101e16ba65a856f/dist/textures/a4291782af5f4ce39e637c89ec91fa9b/c442c352934545b183e16ce9aebd91cb.jpeg", "createdAt": "2018-08-01T17:51:25.334608"}, {"uid": "84275b9d01b54836893e355991288c2f", "width": 1024, "options": {"format": "R", "quality": 92}, "updatedAt": "2018-08-01T17:51:25.341010", "height": 1024, "size": 220485, "url": "https://media.sketchfab.com/urls/ea1adc30399045a2b101e16ba65a856f/dist/textures/a4291782af5f4ce39e637c89ec91fa9b/84275b9d01b54836893e355991288c2f.jpeg", "createdAt": "2018-08-01T17:51:25.451079"}, {"uid": "88897653dc004ded9faee4eaf2fa0373", "width": 512, "options": {"format": "R", "quality": 95}, "updatedAt": "2018-08-01T17:51:25.456671", "height": 512, "size": 83896, "url": "https://media.sketchfab.com/urls/ea1adc30399045a2b101e16ba65a856f/dist/textures/a4291782af5f4ce39e637c89ec91fa9b/88897653dc004ded9faee4eaf2fa0373.jpeg"







What I want to do is check the width and height, and if they match 2048 x 2048, extract that image and save it to a folder. Same with 4096 x 4096.

I managed to make a regex for the long link:

(https://media.sketchfab.com)/urls/[a-z0-9]+/dist/textures/[a-z0-9]+/[a-z0-9]+.jpeg





But I'm not sure how to get it to download all images depending on size. If anyone could help it would be much appreciated; I'm really bad at this. Thanks in advance, elfenliedtopfan5

What I have tried:

(https://media.sketchfab.com)/urls/[a-z0-9]+/dist/textures/[a-z0-9]+/[a-z0-9]+.jpeg







string urlAddress = "https://sketchfab.com/3d-models/mossberg-590-tactical-ea1adc30399045a2b101e16ba65a856f";
string urlBase = "https://sketchfab.com";

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string data = "";
if (response.StatusCode == HttpStatusCode.OK)
{
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;
    if (response.CharacterSet == null)
        readStream = new StreamReader(receiveStream);
    else
        readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
    data = readStream.ReadToEnd();
    response.Close();
    readStream.Close();
}
MatchCollection matches = Regex.Matches(data, @"(https://media.sketchfab.com)/urls/[a-z0-9]+/dist/textures/[a-z0-9]+/[a-z0-9]+.jpeg");
for (int a = 0; a < matches.Count; a++)
    MessageBox.Show(urlBase + matches[a].Groups["link"].Value);

Recommended answer

As DerekTP123 said, don't use regex for this. Work with the JSON object: it gives you width, height and URL exactly as you need them. It will make the code simpler to read, easier to maintain, and less error prone.

Sure, you will have to learn how to deal with JSON, but as JSON is everywhere you might as well learn it and get the benefits right away.
The most popular library is JSON.NET, so I recommend you stick to that, as most examples you find will use it. Microsoft uses it as well.

Google something like: Json.Net tutorial.

In general you have two options. The first is to write a C# class representing the JSON and have JSON.NET fill it in. It is a bit more typing initially, but then it is easy to use. See more here[^].
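A minimal sketch of that first option, assuming you have already isolated the JSON array of image objects from the page as a string. The class name `TextureImage`, the helper `UrlsBySize`, and the choice of properties are my own, picked to match the fields visible in the question's snippet; the Newtonsoft.Json NuGet package is required.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;

// Hypothetical class mirroring the fields visible in the question's snippet.
// JSON properties without a matching member (e.g. "options") are ignored by default.
public class TextureImage
{
    [JsonProperty("uid")] public string Uid { get; set; }
    [JsonProperty("width")] public int Width { get; set; }
    [JsonProperty("height")] public int Height { get; set; }
    [JsonProperty("size")] public long Size { get; set; }
    [JsonProperty("url")] public string Url { get; set; }
}

public static class TypedExample
{
    // Returns the URLs of all square images whose side is one of the wanted sizes.
    public static List<string> UrlsBySize(string jsonArray, params int[] wantedSizes)
    {
        var images = JsonConvert.DeserializeObject<List<TextureImage>>(jsonArray);
        return images
            .Where(i => i.Width == i.Height && wantedSizes.Contains(i.Width))
            .Select(i => i.Url)
            .ToList();
    }
}
```

Calling `TypedExample.UrlsBySize(json, 2048, 4096)` would then give you just the links for the two sizes you are after.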

Alternatively you can read the object as a JObject. Then you have to index all the properties yourself. You won't need any C# classes representing the JSON, but you will have to make sure you type the right property names when using it, because the compiler won't help you. Some examples here[^].
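The second option could look roughly like this, again assuming the JSON array of images is already available as a string. `DynamicExample` and `UrlsBySize` are names I made up for the sketch; `JArray.Parse` and the indexer come from Json.NET's `Newtonsoft.Json.Linq` namespace.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json.Linq;

public static class DynamicExample
{
    // Walks the array without any C# classes describing the JSON.
    public static List<string> UrlsBySize(string jsonArray, int wantedSize)
    {
        var urls = new List<string>();
        foreach (JToken image in JArray.Parse(jsonArray))
        {
            // Property names are plain strings here; a typo yields null
            // at run time instead of a compile error.
            int? width = (int?)image["width"];
            int? height = (int?)image["height"];
            if (width == wantedSize && height == wantedSize)
                urls.Add((string)image["url"]);
        }
        return urls;
    }
}
```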

If, as you mention in the comment, it isn't easy to access the JSON for one reason or another, it is possible to write a regex. Just be prepared to fiddle with it regularly to keep it working.

I have created an example that seems to work:
"width"\s*:\s*(?'width'\d+).+?"height"\s*:\s*(?'height'\d+).+?"url"\s*:\s*\"(?'url'.+?)"



It relies on the order of the properties in the JSON object. That is something you really shouldn't do, as property order isn't significant in JSON, meaning the code writing it could change it around for no apparent reason. But it is not easy to write a simple regex that can handle reordering.

The regex is pretty straightforward as regexes go. First it looks for "width", then a colon (with optional whitespace around it), followed by a named capture group 'width' taking the next digits. The ?'width' just inside the parentheses defines the name. It is not necessary, but it makes it easier to extract the values later in a robust way, as future changes could add additional groups (a group being anything in parentheses).

It then skips until "height", which is captured the same way. Notice that the skipping over other JSON properties is done using .+?. The trailing ? tells the regex to "stop" at the first possible opportunity (so the first time it can match "height"). Without this, the regex is "greedy", so it will match as much as it can. This would make it read the first width in your text, then skip all the way to the last height, and you would only get a single match.
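To see the difference for yourself, here is a small sketch with a made-up two-object sample string (`GreedyDemo` and `Sample` are just illustration names). The only difference between the two patterns is the ? after .+.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Made-up sample with two image objects on one line.
    public const string Sample =
        "{\"width\": 2048, \"height\": 2048}, {\"width\": 1024, \"height\": 1024}";

    // Lazy .+? stops at the nearest "height": one match per object.
    public static int CountLazy() =>
        Regex.Matches(Sample, @"""width""\s*:\s*(\d+).+?""height""\s*:\s*(\d+)").Count;

    // Greedy .+ runs to the last "height": a single match spanning both objects.
    public static int CountGreedy() =>
        Regex.Matches(Sample, @"""width""\s*:\s*(\d+).+""height""\s*:\s*(\d+)").Count;
}
```

With this sample, `CountLazy()` returns 2 and `CountGreedy()` returns 1.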

Finally it skips to "url" and creates a new group capturing the text inside the following quotes, again using a non-greedy match to make sure it stops at the first quote instead of eating your entire text in one go.
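Put together, using the regex from C# might look like the sketch below. The pattern is the one shown above, written as a verbatim string (so each " is doubled); `RegexExample` and `UrlsBySize` are names I invented, and `pageText` is assumed to be the downloaded page content (the `data` variable in your code).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexExample
{
    // Same pattern as above; relies on width, height, url appearing in that order.
    static readonly Regex ImagePattern = new Regex(
        @"""width""\s*:\s*(?'width'\d+).+?""height""\s*:\s*(?'height'\d+).+?""url""\s*:\s*""(?'url'.+?)""");

    // Returns the URLs of all images that are wantedSize x wantedSize.
    public static List<string> UrlsBySize(string pageText, int wantedSize)
    {
        var urls = new List<string>();
        foreach (Match m in ImagePattern.Matches(pageText))
        {
            // Named groups keep working even if more groups are added later.
            int width = int.Parse(m.Groups["width"].Value);
            int height = int.Parse(m.Groups["height"].Value);
            if (width == wantedSize && height == wantedSize)
                urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```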

You can of course play with making it more restrictive to avoid false positives. I recommend you use an online validator to quickly see the result of your changes, maybe something like regex101.com[^]

Once you have the URL it is easy to download the image (as long as the site does not try to stop you).

I recommend you look at WebClient[^]

Specifically the methods OpenRead, DownloadData, and DownloadFile. You can use any of them, but depending on what you want to do with the image, one will most likely offer a more convenient output than the other two. You can also replace your HttpWebRequest/HttpWebResponse with a WebClient. It will do all the work of reading the response stream for you (basically it just wraps HttpWebRequest/HttpWebResponse and does all the boring stuff for you).
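Since you want the images saved to a folder, DownloadFile is probably the most convenient of the three. A rough sketch, assuming you already have the list of URLs; `DownloadExample`, `LocalPathFor` and `DownloadAll` are hypothetical helper names, and the file name is simply taken from the last part of the URL.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

public static class DownloadExample
{
    // Builds the local path from the last segment of the URL,
    // e.g. ".../c442....jpeg" becomes "<folder>/c442....jpeg".
    public static string LocalPathFor(string url, string folder) =>
        Path.Combine(folder, Path.GetFileName(new Uri(url).LocalPath));

    // Downloads every URL into the folder, creating the folder if needed.
    public static void DownloadAll(IEnumerable<string> urls, string folder)
    {
        Directory.CreateDirectory(folder);
        using (var client = new WebClient())
        {
            foreach (string url in urls)
                client.DownloadFile(url, LocalPathFor(url, folder));
        }
    }
}
```

Note that newer .NET versions mark WebClient as obsolete in favour of HttpClient, although it still works.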

