Iterate through an HTML string to find all img tags and replace the src attribute values

Question

I have HTML code as a string. I need to find every img tag in that string, read the value of its src attribute, and pass that value to a function. The function returns a complete img tag that must replace the img tag that was read.

It needs to iterate through the whole string and apply the same logic to every img tag.

For example, suppose my HTML string looks like this:

string htmlBody = "<p>Hi everyone</p><img src=\"...\" /><p>I am here</p><img src=\"...\" />";

I have the following code, which finds the first img tag, takes its src value (a base64 string), and converts it into a byte array to create a stream. I can then build a new src value that links to that stream.

  //Remove from all src attributes "data:image/png;base64"      
  string res = Regex.Replace(htmlBody, "data:image\\/\\w+\\;base64\\,", "");
  //Match the img tag and get the base64  string value
  string matchString = Regex.Match(res, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
  var imageData = Convert.FromBase64String(matchString);
  var contentId = Guid.NewGuid().ToString();
  LinkedResource inline = new LinkedResource(new MemoryStream(imageData), "image/jpeg");
  inline.ContentId = contentId;
  inline.TransferEncoding = TransferEncoding.Base64;
  //Replace all img tags with the new img tag 
  htmlBody = Regex.Replace(htmlBody, "<img.+?src=[\"'](.+?)[\"'].*?>", @"<img src='cid:" + inline.ContentId + @"'/>");

As you can see, I finally end up with the new img tag to use as the replacement:

   <img src='cid:" + inline.ContentId + @"'/>

But this code replaces every img tag with the same content. I need to be able to take one img tag, run the logic, replace it, and then continue with the next img tag.

I hope you can give me an idea of how to do that. Thanks in advance.

Recommended answer

If I understand your need correctly, you can use HtmlAgilityPack for this. Regular expressions can behave unexpectedly on HTML. Can you try the code below?

using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Mail;
using System.Net.Mime;
using HtmlAgilityPack;

public static string DoIt()
{
        string htmlString = "";
        using (WebClient client = new WebClient())
            htmlString = client.DownloadString("http://dean.edwards.name/my/base64-ie.html"); // Example source of base64 img src values; replace it with your own HTML string.

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(htmlString);
        document.DocumentNode.Descendants("img")
                            .Where(e =>
                            {
                                // Keep only img tags whose src is an embedded data URI.
                                string src = e.GetAttributeValue("src", null) ?? "";
                                return src.StartsWith("data:image");
                            })
                            .ToList()
                            .ForEach(x =>
                            {
                                string currentSrcValue = x.GetAttributeValue("src", null);
                                currentSrcValue = currentSrcValue.Split(',')[1]; // Base64 part of the data URI
                                byte[] imageData = Convert.FromBase64String(currentSrcValue);
                                string contentId = Guid.NewGuid().ToString();
                                // Keep a reference to this resource (for example in a list) so it can be
                                // added to the mail's AlternateView.LinkedResources later.
                                LinkedResource inline = new LinkedResource(new MemoryStream(imageData), "image/jpeg");
                                inline.ContentId = contentId;
                                inline.TransferEncoding = TransferEncoding.Base64;

                                // Each img tag gets its own cid reference.
                                x.SetAttributeValue("src", "cid:" + inline.ContentId);
                            });

        // Return the rewritten HTML (the original snippet computed it but never returned it).
        return document.DocumentNode.OuterHtml;
}
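
The code above creates a LinkedResource for each image, but a cid: reference only resolves if that resource travels with the message. A minimal sketch of that wiring follows, assuming the resources were collected in a list while the HTML was rewritten. The class name InlineImageMail and the method Build are hypothetical names used for illustration; AlternateView, LinkedResource, and MailMessage come from System.Net.Mail.

```csharp
using System.Collections.Generic;
using System.Net.Mail;
using System.Net.Mime;
using System.Text;

public static class InlineImageMail
{
    // Hypothetical helper: wraps the rewritten HTML and its inline images in a MailMessage.
    public static MailMessage Build(string html, IEnumerable<LinkedResource> images)
    {
        // The HTML body is sent as an alternate view so linked resources can be attached to it.
        AlternateView view = AlternateView.CreateAlternateViewFromString(
            html, Encoding.UTF8, MediaTypeNames.Text.Html);

        // Each resource's ContentId must match a cid: value used in the HTML.
        foreach (LinkedResource image in images)
            view.LinkedResources.Add(image);

        var message = new MailMessage();
        message.AlternateViews.Add(view);
        return message;
    }
}
```

Sender, recipients, and subject still have to be set on the returned message before it is sent.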

You can get HtmlAgilityPack from https://www.nuget.org/packages/HtmlAgilityPack
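
If adding a dependency is not an option, the per-tag logic asked about in the question can also be expressed with the Regex.Replace overload that takes a MatchEvaluator, which runs a callback once per match. The following is only a sketch under assumptions: the class name InlineImageRegex, the method Rewrite, and the pattern are illustrative, and the pattern only handles img tags whose src is a quoted base64 data URI. The caveat about regex on HTML still applies.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Mail;
using System.Net.Mime;
using System.Text.RegularExpressions;

public static class InlineImageRegex
{
    // Captures the media type ("type") and the base64 payload ("data") of each data-URI img tag.
    private static readonly Regex DataUriImg = new Regex(
        "<img\\b[^>]*?\\bsrc=[\"']data:(?<type>image/[\\w.+-]+);base64,(?<data>[^\"']+)[\"'][^>]*>",
        RegexOptions.IgnoreCase);

    // Replaces every matching img tag with its own cid: reference and
    // adds the corresponding LinkedResource to the supplied list.
    public static string Rewrite(string html, List<LinkedResource> resources)
    {
        return DataUriImg.Replace(html, match =>
        {
            byte[] imageData = Convert.FromBase64String(match.Groups["data"].Value);
            var inline = new LinkedResource(new MemoryStream(imageData), match.Groups["type"].Value)
            {
                ContentId = Guid.NewGuid().ToString(),
                TransferEncoding = TransferEncoding.Base64
            };
            resources.Add(inline);
            return "<img src='cid:" + inline.ContentId + "'/>";
        });
    }
}
```

Because the callback runs separately for each match, every img tag receives a different ContentId, and the media type is taken from the data URI instead of being hard-coded.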

Hope this helps.
