Parsing HTML Page into Parent-Child Object C#


Problem description


      I'm parsing an HTML page and I'm new to this kind of parsing. Could you suggest how to parse the following HTML?

      HTML Code : http://notepad.cc/share/CFRURbrk3r

      For each type of room there is a list of sub-rooms, so I would like to group them as parent and children in a list of objects, so that each child can be accessed later.

      This is the code as far as I could get, but it does not add anything to the objects. Besides Fizzler, is there any other parser I could use in this case?

      var uricontent = File.ReadAllText("TestHtml/Bew.html"); 
      var html = new HtmlDocument(); // with HTML Agility pack         
      html.LoadHtml(uricontent);                      
      var doc = html.DocumentNode;                      
      var rooms = (from r in doc.QuerySelectorAll(".rates")                             
                   from s in r.QuerySelectorAll(".rooms")                           
                   from rd in r.QuerySelectorAll(".rate")                           
                   select new 
                   {                  
                      Name = rd.QuerySelector(".rate-description").InnerText.CleanInnerText(), 
                      Price = r.QuerySelector(".rate-price").InnerText.CleanInnerText(),
                      RoomType = s.QuerySelector("tr td h2").InnerText.CleanInnerText()   
                   }).ToArray();    
      

      Solution

      Update:

      Personally, I wouldn't use an array. I would use a List. A List lets you add particular nodes at particular positions and group them accordingly.

      Then you could simply:

      • Loop (foreach)
      • Find
      • Sort
      • Select

      This would let you filter through the content quickly, since each item is stored in the list.
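The list-based grouping described above could be sketched roughly as follows. This is only an illustration: the `RoomType` and `Rate` classes, their property names, and the sample values are hypothetical and do not come from the page in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical child object: one rate offered for a room type.
public class Rate
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

// Hypothetical parent object: a room type that owns a list of rates.
public class RoomType
{
    public string Name { get; set; }
    public List<Rate> Rates { get; } = new List<Rate>();
}

public static class RoomListDemo
{
    // Builds made-up sample data; real values would come from the parsed HTML.
    public static List<RoomType> BuildSample()
    {
        var deluxe = new RoomType { Name = "Deluxe" };
        deluxe.Rates.Add(new Rate { Name = "Flexible", Price = 120m });
        deluxe.Rates.Add(new Rate { Name = "Advance purchase", Price = 95m });

        var standard = new RoomType { Name = "Standard" };
        standard.Rates.Add(new Rate { Name = "Flexible", Price = 80m });

        return new List<RoomType> { deluxe, standard };
    }

    public static void Main()
    {
        List<RoomType> rooms = BuildSample();

        // Loop (foreach): walk each parent and its children.
        foreach (RoomType room in rooms)
            foreach (Rate rate in room.Rates)
                Console.WriteLine($"{room.Name}: {rate.Name} = {rate.Price}");

        // Find: first parent matching a predicate (null if there is none).
        RoomType found = rooms.Find(r => r.Name == "Standard");

        // Sort: order the parents in place by name.
        rooms.Sort((a, b) => string.Compare(a.Name, b.Name, StringComparison.Ordinal));

        // Select: project each parent to the price of its cheapest child.
        List<decimal> cheapest = rooms.Select(r => r.Rates.Min(x => x.Price)).ToList();
        Console.WriteLine($"{found.Name}; cheapest per room: {string.Join(", ", cheapest)}");
    }
}
```

Keeping the children in a `List<Rate>` on the parent means each room type carries its own rates, so there is no need to re-query the document once parsing is done.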


      Update:

      Another item I forgot to mention: the Html Agility Pack can do the following:

      • Grab a particular node / element.
      • Grab a parent and all of its child nodes / elements.

      It can also grab remote or local pages.
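As a rough sketch of those capabilities, the snippet below grabs one element, then reads its parent and its children. The markup in `SampleHtml` is made up for illustration and is not the page from the question; the class and method names are likewise hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class NodeNavigationDemo
{
    // Made-up markup for illustration only.
    public const string SampleHtml =
        "<html><body><ul id='rooms'><li>Single</li><li>Double</li></ul></body></html>";

    public static HtmlNode FindRoomsList(string html)
    {
        var document = new HtmlDocument();

        // Parse from a string. document.Load(path) would read a local file,
        // and new HtmlWeb().Load(url) would fetch a remote page instead.
        document.LoadHtml(html);

        // Grab one particular element by XPath.
        return document.DocumentNode.SelectSingleNode("//ul[@id='rooms']");
    }

    public static void Main()
    {
        HtmlNode list = FindRoomsList(SampleHtml);
        if (list == null) return; // SelectSingleNode returns null when nothing matches

        // Parent of the selected element.
        Console.WriteLine(list.ParentNode.Name);

        // Direct child elements named "li".
        foreach (HtmlNode item in list.Elements("li"))
            Console.WriteLine(item.InnerText);
    }
}
```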


      I would actually download the Html Agility Pack from NuGet. It is powerful and robust, and it will more than likely make it easier to scrub the desired data. You can install it by following these steps:

      • Go to Tools
      • Go to NuGet Package Manager
      • Select Package Manager Console
      • If the Package Manager Console did not open, open it from the lower left of Visual Studio.
      • Type the following command: Install-Package HtmlAgilityPack

      A great example can be found in this question.

      The premise is simple:

      HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();

      // Map the document to the HTML page.
      document.Load(filePath);

      // To parse from a string instead, call document.LoadHtml(htmlString).

      if (document.DocumentNode != null)
      {
           // SelectSingleNode returns null when nothing matches.
           HtmlAgilityPack.HtmlNode bodyNode = document.DocumentNode.SelectSingleNode("//body");
           if (bodyNode != null)
           {
                 // Do something with bodyNode.
           }
      }
      

      This example only shows the syntax, but with the HtmlAgilityPack it should be far easier to grab particular nodes out of the page and manipulate them accordingly.
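Applied to the parent-child grouping asked about in the question, a minimal sketch might look like the code below. The class names `rooms`, `rate`, `rate-description` and `rate-price` come from the question's selectors, but the actual page structure is not shown here, so this assumes each `.rooms` element contains an `h2` with the room type plus its own `.rate` elements. `RoomGroup`, `RateInfo`, `RoomParser` and the sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class RateInfo
{
    public string Name { get; set; }
    public string Price { get; set; }
}

public class RoomGroup
{
    public string RoomType { get; set; }
    public List<RateInfo> Rates { get; } = new List<RateInfo>();
}

public static class RoomParser
{
    // Made-up markup that follows the assumed structure; not the real page.
    public const string SampleHtml =
        "<div class='rates'><div class='rooms'><h2>Deluxe</h2>" +
        "<div class='rate'><span class='rate-description'>Flexible</span>" +
        "<span class='rate-price'>120</span></div>" +
        "<div class='rate'><span class='rate-description'>Advance</span>" +
        "<span class='rate-price'>95</span></div></div></div>";

    // XPath test for "has this CSS class"; a class attribute may hold several names.
    static string HasClass(string name) =>
        "contains(concat(' ', normalize-space(@class), ' '), ' " + name + " ')";

    // Decoded, trimmed text of a node, or an empty string if the node is missing.
    static string TextOf(HtmlNode node) =>
        node == null ? "" : HtmlEntity.DeEntitize(node.InnerText).Trim();

    public static List<RoomGroup> Parse(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        var groups = new List<RoomGroup>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var roomNodes = document.DocumentNode.SelectNodes("//*[" + HasClass("rooms") + "]");
        if (roomNodes == null) return groups;

        foreach (HtmlNode roomNode in roomNodes)
        {
            var group = new RoomGroup { RoomType = TextOf(roomNode.SelectSingleNode(".//h2")) };

            // The leading "." keeps the search inside the current parent node.
            var rateNodes = roomNode.SelectNodes(".//*[" + HasClass("rate") + "]");
            if (rateNodes != null)
            {
                foreach (HtmlNode rateNode in rateNodes)
                {
                    group.Rates.Add(new RateInfo
                    {
                        Name = TextOf(rateNode.SelectSingleNode(".//*[" + HasClass("rate-description") + "]")),
                        Price = TextOf(rateNode.SelectSingleNode(".//*[" + HasClass("rate-price") + "]"))
                    });
                }
            }
            groups.Add(group);
        }
        return groups;
    }

    public static void Main()
    {
        foreach (RoomGroup group in Parse(SampleHtml))
            foreach (RateInfo rate in group.Rates)
                Console.WriteLine($"{group.RoomType}: {rate.Name} ({rate.Price})");
    }
}
```

The key design choice is to query the children relative to each parent node rather than the whole document, so every rate ends up attached to the room type it belongs to.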

      Hopefully this points you in a better direction.
