Parsing HTML text into a structure

Views: 59 | Published: 2019/6/21 15:36:45 | Tags: C#, HTML


Problem description

I'm having trouble parsing this HTML text into a structure.
I want to parse the HTML below into this structure:

struct result
{
  public int code;
  public string sub;
  public string grade;
}

The assignment will be like this:

result.code=176
result.sub="CHEMISTRY"
result.grade="A-"

<TR>
<TD bgColor=#fefefe align=middle><STRONG>176</STRONG></TD>
<TD bgColor=#fafafa width="70%" align=left><STRONG>CHEMISTRY</STRONG></TD>
<TD bgColor=#fefefe align=middle><STRONG>A- </STRONG></TD>
</TR>

Thanks to all.

Update:
What am I trying to do?
I'm just trying to download all the results from a website and save them in a local database. That could be about 12k results, for my district only, not the whole country. I'm very close to finishing with my own code, but if I can find a proper or simpler way from CP, that would be a great help. I've already got as far as the GPA. Now the subjects need to be parsed.
Do me a favor.

HSC 2010 Result Publication

Roll No.  124450 
Registration No.  719662 
Academic Session  2008-09 
Name  NASRIN LIPI  
Father's Name  MD. MOAZZAM HOSSAIN  
Institute Name  REBATI MOHAN UCHCHA MADHYAMIK BIDYALYA, SIDDHIRGONJ  
Center Name  NARAYANGANJ - 4, GOVT. ADAMJEENAGAR M. W. COLLEGE  
Student Group  SCIENCE  
Student Type  REGULAR  
Result  PASSED 
GPA  5.00 

Subject-wise Grade/ Mark Sheet
Code  Subject  Grade/ Marks  
107 ENGLISH A  
174 PHYSICS A+  
176 CHEMISTRY A+  
178 BIOLOGY A+  
127 MATHEMATICS A+

Solution

Each table row can be parsed pretty easily.
(I am assuming that no <td> block is ever empty.)
It's not foolproof, should only be used with a list of <tr> items, and it's not debugged (haven't got VS here).

// Split needs the StringSplitOptions overload for string separators.
// The sample markup uses upper-case <TR>, so both spellings are listed.
string[] rows = HTML.Split(new string[] { "<tr>", "</tr>", "<TR>", "</TR>" }, StringSplitOptions.RemoveEmptyEntries);
List<result> results = new List<result>();
foreach (string row in rows)
{
  // Declaring a few temporary variables.
  string code = string.Empty;
  string sub = string.Empty;
  string grade = string.Empty;
  bool inTag = false;

  for (int i = 0; i < row.Length; i++)
  {
    if (row[i] == '<')
      inTag = true;
    else if (row[i] == '>')
      inTag = false;
    else if (!inTag && !char.IsWhiteSpace(row[i])) // inTag is true while you're between the < and > characters; whitespace between tags is skipped.
    {
      int end = row.IndexOf('<', i); // next occurrence of <, or -1 if there is none
      if (end < 0) end = row.Length;
      string text = row.Substring(i, end - i).Trim(); // the length is end - i, not end - 1

      if (code.Length == 0) // is 'code' already defined?
          code = text;
      else if (sub.Length == 0) // is 'sub' already defined?
          sub = text;
      else if (grade.Length == 0) // is 'grade' already defined?
          grade = text;

      i = end - 1; // prevent doubles: the next iteration lands on the '<'
    }
  }
  if (code.Length != 0) // last check
     results.Add(new result { code = int.Parse(code), sub = sub, grade = grade }); // the struct has no constructor, so use an object initializer
}
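
For comparison, here is a minimal sketch of the same row extraction using regular expressions instead of a character loop. It is not the answerer's method. The helper name RowParser is made up for illustration, and it assumes flat rows like the sample in the question (no nested tables, no '>' inside attribute values). Rows whose first cell is not a number are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Same shape as the struct in the question.
public struct result
{
    public int code;
    public string sub;
    public string grade;
}

public static class RowParser // hypothetical helper name
{
    // One <tr>...</tr> block, any letter case, possibly spanning several lines.
    static readonly Regex RowPattern =
        new Regex(@"<tr\b[^>]*>(.*?)</tr>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // One <td>...</td> cell inside a row.
    static readonly Regex CellPattern =
        new Regex(@"<td\b[^>]*>(.*?)</td>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Any tag left inside a cell, such as <STRONG>.
    static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    public static List<result> Parse(string html)
    {
        var results = new List<result>();
        foreach (Match row in RowPattern.Matches(html))
        {
            MatchCollection cells = CellPattern.Matches(row.Groups[1].Value);
            if (cells.Count < 3) continue; // not a code/subject/grade row

            string[] text = new string[3];
            for (int i = 0; i < 3; i++)
                text[i] = TagPattern.Replace(cells[i].Groups[1].Value, "").Trim();

            int code;
            if (!int.TryParse(text[0], out code)) continue; // e.g. a header row
            results.Add(new result { code = code, sub = text[1], grade = text[2] });
        }
        return results;
    }
}
```

Calling RowParser.Parse on the <TR> block from the question should give one entry with code 176, sub "CHEMISTRY" and grade "A-". Regular expressions are fragile on arbitrary HTML, so treat this as a shortcut for this one page layout only.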

I know it's not fair to answer my own question, but I'm going to do that.
First:
Download all results from web.
The query string is:

string url = "http://www.educationboardresults.gov.bd/arch/result.php?roll=" + roll + "&board=dhaka&exam=HSC&year=" + year;

Call it from a loop:

for (i = startfrom; i <= endat; i++)
{
    HSC_WEB hw = new HSC_WEB(i, year); // This will download in a separate thread
    hw.MyEvent += new HSC_WEB.ProgressDelegate(hw_Completed); // and fire when finished
    TotalReq += 1;
................
}
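
The HSC_WEB class itself is not shown, so the following is only a rough sketch of one way such a download loop could look on a .NET version that has HttpClient and async/await. ResultDownloader, BuildUrl, DownloadRangeAsync and the maxParallel parameter are all hypothetical names, not part of the code above. Only the URL format is taken from the post.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public static class ResultDownloader // hypothetical name, not the HSC_WEB class
{
    static readonly HttpClient Client = new HttpClient();

    // Builds the query URL quoted above for one roll number.
    public static string BuildUrl(int roll, int year)
    {
        return "http://www.educationboardresults.gov.bd/arch/result.php?roll=" + roll
             + "&board=dhaka&exam=HSC&year=" + year;
    }

    // Downloads every roll in [startFrom, endAt], at most maxParallel requests at a time.
    public static async Task<Dictionary<int, string>> DownloadRangeAsync(
        int startFrom, int endAt, int year, int maxParallel)
    {
        var pages = new Dictionary<int, string>();
        for (int first = startFrom; first <= endAt; first += maxParallel)
        {
            int last = Math.Min(endAt, first + maxParallel - 1);
            var batch = new List<Task<string>>();
            for (int roll = first; roll <= last; roll++)
                batch.Add(Client.GetStringAsync(BuildUrl(roll, year)));

            string[] html = await Task.WhenAll(batch); // throws if any request in the batch fails
            for (int k = 0; k < html.Length; k++)
                pages[first + k] = html[k]; // batch order matches roll order
        }
        return pages;
    }
}
```

Downloading in small batches keeps the number of simultaneous requests bounded, which matters when there are around 12k rolls to fetch. Error handling and retries are left out here.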

Second: Parse the HTML text to plain text using the WebBrowser control. It removes all tags and comments.
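
If a WebBrowser control is not available (for example in a console program), the tag stripping can be approximated with regular expressions. This is a sketch under assumptions, not what the post does: HtmlText and ToPlainText are invented names, and the output will not necessarily match what the WebBrowser control produces. It puts each table row on its own line, which is the layout of the sample text in the question.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText // hypothetical helper name
{
    public static string ToPlainText(string html)
    {
        // Drop comments and script/style blocks, content included.
        string s = Regex.Replace(html, @"<!--.*?-->", " ", RegexOptions.Singleline);
        s = Regex.Replace(s, @"<(script|style)\b.*?</\1\s*>", " ",
                          RegexOptions.Singleline | RegexOptions.IgnoreCase);
        // Line breaks in the markup carry no meaning, so flatten all whitespace first.
        s = Regex.Replace(s, @"\s+", " ");
        // The end of a table row (or a <br>) becomes a real line break.
        s = Regex.Replace(s, @"</tr\s*>|<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        // Every other tag becomes a space.
        s = Regex.Replace(s, @"<[^>]+>", " ");
        // Decode entities such as &nbsp; and &amp;.
        s = WebUtility.HtmlDecode(s);
        // Collapse runs of spaces (including non-breaking ones) and tidy the line breaks.
        s = Regex.Replace(s, @"[ \t\u00A0]+", " ");
        s = Regex.Replace(s, @" ?\r?\n ?", "\n");
        s = Regex.Replace(s, @"\n{2,}", "\n");
        return s.Trim();
    }
}
```

For the <TR> block in the question this should return the single line "176 CHEMISTRY A-".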
Third: With that plain text (the sample text given in my question), the following class finds everything I wanted.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace HSC_RES_Downloader
{
    class stringsearch
    {
        public stringsearch(string BaseString)
        {
            this.BaseString = BaseString;
        }
        public string BaseString;
        public string getyear()
        {
            return getstring(BaseString, "HSC", "", "Result");
        }
        public string getroll()
        {
            return getstring(BaseString, "Roll No.", "", "Registration");
        }
        public string getregno()
        {
            return getstring(BaseString, "Registration No.", "", "Academic Session");
        }
        public string getSession()
        {
            return getstring(BaseString, "Academic Session", "", "Name");
        }
        public string getname()
        {
            return getstring(BaseString, "Academic Session", "Name", "Father's Name");
        }
        public string getfname()
        {
            return getstring(BaseString, "Father's Name", "", "Institute");
        }
        public string getInstitutenane()
        {
            return getstring(BaseString, "Institute Name", "", "Center");
        }
        public string getCenter()
        {
            return getstring(BaseString, "Center Name", "", "Student Group");
        }
        public string getGroup()
        {
            return getstring(BaseString, "Student Group", "", "Student Type");
        }
        public string getsType()
        {
            return getstring(BaseString, "Student Type", "", "Result");
        }
        public string getResult()
        {
            return getstring(BaseString, "Student Type", "Result", "GPA");
        }
        public string getGPA()
        {
            return getstring(BaseString, "GPA", "", "Subject-wise");
        }

        public List<List<string>> subjectsgpa()
        { 
            List<string> sublist;
            
            string substr="BENGALI,ENGLISH,SECRETARIAL MANAGEMENT,COMMERCIAL GEOGRAPHY,"+
                    "STATISTICS,COMPUTER STUDIES,AGRICULTURE STUDIES," +
                    "PRINCIPLE OF BUSINESS,ACCOUNTING,"+
                    "PHYSICS,CHEMISTRY,MATHEMATICS,BIOLOGY,"+
                    "SOCIAL WELFARE,ISLAMIC HISTORY,ISLAMIC STUDIES,CIVICS";
            sublist = substr.Split(',').ToList();          
            
            int ps1 = 0;
            ps1 = BaseString.IndexOf("Code Subject Grade/ Marks");
            substr = BaseString.Substring(ps1 + "Code Subject Grade/ Marks".Length);
            string gpa=string.Empty ;
            List<List<string>> subgpas = new List<List<string>>();
            foreach (string SubName in sublist)
            {
                ps1 = substr.IndexOf(SubName);
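                if (ps1 >= 0) // IndexOf returns -1 when the subject is absent; 0 is a valid position
                {
                    List<string> subgpa = new List<string>();
                    // Take up to two characters after the name without running past the end of the text.
                    int start = ps1 + SubName.Length;
                    gpa = substr.Substring(start, Math.Min(2, substr.Length - start));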
                    subgpa.Add(gpa.Trim());
                    subgpa.Add( SubName);
                    subgpas.Add(subgpa);
                }
            }
            return  subgpas ;
        }

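        // Finds str1, then str2 after it, then endstring after that,
        // and returns the trimmed text between str2 and endstring.
        // Returns an empty string if any marker is missing (IndexOf gives -1).
        public string getstring(string basestr, string str1, string str2, string endstring)
        {
            int ps1 = basestr.IndexOf(str1);
            if (ps1 < 0) return string.Empty;
            int ps2 = basestr.IndexOf(str2, ps1 + str1.Length);
            if (ps2 < 0) return string.Empty;
            int ps3 = basestr.IndexOf(endstring, ps2 + str2.Length);
            if (ps3 < 0) return string.Empty;
            return basestr.Substring(ps2 + str2.Length, ps3 - ps2 - str2.Length).Trim();
        }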

    }

}

I still have a lot of things to do. Any modification will be appreciated.
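
One possible modification, sketched here under assumptions: if every subject sits on its own line in the plain text, as in the sample from the question, a single regular expression can pull out code, subject and grade without a hard-coded subject list or a fixed two-character grade width. SubjectLines is a made-up name, and the pattern only recognises letter grades (A to F with an optional + or -), not numeric marks.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SubjectLines // hypothetical helper name
{
    // A whole line: numeric code, subject name, then a letter grade such as A+, A- or F.
    static readonly Regex Line = new Regex(
        @"^[ \t]*(\d+)[ \t]+(.+?)[ \t]+([A-F][+-]?)[ \t]*\r?$", RegexOptions.Multiline);

    // Returns { code, subject, grade } triples in the order they appear in the text.
    public static List<string[]> Parse(string plainText)
    {
        var rows = new List<string[]>();
        foreach (Match m in Line.Matches(plainText))
            rows.Add(new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value });
        return rows;
    }
}
```

On the five subject lines from the question this should give five triples, starting with 107 / ENGLISH / A. Lines such as "GPA  5.00" or the column header are ignored because they do not start with a number.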

Also try HtmlAgilityPack, which is very useful for HTML parsing and processing.
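
A minimal sketch of what that could look like for the row in the question, assuming the HtmlAgilityPack NuGet package is referenced. AgilityRowParser is a name invented for this example, and the XPath expressions assume the subject rows are plain <tr> elements with three <td> cells.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack", not part of the .NET base library

// Same shape as the struct in the question.
public struct result
{
    public int code;
    public string sub;
    public string grade;
}

public static class AgilityRowParser // hypothetical helper name
{
    public static List<result> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<result>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return results;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("./td");
            if (cells == null || cells.Count < 3) continue; // not a code/subject/grade row

            // InnerText drops nested tags such as <STRONG>; DeEntitize decodes &nbsp; and friends.
            string first = HtmlEntity.DeEntitize(cells[0].InnerText).Trim();
            int code;
            if (!int.TryParse(first, out code)) continue; // e.g. a header row

            results.Add(new result
            {
                code = code,
                sub = HtmlEntity.DeEntitize(cells[1].InnerText).Trim(),
                grade = HtmlEntity.DeEntitize(cells[2].InnerText).Trim()
            });
        }
        return results;
    }
}
```

With a real parser there is no need to track '<' and '>' by hand, and unquoted attributes like bgColor=#fefefe are handled by the library.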
