Parsing HTML text into a structure


Problem description


I'm having trouble parsing this HTML text into a structure.
I want to parse the HTML text below into this structure:

struct result
{
  public int code;
  public string sub;
  public string grade;
}


The resulting assignments should look like this:

result.code=176
result.sub="CHEMISTRY"
result.grade="A-"

<TR>
<TD bgColor=#fefefe align=middle><STRONG>176</STRONG></TD>
<TD bgColor=#fafafa width="70%" align=left><STRONG>CHEMISTRY</STRONG></TD>
<TD bgColor=#fefefe align=middle><STRONG>A- </STRONG></TD>
</TR>


Thanks to all.

Updated:
What am I trying to do?
I'm trying to download all the results from a website and save them in a local database. That could be about 12k results, since I only need my district rather than the whole country. I'm very close to finishing with my own code, but if CP can suggest a proper or simpler way, that would be a great help. I've already parsed everything up to the GPA; now the subjects need to be parsed.
Do me a favor.


HSC 2010 Result Publication

Roll No.  124450 
Registration No.  719662 
Academic Session  2008-09 
Name  NASRIN LIPI  
Father's Name  MD. MOAZZAM HOSSAIN  
Institute Name  REBATI MOHAN UCHCHA MADHYAMIK BIDYALYA, SIDDHIRGONJ  
Center Name  NARAYANGANJ - 4, GOVT. ADAMJEENAGAR M. W. COLLEGE  
Student Group  SCIENCE  
Student Type  REGULAR  
Result  PASSED 
GPA  5.00 

Subject-wise Grade/ Mark Sheet
Code  Subject  Grade/ Marks  
107 ENGLISH A  
174 PHYSICS A+  
176 CHEMISTRY A+  
178 BIOLOGY A+  
127 MATHEMATICS A+

Solution

Each table row can be parsed pretty easily.
(I am assuming that no <td> block is ever empty.)
It's not foolproof, it should only be used with a list of <tr> items, and it's not debugged (I haven't got VS here).

// Split the markup into rows. The string[] overload of Split requires a StringSplitOptions argument.
string[] rows = HTML.Split(new string[] { "<TR>", "</TR>", "<tr>", "</tr>" }, StringSplitOptions.RemoveEmptyEntries);
List<result> results = new List<result>();
foreach (string row in rows)
{
  // Declaring a few temporary variables.
  string code = string.Empty;
  string sub = string.Empty;
  string grade = string.Empty;
  bool inTag = false;

  for (int i = 0; i < row.Length; i++)
  {
    if (row[i] == '<')
      inTag = true;
    else if (row[i] == '>')
      inTag = false;
    else if (!inTag) // inTag is true while you're between the < and > characters.
    {
      // Get the text from row, starting at i and stopping at the next occurrence of <.
      int end = row.IndexOf('<', i);
      if (end < 0) end = row.Length;
      string text = row.Substring(i, end - i).Trim();
      i = end - 1; // the loop's i++ then lands on '<', so inTag is set correctly
      if (text.Length == 0) continue; // only whitespace between two tags

      if (code.Length == 0) // is 'code' already defined?
        code = text;
      else if (sub.Length == 0) // is 'sub' already defined?
        sub = text;
      else if (grade.Length == 0) // is 'grade' already defined?
        grade = text;
    }
  }
  if (code.Length != 0) // last check
    results.Add(new result { code = int.Parse(code), sub = sub, grade = grade });
}
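
The same row-by-row idea can also be sketched with regular expressions instead of a hand-written scanner. This is only a sketch under the same assumptions as above (flat <tr>/<td> rows, no nested tables); RowParser and its members are names made up for this example, and the result struct is repeated so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public struct result
{
    public int code;
    public string sub;
    public string grade;
}

// Hypothetical helper, not part of the original answer.
public static class RowParser
{
    // One <tr>...</tr> block; Singleline lets '.' span line breaks.
    private static readonly Regex RowPattern = new Regex(
        @"<tr\b[^>]*>(.*?)</tr>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // One <td>...</td> cell inside a row.
    private static readonly Regex CellPattern = new Regex(
        @"<td\b[^>]*>(.*?)</td>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any remaining tag inside a cell, such as <STRONG>.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    public static List<result> Parse(string html)
    {
        var results = new List<result>();
        foreach (Match row in RowPattern.Matches(html))
        {
            var cells = new List<string>();
            foreach (Match cell in CellPattern.Matches(row.Groups[1].Value))
            {
                string text = TagPattern.Replace(cell.Groups[1].Value, string.Empty);
                cells.Add(WebUtility.HtmlDecode(text).Trim());
            }

            int code;
            // Keep only rows that look like code / subject / grade.
            if (cells.Count == 3 && int.TryParse(cells[0], out code))
                results.Add(new result { code = code, sub = cells[1], grade = cells[2] });
        }
        return results;
    }
}
```

Calling RowParser.Parse on the <TR> block from the question should give a single entry with code 176, sub "CHEMISTRY" and grade "A-". Regular expressions are still fragile on real-world HTML, so treat this as a shortcut for this one page layout.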


I know it's not really fair to answer my own question, but I'm going to do that.
First:
Download all results from web.
The query string is:

string url = "http://www.educationboardresults.gov.bd/arch/result.php?roll=" + roll + "&board=dhaka&exam=HSC&year=" + year;


Call it from a loop:

for (i = startfrom; i <= endat; i++)
{
    HSC_WEB hw = new HSC_WEB(i, year); // downloads in a separate thread
    hw.MyEvent += new HSC_WEB.ProgressDelegate(hw_Completed); // fires when the download finishes
    TotalReq += 1;
    ................
}
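
The HSC_WEB class itself is not shown here, so the following is only a rough stand-in for the download step, not the actual implementation. ResultDownloader, BuildUrl and DownloadRangeAsync are hypothetical names, and the sketch fetches the pages one after another with HttpClient rather than on separate threads.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical stand-in for the HSC_WEB class, which is not shown in the post.
public static class ResultDownloader
{
    private static readonly HttpClient Http = new HttpClient();

    // Builds the query URL for one roll number; board and exam are fixed as in the URL above.
    public static string BuildUrl(int roll, int year)
    {
        return "http://www.educationboardresults.gov.bd/arch/result.php?roll=" + roll
             + "&board=dhaka&exam=HSC&year=" + year;
    }

    // Downloads the result page for every roll in [startFrom, endAt], one at a time,
    // and hands each page to the callback (playing the role of hw_Completed).
    public static async Task DownloadRangeAsync(
        int startFrom, int endAt, int year, Action<int, string> onCompleted)
    {
        for (int roll = startFrom; roll <= endAt; roll++)
        {
            string html = await Http.GetStringAsync(BuildUrl(roll, year));
            onCompleted(roll, html);
        }
    }
}
```

Downloading sequentially is slower than one thread per request, but it keeps the load on the server low and makes it easy to see which roll number failed.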



Second: Parse the HTML text into plain text using a WebBrowser control. It removes all tags and comments.
Third: With that plain text (the sample text given in my question), the following class finds everything I wanted.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace HSC_RES_Downloader
{
    class stringsearch
    {
        public stringsearch(string BaseString)
        {
            this.BaseString = BaseString;
        }
        public string BaseString;
        public string getyear()
        {
            return getstring(BaseString, "HSC", "", "Result");
        }
        public string getroll()
        {
            return getstring(BaseString, "Roll No.", "", "Registration");
        }
        public string getregno()
        {
            return getstring(BaseString, "Registration No.", "", "Academic Session");
        }
        public string getSession()
        {
            return getstring(BaseString, "Academic Session", "", "Name");
        }
        public string getname()
        {
            return getstring(BaseString, "Academic Session", "Name", "Father's Name");
        }
        public string getfname()
        {
            return getstring(BaseString, "Father's Name", "", "Institute");
        }
        public string getInstitutenane()
        {
            return getstring(BaseString, "Institute Name", "", "Center");
        }
        public string getCenter()
        {
            return getstring(BaseString, "Center Name", "", "Student Group");
        }
        public string getGroup()
        {
            return getstring(BaseString, "Student Group", "", "Student Type");
        }
        public string getsType()
        {
            return getstring(BaseString, "Student Type", "", "Result");
        }
        public string getResult()
        {
            return getstring(BaseString, "Student Type", "Result", "GPA");
        }
        public string getGPA()
        {
            return getstring(BaseString, "GPA", "", "Subject-wise");
        }

        public List<List<string>> subjectsgpa()
        {
            List<string> sublist;

            string substr = "BENGALI,ENGLISH,SECRETARIAL MANAGEMENT,COMMERCIAL GEOGRAPHY," +
                    "STATISTICS,COMPUTER STUDIES,AGRICULTURE STUDIES," +
                    "PRINCIPLE OF BUSINESS,ACCOUNTING," +
                    "PHYSICS,CHEMISTRY,MATHEMATICS,BIOLOGY," +
                    "SOCIAL WELFARE,ISLAMIC HISTORY,ISLAMIC STUDIES,CIVICS";
            sublist = substr.Split(',').ToList();

            // Only search the part of the text that follows the table header.
            int ps1 = BaseString.IndexOf("Code Subject Grade/ Marks");
            substr = BaseString.Substring(ps1 + "Code Subject Grade/ Marks".Length);
            string gpa = string.Empty;
            List<List<string>> subgpas = new List<List<string>>();
            foreach (string SubName in sublist)
            {
                ps1 = substr.IndexOf(SubName);
                if (ps1 >= 0) // IndexOf returns -1 when the subject is absent; 0 is a valid position
                {
                    List<string> subgpa = new List<string>();
                    // Take up to 3 characters (a space plus a grade such as "A+") without running past the end.
                    int start = ps1 + SubName.Length;
                    gpa = substr.Substring(start, Math.Min(3, substr.Length - start));
                    subgpa.Add(gpa.Trim());
                    subgpa.Add(SubName);
                    subgpas.Add(subgpa);
                }
            }
            return subgpas;
        }

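        // Returns the trimmed text between str2 (searched for after str1) and endstring.
        public string getstring(string basestr, string str1, string str2, string endstring)
        {
            int ps1 = basestr.IndexOf(str1);                    // anchor that locates the field
            int ps2 = basestr.IndexOf(str2, ps1 + str1.Length); // label just before the value ("" means directly after str1)
            int ps3 = basestr.IndexOf(endstring, ps2);          // label that follows the value
            string ss = basestr.Substring(ps2 + str2.Length, ps3 - ps2 - str2.Length);
            return ss.Trim();
        }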

    }

}



I still have a lot of things to do. Any modification will be appreciated.
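
One possible modification: instead of searching for a fixed list of subject names, the subject lines of the plain text can be matched directly, since each one has the form "code subject grade". The sketch below assumes the plain text looks like the sample in the question, with one subject per line and letter grades such as A, A+ or A-. SubjectLineParser is a made-up name, and the result struct from the question is repeated so the snippet stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public struct result
{
    public int code;
    public string sub;
    public string grade;
}

// Hypothetical helper, not part of the class above.
public static class SubjectLineParser
{
    // A line of the form: digits, subject name, letter grade with optional + or -.
    // Only spaces and tabs are allowed between the parts, so a match never spans two lines.
    private static readonly Regex LinePattern = new Regex(
        @"^[ \t]*(\d+)[ \t]+(.+?)[ \t]+([A-F][+-]?)[ \t]*\r?$", RegexOptions.Multiline);

    public static List<result> Parse(string plainText)
    {
        var results = new List<result>();
        foreach (Match m in LinePattern.Matches(plainText))
        {
            results.Add(new result
            {
                code = int.Parse(m.Groups[1].Value),
                sub = m.Groups[2].Value,
                grade = m.Groups[3].Value
            });
        }
        return results;
    }
}
```

With the sample text from the question this should yield five entries, the first being 107 / ENGLISH / A. Lines such as "Roll No.  124450" are skipped because they do not start with a number.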


Also, try HtmlAgilityPack, which is very useful for parsing and processing HTML.
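
As a rough illustration of that suggestion, the rows from the question could be read with HtmlAgilityPack roughly as follows. This is a sketch that assumes the HtmlAgilityPack NuGet package is referenced; AgilityRowParser is a made-up name and the result struct is repeated from the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, not part of the .NET base library

public struct result
{
    public int code;
    public string sub;
    public string grade;
}

// Hypothetical helper showing one way to use the library.
public static class AgilityRowParser
{
    public static List<result> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<result>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return results;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("./td");
            if (cells == null || cells.Count != 3) continue;

            int code;
            // InnerText drops nested tags such as <STRONG>; DeEntitize decodes entities like &amp;.
            if (!int.TryParse(HtmlEntity.DeEntitize(cells[0].InnerText).Trim(), out code)) continue;

            results.Add(new result
            {
                code = code,
                sub = HtmlEntity.DeEntitize(cells[1].InnerText).Trim(),
                grade = HtmlEntity.DeEntitize(cells[2].InnerText).Trim()
            });
        }
        return results;
    }
}
```

Compared with scanning characters by hand, the library copes with attribute quoting, upper-case tags and stray whitespace, so the code only has to describe which cells it wants.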

