Parsing HTML text into a structure
Problem description
I'm having trouble parsing this HTML text into a structure.
I want to parse the below html text into this structure:
struct result
{
    public int code;
    public string sub;
    public string grade;
}
The assignment will be like this:
result.code = 176
result.sub = "CHEMISTRY"
result.grade = "A-"

<TR>
<TD bgColor=#fefefe align=middle><STRONG>176</STRONG></TD>
<TD bgColor=#fafafa width="70%" align=left><STRONG>CHEMISTRY</STRONG></TD>
<TD bgColor=#fefefe align=middle><STRONG>A- </STRONG></TD>
</TR>
Thanks to all.
Updated:
What am I trying to do?
I'm just trying to download all the results from a website and save them in a local database. That could be about 12k results, for my district only, not the whole country. I'm very close to finishing with my own code, but if I can find a proper or simpler way from CP, that would be a great help. I've already got as far as the GPA; now the subjects need to be parsed.
Do me a favor.
HSC 2010 Result Publication
Roll No. 124450
Registration No. 719662
Academic Session 2008-09
Name NASRIN LIPI
Father's Name MD. MOAZZAM HOSSAIN
Institute Name REBATI MOHAN UCHCHA MADHYAMIK BIDYALYA, SIDDHIRGONJ
Center Name NARAYANGANJ - 4, GOVT. ADAMJEENAGAR M. W. COLLEGE
Student Group SCIENCE
Student Type REGULAR
Result PASSED
GPA 5.00
Subject-wise Grade/ Mark Sheet
Code Subject Grade/ Marks
107 ENGLISH A
174 PHYSICS A+
176 CHEMISTRY A+
178 BIOLOGY A+
127 MATHEMATICS A+

Solution

Each table row can be parsed pretty easily.
(I am assuming that no <td> block is ever empty.)
It's not foolproof, should only be used with a list of <tr> items, and it's not debugged (I haven't got VS here).
// Split on the row tags. The sample markup uses upper-case <TR>, so both cases are listed;
// Split(string[]) needs the StringSplitOptions argument to compile.
string[] rows = HTML.Split(new string[] { "<tr>", "</tr>", "<TR>", "</TR>" }, StringSplitOptions.RemoveEmptyEntries);
List<result> results = new List<result>();
foreach (string row in rows)
{
    // Declaring a few temporary variables.
    string code = string.Empty;
    string sub = string.Empty;
    string grade = string.Empty;
    bool inTag = false;
    for (int i = 0; i < row.Length; i++)
    {
        if (row[i] == '<')
            inTag = true;
        else if (row[i] == '>')
            inTag = false;
        else if (!inTag && !char.IsWhiteSpace(row[i])) // inTag is true while we are between the < and > characters
        {
            // Get the text from the row, starting at i and stopping at the next occurrence of <
            int end = row.IndexOf('<', i);
            if (end < 0)
                end = row.Length;
            string text = row.Substring(i, end - i).Trim();
            if (code.Length == 0)       // is 'code' already defined?
                code = text;
            else if (sub.Length == 0)   // is 'sub' already defined?
                sub = text;
            else if (grade.Length == 0) // is 'grade' already defined?
                grade = text;
            i = end - 1; // prevent doubles; the loop's i++ then lands on the '<'
        }
    }
    if (code.Length != 0) // last check: skip fragments that contained no cells
        results.Add(new result { code = int.Parse(code), sub = sub, grade = grade });
}
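As an alternative to the character-by-character scan, the same three-cell rows can be pulled out with a regular expression. This is only a minimal sketch: the names SubjectResult and RowParser are made up for illustration, and it assumes every row of interest has exactly three <TD> cells with a numeric code in the first one. Regular expressions are fragile against arbitrary HTML, so treat it as a shortcut for this one table layout rather than a general parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical type mirroring the 'result' struct from the question.
public struct SubjectResult
{
    public int Code;
    public string Sub;
    public string Grade;
}

// Hypothetical helper, not part of the original answer.
public static class RowParser
{
    // One <TR> holding three <TD> cells; attributes on the tags are tolerated.
    private static readonly Regex RowPattern = new Regex(
        @"<tr\b[^>]*>\s*<td\b[^>]*>(.*?)</td>\s*<td\b[^>]*>(.*?)</td>\s*<td\b[^>]*>(.*?)</td>\s*</tr>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any remaining tag inside a cell, such as <STRONG>.
    private static readonly Regex AnyTag = new Regex("<[^>]+>");

    private static string CellText(string cellHtml)
    {
        return AnyTag.Replace(cellHtml, string.Empty).Trim();
    }

    public static List<SubjectResult> Parse(string html)
    {
        var results = new List<SubjectResult>();
        foreach (Match m in RowPattern.Matches(html))
        {
            int code;
            // Rows whose first cell is not a number (e.g. a header row) are skipped.
            if (!int.TryParse(CellText(m.Groups[1].Value), out code))
                continue;
            results.Add(new SubjectResult
            {
                Code = code,
                Sub = CellText(m.Groups[2].Value),
                Grade = CellText(m.Groups[3].Value)
            });
        }
        return results;
    }
}
```

Calling RowParser.Parse on the sample row from the question should give a single entry with Code 176, Sub "CHEMISTRY" and Grade "A-".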
I know it's not fair to answer my own question, but I'm going to do that.
First:
Download all results from web.
The query string is:
string url = "http://www.educationboardresults.gov.bd/arch/result.php?roll=" + roll + "&board=dhaka&exam=HSC&year=" + year;
Call it from a loop:
for (i = startfrom; i <= endat; i++)
{
    HSC_WEB hw = new HSC_WEB(i, year); // This will download in a separate thread and
    hw.MyEvent += new HSC_WEB.ProgressDelegate(hw_Completed); // fire when finished
    TotalReq += 1;
    ................
}
Second: Parse the HTML text to plain text using a WebBrowser control. It removes all tags and comments.
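If loading every page into a WebBrowser control turns out to be slow for thousands of results, the tag stripping can be approximated in code. The sketch below is an assumption of mine, not what the WebBrowser control does internally: the HtmlToText class is a hypothetical helper that removes comments, script and style blocks and tags with regular expressions, then decodes entities with System.Net.WebUtility.HtmlDecode. Its output may differ in whitespace from the text the browser produces.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class HtmlToText
{
    public static string Strip(string html)
    {
        // Drop comments and script/style blocks first, contents included.
        string text = Regex.Replace(
            html,
            @"<!--.*?-->|<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Replace every remaining tag with a space so neighbouring cells do not run together.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Decode entities such as &amp; and &nbsp;.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace, including the non-breaking space from &nbsp;.
        return Regex.Replace(text, @"[\s\u00A0]+", " ").Trim();
    }
}
```

The result is a single line of space-separated text, which is the shape the label-based search in the third step expects.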
Third: With that plain text (the sample text given in my question), the following class finds everything I wanted.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace HSC_RES_Downloader
{
    class stringsearch
    {
        public stringsearch(string BaseString)
        {
            this.BaseString = BaseString;
        }

        public string BaseString;

        // Each getter returns the text between a field label and the label that follows it.
        public string getyear() { return getstring(BaseString, "HSC", "", "Result"); }
        public string getroll() { return getstring(BaseString, "Roll No.", "", "Registration"); }
        public string getregno() { return getstring(BaseString, "Registration No.", "", "Academic Session"); }
        public string getSession() { return getstring(BaseString, "Academic Session", "", "Name"); }
        public string getname() { return getstring(BaseString, "Academic Session", "Name", "Father's Name"); }
        public string getfname() { return getstring(BaseString, "Father's Name", "", "Institute"); }
        public string getInstitutenane() { return getstring(BaseString, "Institute Name", "", "Center"); }
        public string getCenter() { return getstring(BaseString, "Center Name", "", "Student Group"); }
        public string getGroup() { return getstring(BaseString, "Student Group", "", "Student Type"); }
        public string getsType() { return getstring(BaseString, "Student Type", "", "Result"); }
        public string getResult() { return getstring(BaseString, "Student Type", "Result", "GPA"); }
        public string getGPA() { return getstring(BaseString, "GPA", "", "Subject-wise"); }

        // Returns one [grade, subject name] pair for every known subject found in the mark sheet.
        public List<List<string>> subjectsgpa()
        {
            List<string> sublist;
            string substr = "BENGALI,ENGLISH,SECRETARIAL MANAGEMENT,COMMERCIAL GEOGRAPHY," +
                            "STATISTICS,COMPUTER STUDIES,AGRICULTURE STUDIES," +
                            "PRINCIPLE OF BUSINESS,ACCOUNTING," +
                            "PHYSICS,CHEMISTRY,MATHEMATICS,BIOLOGY," +
                            "SOCIAL WELFARE,ISLAMIC HISTORY,ISLAMIC STUDIES,CIVICS";
            sublist = substr.Split(',').ToList();
            int ps1 = 0;
            ps1 = BaseString.IndexOf("Code Subject Grade/ Marks");
            substr = BaseString.Substring(ps1 + "Code Subject Grade/ Marks".Length);
            string gpa = string.Empty;
            List<List<string>> subgpas = new List<List<string>>();
            foreach (string SubName in sublist)
            {
                ps1 = substr.IndexOf(SubName);
                if (ps1 >= 0) // IndexOf returns -1 when the subject is absent
                {
                    List<string> subgpa = new List<string>();
                    // Read up to three characters after the name so that " A+" is not cut to " A",
                    // without running past the end of the text.
                    int start = ps1 + SubName.Length;
                    gpa = substr.Substring(start, Math.Min(3, substr.Length - start));
                    subgpa.Add(gpa.Trim());
                    subgpa.Add(SubName);
                    subgpas.Add(subgpa);
                }
            }
            return subgpas;
        }

        // Finds str1, then str2 after it, and returns the trimmed text between str2 and endstring.
        public string getstring(string basestr, string str1, string str2, string endstring)
        {
            int ps1 = 0, ps2 = 0, ps3 = 0;
            ps1 = basestr.IndexOf(str1);
            ps2 = basestr.IndexOf(str2, ps1 + str1.Length);
            ps3 = basestr.IndexOf(endstring, ps2);
            string ss = basestr.Substring(ps2 + str2.Length, ps3 - ps2 - str2.Length);
            ss = ss.Trim();
            return ss;
        }
    }
}
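The hard-coded subject list means a subject missing from the list is silently dropped. One way around that, sketched below under assumptions, is to match the "code, subject, grade" pattern directly in the text that follows the "Code Subject Grade/ Marks" header. The SubjectLines class is hypothetical, and the pattern assumes three-digit codes, upper-case subject names and letter grades from A to F with an optional + or -, as in the sample text from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SubjectLines
{
    private const string Header = "Code Subject Grade/ Marks";

    // Three-digit code, upper-case subject name, then a grade followed by whitespace or the end.
    private static readonly Regex Line = new Regex(
        @"\b(\d{3})\s+([A-Z][A-Z ]*?)\s+([A-F][+-]?)(?=\s|$)");

    // Returns one { code, subject, grade } array per subject line after the header.
    public static List<string[]> Parse(string plainText)
    {
        var rows = new List<string[]>();
        int start = plainText.IndexOf(Header, StringComparison.Ordinal);
        if (start < 0)
            return rows; // no mark sheet in this text

        string table = plainText.Substring(start + Header.Length);
        foreach (Match m in Line.Matches(table))
            rows.Add(new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value });
        return rows;
    }
}
```

The lookahead after the grade keeps a multi-word name such as "PRINCIPLE OF BUSINESS" from being cut at the "B" of "BUSINESS", because a grade letter only counts when it is followed by whitespace or the end of the text.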
I still have a lot of things to do. Any modification will be appreciated.
Also try HtmlAgilityPack, which is very useful for HTML parsing and processing.