HTML Data Extract In C++


Problem description

Hi all,

I have a project that reads emails (HTML format) and extracts certain information from them, such as reference numbers, amounts, etc.
Once I have retrieved an email, I store it in a char buffer.
The email contains all the HTML tags, etc. See the sample below.

I would like to know how I can extract the HTML data without the HTML tags.
E.g.:
<HTML>Hello World</HTML>
I want to extract the "Hello World" part.

I thought of examining each character and discarding everything that falls between the angle brackets '<' and '>', so that only the other data remains.

Is this the most efficient method? We expect high volumes of emails.

Thanks in advance.
_____

<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head>
<body lang=EN-US link=blue vlink=purple>
<div class=WordSection1><p class=MsoNormal>
<o:p>&nbsp;</o:p></p>
<div align=center>
<table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=720 style='width:540.0pt'>
<tr style='height:129.75pt'>
<td style='padding:0cm 0cm 0cm 0cm;height:129.75pt'>
<p class=MsoNormal>
<img width=720 height=173 id="_x0000_i1026" src="cid:image001.jpg@01CB8683.F336E550" alt="Standard Bank"><o:p></o:p></p></td>
</tr><tr><td width=718 style='width:538.5pt;background:#2E77BA;padding:0cm .75pt 0cm .75pt'>
<div align=center><table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=705 style='width:528.75pt'><tr>
<td style='background:white;padding:7.5pt 7.5pt 7.5pt 7.5pt'><p><b>
<span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:navy'><br></span></b>
<strong>
<span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:navy'>Business Online deposit received</span>
</strong><o:p></o:p></p><p class=MsoNormal>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Dear </span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#4F81BD'>&lt;&lt;preferredName&gt;&gt;</span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'><br>
<br>A deposit has been received for your Standard Bank account number </span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#4F81BD'>&lt;&lt;ACC NO&gt;&gt;</span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>.<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>
<o:p>&nbsp;</o:p></span></p><p class=MsoNormal>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>The details are as follows:<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'><o:p>&nbsp;</o:p></span></p>
<table class=MsoNormalTable border=0 cellspacing=1 cellpadding=0 width="95%" style='width:95.42%;background:#3D5378'>
<tr><td width="10%" valign=top style='width:10.46%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Currency<o:p></o:p></span>
</b></p></td><td width="21%" valign=top style='width:21.84%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Amount<o:p></o:p>
</span></b></p></td><td style='background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal><b>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Value Date</span></b><o:p></o:p></p>
</td><td width="26%" valign=top style='width:26.52%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>
<b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Reference<o:p></o:p></span></b></p></td>
<td width="24%" style='width:24.56%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal><b>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Message ID</span></b><o:p></o:p></p></td>
</tr><tr style='height:12.1pt'>
<td width="10%" valign=top style='width:10.46%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'>
<p class=MsoNormal align=right style='text-align:right'><span style='font-family:"Arial","sans-serif"'>R<o:p>
</o:p></span></p></td>
<td width="21%" valign=top style='width:21.84%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'>
<p class=MsoNormal><span style='font-family:"Arial","sans-serif"'>2860.00<o:p></o:p></span></p></td>
<td style='background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'><p class=MsoNormal>



[edit]tried to fix the formatting, but something seems amiss[/edit]
[edit2] fixed the formatting [/edit2]




Recommended answer

This does exactly what you suggest, but see the space problem with your sample data :(
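#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>

// Returns true for every character that belongs to a tag, including the
// '<' and '>' delimiters. The static counter carries state between calls,
// so this function is not thread-safe.
bool InTag(char c)
{
    static int bracket = 0;
    switch (c)
    {
    case '<':
        ++bracket;
        break;
    case '>':
        --bracket;
        return true;
    }
    return bracket > 0;
}

int main()
{
    std::ifstream in("Test.htm");
    std::ostringstream oss;
    // Copy every character for which InTag returns false.
    // Note: std::istream_iterator<char> reads with operator>>, which skips
    // whitespace, so spaces and line breaks never reach the output.
    std::remove_copy_if(std::istream_iterator<char>(in), std::istream_iterator<char>(),
                        std::ostream_iterator<char>(oss), InTag);
    std::cout << oss.str() << std::endl;
    return 0;
}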


cheers,
AR
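A note on the whitespace in the output: `std::istream_iterator<char>` skips whitespace while reading, whereas `std::istreambuf_iterator<char>` reads the raw characters. The following is only a minimal sketch of a whitespace-preserving variant, not the original poster's code. The function name `StripTagsKeepSpaces` is made up for illustration, and it uses a simple in-tag flag instead of the bracket counter above.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original answer): drops everything
// between '<' and '>' and keeps all other characters, including whitespace.
std::string StripTagsKeepSpaces(std::istream& in)
{
    std::string out;
    bool inTag = false;
    // istreambuf_iterator reads unformatted characters, so spaces survive.
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it)
    {
        const char c = *it;
        if (c == '<')
            inTag = true;       // a tag starts: stop copying
        else if (c == '>')
            inTag = false;      // the tag ends: resume copying
        else if (!inTag)
            out += c;           // ordinary text outside any tag
    }
    return out;
}
```

It can be called with any input stream, for example a `std::ifstream` opened on the saved email or a `std::istringstream` wrapped around the char buffer.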


One of the fastest and easiest approaches is to implement exactly what you propose, in a very straightforward way. The rules you defined are to stop when a '<' is encountered and to start again when a '>' is encountered, so that is what you should do :)

Simply scan through the string: when you encounter a '<', stop copying, and when you encounter a '>', start copying again.

The pseudocode would be something like this:
while not end of string {
   while curchar != '<' and not end of string {
      copy character
      move to next character
   }
   move to next character /* skip '<' character */
   while curchar != '>' and not end of string {
      move to next character
   }
   move to next character /* skip '>' character */
}



Good luck!
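For reference, the scan described in this answer could be written in C++ roughly as follows. This is a sketch under assumptions: the email is already held in a `std::string`, and the function name `ExtractText` is hypothetical. It uses `std::string::find` to jump between brackets instead of stepping one character at a time, and it does nothing about comments, scripts, or entities such as `&nbsp;`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper illustrating the scan: copy the text outside tags.
std::string ExtractText(const std::string& html)
{
    std::string text;
    text.reserve(html.size());
    std::string::size_type pos = 0;
    while (pos < html.size())
    {
        // Copy everything up to the next '<' (or up to the end of the string).
        const std::string::size_type open = html.find('<', pos);
        if (open == std::string::npos)
        {
            text.append(html, pos, std::string::npos);
            break;
        }
        text.append(html, pos, open - pos);

        // Skip the tag itself and continue just past its closing '>'.
        const std::string::size_type close = html.find('>', open + 1);
        if (close == std::string::npos)
            break;              // unterminated tag: discard the remainder
        pos = close + 1;
    }
    return text;
}
```

Each character is visited a bounded number of times, so the cost grows linearly with the size of the email.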





Thanks for the suggestions, I implemented the following.


<pre lang="cs">void CBank::Extract(CString HTML)
{
    char *Buffer;
    int BufferSize = 0;
    char *Temp;
    int StartPos = 0;
    int EndPos = 0;
    int TempSize = 0;

    // Copy the message into a zero-terminated working buffer.
    BufferSize = HTML.GetLength();
    Buffer = new char[BufferSize + 1];
    memset(Buffer, 0, BufferSize + 1);
    memcpy(Buffer, HTML.GetBuffer(), BufferSize);

    for (int i = 0; i < BufferSize; i++)
    {
        if (Buffer[i] == '<')
        {
            // Skip ahead to the '>' that closes this tag.
            i++;
            for (int k = i; k < BufferSize; k++)
            {
                if (Buffer[k] == '>')
                {
                    i = k;
                    break;
                }
            }
        }
        // A '>' followed by something other than another tag or a
        // carriage return marks the start of a run of text.
        if ((Buffer[i] == '>') && (Buffer[i+1] != '<') && (Buffer[i+1] != 0x0d))
        {
            StartPos = 0;
            EndPos = 0;
            i++; // Buffer[i] was '>', so step to the first text character
            StartPos = i;
            for (int j = i; j < BufferSize; j++)
            {
                if (Buffer[j] == '<') // Found the start of the next tag
                {
                    i = j;
                    EndPos = j;
                    TempSize = EndPos - StartPos;
                    Temp = new char[TempSize + 1];
                    memset(Temp, 0, TempSize + 1);
                    memcpy(Temp, &Buffer[StartPos], TempSize);
                    // Temp now holds the text between the two tags.
                    delete []Temp;
                    break;
                }
            }
        }
    }
    if (Buffer)
    {
        delete []Buffer;
        Buffer = NULL;
    }
}</pre>
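
Since the goal is to pick out individual values such as the amount or the reference, it may help to keep each run of text between tags as a separate item instead of freeing it straight away. The following is only a sketch of that idea using `std::string` and `std::vector` in place of manual `new`/`delete`. The function name `CollectTextRuns` is hypothetical, and the input is assumed to be the email already converted to a `std::string`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns each non-blank run of text found between
// tags, with leading and trailing whitespace removed.
std::vector<std::string> CollectTextRuns(const std::string& html)
{
    const std::string::size_type npos = std::string::npos;
    const char* const blanks = " \t\r\n";
    std::vector<std::string> runs;
    std::string::size_type pos = 0;
    while (pos < html.size())
    {
        // The current run ends at the next '<' or at the end of the input.
        const std::string::size_type open = html.find('<', pos);
        const std::string::size_type end = (open == npos) ? html.size() : open;

        // Trim the run; skip it when it holds nothing but whitespace.
        const std::string::size_type first = html.find_first_not_of(blanks, pos);
        if (first != npos && first < end)
        {
            const std::string::size_type last = html.find_last_not_of(blanks, end - 1);
            runs.push_back(html.substr(first, last - first + 1));
        }

        if (open == npos)
            break;              // no more tags
        const std::string::size_type close = html.find('>', open + 1);
        if (close == npos)
            break;              // unterminated tag: stop here
        pos = close + 1;        // continue after the tag
    }
    return runs;
}
```

With table markup like the sample email, the cell contents then arrive as consecutive entries that can be matched against the column headings by position.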

