HTML Data Extract In C++

Views: 75 Published: 2019/6/22 0:10:37 C++

Problem description

Hi All,

I have a project that reads an email (HTML format) and extracts certain information from it, such as reference numbers, amounts, etc.
Once I have retrieved the email, I store it in a char buffer.
The email contains all the HTML tags etc. See below.

I would like to know how I can extract the HTML data and not the HTML tags.
E.g.:
<HTML>Hello World</HTML>
I want to extract the "Hello World" part.

I thought of comparing each character and, if a character lies between the angle brackets '<' and '>', discarding it, so that I would be left with all the other data.

Is this the most efficient method? We expect high volumes of emails.

Thanks in advance.
_____

<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head>
<body lang=EN-US link=blue vlink=purple>
<div class=WordSection1><p class=MsoNormal>
<o:p>&nbsp;</o:p></p>
<div align=center>
<table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=720 style='width:540.0pt'>
<tr style='height:129.75pt'>
<td style='padding:0cm 0cm 0cm 0cm;height:129.75pt'>
<p class=MsoNormal>
<img width=720 height=173 id="_x0000_i1026" src="cid:image001.jpg@01CB8683.F336E550" alt="Standard Bank"><o:p></o:p></p></td>
</tr><tr><td width=718 style='width:538.5pt;background:#2E77BA;padding:0cm .75pt 0cm .75pt'>
<div align=center><table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=705 style='width:528.75pt'><tr>
<td style='background:white;padding:7.5pt 7.5pt 7.5pt 7.5pt'><p><b>
<span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:navy'><br></span></b>
<strong>
<span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:navy'>Business Online deposit received</span>
</strong><o:p></o:p></p><p class=MsoNormal>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Dear </span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#4F81BD'>&lt;&lt;preferredName&gt;&gt;</span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'><br>
<br>A deposit has been received for your Standard Bank account number </span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#4F81BD'>&lt;&lt;ACC NO&gt;&gt;</span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>.<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>
<o:p>&nbsp;</o:p></span></p><p class=MsoNormal>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>The details are as follows:<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'><o:p>&nbsp;</o:p></span></p>
<table class=MsoNormalTable border=0 cellspacing=1 cellpadding=0 width="95%" style='width:95.42%;background:#3D5378'>
<tr><td width="10%" valign=top style='width:10.46%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Currency<o:p></o:p></span>
</b></p></td><td width="21%" valign=top style='width:21.84%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Amount<o:p></o:p>
</span></b></p></td><td style='background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal><b>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Value Date</span></b><o:p></o:p></p>
</td><td width="26%" valign=top style='width:26.52%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>
<b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Reference<o:p></o:p></span></b></p></td>
<td width="24%" style='width:24.56%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal><b>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Message ID</span></b><o:p></o:p></p></td>
</tr><tr style='height:12.1pt'>
<td width="10%" valign=top style='width:10.46%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'>
<p class=MsoNormal align=right style='text-align:right'><span style='font-family:"Arial","sans-serif"'>R<o:p>
</o:p></span></p></td>
<td width="21%" valign=top style='width:21.84%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'>
<p class=MsoNormal><span style='font-family:"Arial","sans-serif"'>2860.00<o:p></o:p></span></p></td>
<td style='background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'><p class=MsoNormal>

[edit]tried to fix the formatting, but something seems amiss[/edit]
[edit2] fixed the formatting [/edit2]

Recommended answer

This does exactly what you suggest, but see the space problem with your sample data :(

bool InTag(char c)
{
    static int bracket = 0;
    switch (c)
    {
    case '<':
        ++bracket;
        break;
    case '>':
        --bracket;
        return true;
    }
    return bracket > 0;
}

#include <fstream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <iostream>
int main()
{
    std::ifstream in("Test.htm");
    std::ostringstream oss;
    std::remove_copy_if(std::istream_iterator<char>(in), std::istream_iterator<char>(), std::ostream_iterator<char>(oss), InTag);
    std::cout << oss.str() << std::endl;
    return 0;
}

cheers,
AR

One of the fastest and easiest approaches is to implement it in a very straightforward way, if what you propose is what you need. The rules you defined are to stop when a '<' is encountered and to start when a '>' is encountered, so that is what you should do :)

Simply scan through the string: when you encounter a '<' you stop copying, and when you encounter a '>' it is time to start copying again.

The pseudo code would be something like this:
while not end of string {
   while not curchar == '<' and not end of string {
      copy character
      move to next character
   }
   move to next character /* skip '<' character */
   while not curchar == '>' and not end of string {
      move to next character
   }
   move to next character /* skip '>' character */
}

Good luck!
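
The pseudo code above could be sketched in C++ roughly as follows. This is an illustration rather than a definitive implementation: it assumes the email body is held in a `std::string`, and the name `StripTags` is made up. The two inner loops are replaced by `std::string::find`, which does the same scanning.

```cpp
#include <string>

// Hypothetical helper following the pseudo code: copy up to the next '<',
// then skip to just past the next '>'.
std::string StripTags(const std::string& html)
{
    std::string text;
    text.reserve(html.size());
    std::string::size_type pos = 0;
    while (pos < html.size())
    {
        // Copy everything up to the next '<' (or to the end of the string).
        const std::string::size_type open = html.find('<', pos);
        if (open == std::string::npos)
        {
            text.append(html, pos, std::string::npos);
            break;
        }
        text.append(html, pos, open - pos);

        // Skip the tag itself; an unterminated tag swallows the rest.
        const std::string::size_type close = html.find('>', open + 1);
        if (close == std::string::npos)
            break;
        pos = close + 1;
    }
    return text;
}
```

For the example in the question, `StripTags("<HTML>Hello World</HTML>")` returns `Hello World`. Each character is visited once, so the cost grows linearly with the size of the email.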

Thanks for the suggestions, I implemented the following.

<pre lang="cs">void CBank::Extract(CString HTML)
{
    char *Buffer;
    int BufferSize = 0;
    char *Temp;
    int StartPos = 0;
    int EndPos = 0;
    int TempSize = 0;

    // Copy the CString into a zero-terminated char buffer.
    BufferSize = HTML.GetLength();
    Buffer = new char[BufferSize + 1];
    memset(Buffer,0,BufferSize + 1);
    memcpy(Buffer,HTML.GetBuffer(),BufferSize);

    for (int i=0;i<BufferSize;i++)
    {
        if (Buffer[i] == '<')
        {
            // Start of a tag: jump to its closing '>'.
            i++;
            for (int k = i; k < BufferSize; k++)
            {
                if(Buffer[k] == '>')
                {
                    i = k;
                    break;
                }
            }
        }
        // End of a tag that is followed by text rather than by another tag or a carriage return.
        if ((Buffer[i] == '>') && (Buffer[i+1] != '<') && (Buffer[i+1] != 0x0d))
        {
            StartPos = 0;
            EndPos = 0;
            i++; // step past the '>'
            StartPos = i;
            for (int j = i; j < BufferSize; j++)
            {
                if (Buffer[j] == '<') // Found the start of the next tag
                {
                    i = j;
                    EndPos = j;
                    TempSize = EndPos - StartPos;
                    Temp = new char[TempSize + 1];
                    memset(Temp,0,TempSize + 1);
                    memcpy(Temp,&Buffer[StartPos],TempSize);
                    // Temp now holds one piece of text; use it here, before it is freed.
                    delete []Temp;
                    break;
                }
            }
        }
    }
    if (Buffer)
    {
        delete []Buffer;
        Buffer = NULL;
    }
}</pre>
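
The implementation above copies each piece of text into `Temp` and then frees it straight away. A possible way to keep the pieces, so that values such as the amount or the reference can be picked out afterwards, is to collect them in a container. The sketch below assumes the email body has already been converted to a `std::string`; the name `ExtractTextSegments` is hypothetical and not part of the original code.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns each run of text found between tags,
// trimmed of surrounding whitespace; runs that are empty after trimming
// are dropped.
std::vector<std::string> ExtractTextSegments(const std::string& html)
{
    std::vector<std::string> segments;
    const char* const whitespace = " \t\r\n";
    std::string::size_type pos = 0;
    while (pos < html.size())
    {
        std::string::size_type open = html.find('<', pos);
        if (open == std::string::npos)
            open = html.size();

        // Trim the run [pos, open) and keep it if anything is left.
        const std::string::size_type first = html.find_first_not_of(whitespace, pos);
        if (first != std::string::npos && first < open)
        {
            const std::string::size_type last = html.find_last_not_of(whitespace, open - 1);
            segments.push_back(html.substr(first, last - first + 1));
        }

        // Skip the tag; stop if there is no further '>'.
        const std::string::size_type close = html.find('>', open);
        if (close == std::string::npos)
            break;
        pos = close + 1;
    }
    return segments;
}
```

Using `std::string` and `std::vector` also removes the manual `new[]`/`delete[]` pairs. Entities such as `&nbsp;` or `&lt;` are not decoded here and would appear as segments of their own.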
