Parsing the information of a URL out of HTML <a></a> tags in C

Problem description

My application gets, as part of its data, a large HTML-formatted file that contains a large number of links. It looks like what you would get if you searched for anything on Google, Yahoo, or another search engine: a list of URLs, each with a description or other text.

I've been trying to come up with a function that can parse the URL and the description and save them into a text file, but it has proven hard, at least for me. So, if I have:

<a href="http://www.w3schools.com">Visit W3Schools</a>

I would parse http://www.w3schools.com and Visit W3Schools and save them in a file.

Is there any way to achieve this in plain C?
Any help is appreciated.

Recommended answer

You really need a proper html parser, but for something quick and dirty, try:

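#include <stdbool.h>
#include <string.h>

/* Finds the next <a ... href="...">...</a> in *data. On success, *url and
   *desc point into the original buffer, which is modified in place. */
bool get_url(char **data, char **url, char **desc)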
{
  bool result = false;
  char *ptr = strstr(*data, "<a");

  if(NULL != ptr)
  {
    *data = ptr + 2;

    ptr = strstr(*data, "href=\"");
    if(NULL != ptr)
    {
      *data = ptr + 6;
      *url = *data;

      ptr = strchr(*data, '"');
      if(NULL != ptr)
      {
        *ptr = '\0';
        *data = ptr + 1;

        ptr = strchr(*data, '>');
        if(NULL != ptr)
        {
          *data = ptr + 1;
          *desc = *data;

          ptr = strstr(*data, "</a>");
          if(NULL != ptr)
          {
            *ptr = '\0';
            *data = ptr + 4;
            result = true;
          }
        }
      }
    }
  }

  return result;
}

Note that data gets updated to point past the data already parsed (it's an in-out parameter) and that the string passed in gets modified. I'm feeling lazy/too busy to do a full solution with memory-allocated return strings.
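For comparison, here is a minimal sketch of what a version with allocated return strings might look like. It is not the answerer's code: the names get_url_copy and copy_span are hypothetical, and it assumes the same quick-and-dirty matching rules (a literal "<a", a double-quoted href, and a literal "</a>"). Unlike get_url, it leaves the input untouched and only advances *data when a complete link was found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd, NUL-terminated copy of the
   len bytes starting at start, or NULL if allocation fails. */
char *copy_span(const char *start, size_t len)
{
  char *copy = malloc(len + 1);
  if (copy != NULL) {
    memcpy(copy, start, len);
    copy[len] = '\0';
  }
  return copy;
}

/* Hypothetical variant of get_url(): the input is never modified, and
   on success *url and *desc are heap copies the caller must free().
   *data is advanced only when a complete link was found. */
bool get_url_copy(const char **data, char **url, char **desc)
{
  const char *tag_open, *href, *url_end, *tag_end, *tag_close;

  assert(data != NULL && *data != NULL && url != NULL && desc != NULL);

  tag_open = strstr(*data, "<a");
  if (tag_open == NULL) return false;
  href = strstr(tag_open + 2, "href=\"");
  if (href == NULL) return false;
  href += 6;                              /* skip past href=" */
  url_end = strchr(href, '"');
  if (url_end == NULL) return false;
  tag_end = strchr(url_end + 1, '>');
  if (tag_end == NULL) return false;
  tag_close = strstr(tag_end + 1, "</a>");
  if (tag_close == NULL) return false;

  *url = copy_span(href, (size_t)(url_end - href));
  *desc = copy_span(tag_end + 1, (size_t)(tag_close - (tag_end + 1)));
  if (*url == NULL || *desc == NULL) {    /* allocation failed */
    free(*url);
    free(*desc);
    *url = *desc = NULL;
    return false;
  }

  *data = tag_close + 4;                  /* continue after </a> */
  return true;
}
```

The caller owns both strings and should free() them after use. Because nothing is written into the buffer, this variant also works on string literals and read-only data.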

Also, you probably ought to return errors at each closing brace in the cascade (except the first one), which is partly why I stacked them up like that. There are other, neater solutions that can be adapted to be more generic.
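One possible way to report those errors is sketched below, using early returns instead of nested braces. This is an assumption about what the answerer had in mind: it reads "except the first one" as meaning that a missing "<a" is simply the normal end of input, while a failure at any later step is a malformed link. The enum link_status, its values, and next_link are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical status codes; the names are not from the answer. */
enum link_status {
  LINK_FOUND,      /* *url and *desc point at a complete link */
  LINK_NONE,       /* no further "<a" in the input: normal end */
  LINK_MALFORMED   /* an "<a" was found but the rest did not match */
};

/* Same in-place parsing as get_url(), but each failed step after the
   first is reported as LINK_MALFORMED rather than a plain false. */
enum link_status next_link(char **data, char **url, char **desc)
{
  char *ptr;

  assert(data != NULL && *data != NULL && url != NULL && desc != NULL);

  ptr = strstr(*data, "<a");
  if (ptr == NULL) return LINK_NONE;
  *data = ptr + 2;

  ptr = strstr(*data, "href=\"");
  if (ptr == NULL) return LINK_MALFORMED;
  *url = ptr + 6;

  ptr = strchr(*url, '"');
  if (ptr == NULL) return LINK_MALFORMED;
  *ptr = '\0';                 /* terminate the URL in place */
  *data = ptr + 1;

  ptr = strchr(*data, '>');
  if (ptr == NULL) return LINK_MALFORMED;
  *desc = ptr + 1;

  ptr = strstr(*desc, "</a>");
  if (ptr == NULL) return LINK_MALFORMED;
  *ptr = '\0';                 /* terminate the description in place */
  *data = ptr + 4;
  return LINK_FOUND;
}
```

A caller can then stop quietly on LINK_NONE and log or abort on LINK_MALFORMED, which the bool version cannot distinguish.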

So basically you then call the function repeatedly until it returns false.
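The calling loop might look roughly like the sketch below. To keep the block self-contained it repeats get_url in a condensed, early-return form with the same behavior. The helper save_links and its tab-separated output format are assumptions for illustration, not part of the answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Condensed copy of get_url() with the same behavior: it writes '\0'
   terminators into the buffer and advances *data as it parses. */
bool get_url(char **data, char **url, char **desc)
{
  char *ptr = strstr(*data, "<a");
  if (ptr == NULL) return false;
  *data = ptr + 2;

  ptr = strstr(*data, "href=\"");
  if (ptr == NULL) return false;
  *data = ptr + 6;
  *url = *data;

  ptr = strchr(*data, '"');
  if (ptr == NULL) return false;
  *ptr = '\0';
  *data = ptr + 1;

  ptr = strchr(*data, '>');
  if (ptr == NULL) return false;
  *data = ptr + 1;
  *desc = *data;

  ptr = strstr(*data, "</a>");
  if (ptr == NULL) return false;
  *ptr = '\0';
  *data = ptr + 4;
  return true;
}

/* Hypothetical helper: writes one "url<TAB>description" line per link
   to out and returns the number of links written. html must be a
   writable, NUL-terminated buffer because get_url() modifies it. */
int save_links(char *html, FILE *out)
{
  char *cursor = html;
  char *url = NULL;
  char *desc = NULL;
  int count = 0;

  assert(html != NULL && out != NULL);

  /* Call get_url() repeatedly until it returns false. */
  while (get_url(&cursor, &url, &desc)) {
    fprintf(out, "%s\t%s\n", url, desc);
    count++;
  }
  return count;
}
```

In a real program, out would typically be a file opened with fopen in write mode, and html would be the file contents read into a heap buffer rather than a string literal, since the parser writes into it.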
