Parse URLs out of an HTML page


Question

I have a string containing an HTML page downloaded via WinHttpReadData. The string is a simple char*.
I've been trying to figure out a way to extract only the URLs that are on that page. To give you an example, imagine you search Google for the word WinHTTP and you are presented with an HTML page full of links. I now need to go through each link, extract it and save it to a file.

I tried searching for HREF, http:// and other keywords and then extracting the string all the way up to the </a>, but it's not really working. It would also be nice to get the description out of that link (for example, from <a href="http://someurl.com/somepage.html">some text</a>, get "some text"), but that's not as important as the URL itself.

The tricky thing here is that I cannot use 3rd party libraries, since I don't want to have to deal with licenses and the like.

Any ideas on how to do this? Does WinHTTP provide a way to do it? It has to be in C (not C++).

Thanks for your help.

Recommended answer

Maybe you should go for the PCRE C API (available on the PCRE site).

The regex you'll need will look something like this:

<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>

This should map to the named groups <url> and <name> within the match's group structure.
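For comparison, here is a rough sketch of the URL part of that match done in plain C with only the standard library, in case PCRE itself counts as a 3rd party library you want to avoid. It is not the PCRE approach above, just a hand-written scan. The helper names next_href and save_links are made up for this example, and the code assumes the page is a NUL-terminated char* and that href values are quoted. It does not parse HTML properly: it will also pick up href attributes on tags other than <a> (such as <link>), it skips unquoted values, and it does not extract the link text.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Case-insensitive check that `s` starts with `prefix`. */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Hypothetical helper: find the next href="..." or href='...' at or after
 * *cursor, copy the quoted value into `out` (truncating if it does not fit)
 * and advance *cursor past it.
 * Returns 1 if a URL was copied, 0 when there are no more.
 */
int next_href(const char **cursor, char *out, size_t out_size)
{
    const char *p = *cursor;

    if (out_size == 0)
        return 0;

    for (; *p; p++) {
        const char *q, *end;
        char quote;
        size_t len;

        if (!starts_with_ci(p, "href"))
            continue;

        q = p + 4;                              /* skip "href" */
        while (isspace((unsigned char)*q)) q++;
        if (*q != '=')
            continue;
        q++;
        while (isspace((unsigned char)*q)) q++;
        if (*q != '"' && *q != '\'')
            continue;                           /* unquoted values are skipped */

        quote = *q++;
        end = strchr(q, quote);                 /* matching closing quote */
        if (end == NULL)
            continue;                           /* unterminated attribute */

        len = (size_t)(end - q);
        if (len >= out_size)
            len = out_size - 1;                 /* truncate to fit the buffer */
        memcpy(out, q, len);
        out[len] = '\0';

        *cursor = end + 1;
        return 1;
    }

    *cursor = p;
    return 0;
}

/* Write every quoted href value found in `html` to `fp`, one per line.
 * Returns the number of URLs written. */
int save_links(const char *html, FILE *fp)
{
    char url[2048];
    int count = 0;

    while (next_href(&html, url, sizeof url)) {
        fprintf(fp, "%s\n", url);
        count++;
    }
    return count;
}
```

To use it, pass the buffer you filled from WinHttpReadData (once you have NUL-terminated it) and a FILE* opened with fopen to save_links. If you need the link text as well, the regex with its <name> group is the more direct route.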
