GOOGLE NEWS PARSER


Problem description


I realized I am an idiot. I wasted the whole Saturday on this crazy regex crap and I
can't even look at it anymore... uff.
Honestly, I am even willing to pay someone who is able to solve this.
Please help. I really used all my knowledge on this and I cannot get it to
work.

Let me explain:

I need to parse the URL, URL TEXT and DESCRIPTION TEXT from GOOGLE NEWS.

Every headline in Google News has the same structure, and it looks like this:

<a class="y"
href="http://news.google.com/url?ntc=05SA0&q=http://www.canada.com/sports/story.html%3Fid%3D5FBD7D23-AA7A-4E4C-AE1D-01CEBC350782"> Everyone expected
Ullrich to show up - instead it was teammate Kloden</a><br><font size="-1"
style="font-family: arial,sans-serif"><b>
<font color="#6f6f6f" style="font-family: arial,sans-serif">Canada.com&nbsp;-</font>15&nbsp;minutes&nbsp;ago</b><br>
BESANCON, France (AP) - Look for a German cyclist to be on the Tour de
France podium Sunday - just not the one most people expected. <br>

Go to, let's say, http://news.google.com/news/en/us/sports.html and look up
the source if needed.

So, this is my parser (function) for grabbing links. It works (thank God),
and with this function I am able to extract the URL links of all 20 major
headlines:

Public Function ParseLinks(ByVal HTML As String) As ArrayList
    Dim objRegEx As System.Text.RegularExpressions.Regex
    Dim objMatch As System.Text.RegularExpressions.Match
    Dim arrLinks As New System.Collections.ArrayList

    ' The pattern was wrapped over three lines in the post; it is rejoined here.
    ' Group 1 captures the href value of each headline link.
    objRegEx = New System.Text.RegularExpressions.Regex( _
        "(?:y [hH][rR][eE][fF]\s*=)(?:[\s""']*)(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)(.*?)(?:[\s>""'])", _
        System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
        System.Text.RegularExpressions.RegexOptions.Compiled)

    objMatch = objRegEx.Match(HTML)

    ' Walk through every match and collect the captured URL.
    While objMatch.Success
        Dim strMatch As String
        strMatch = objMatch.Groups(1).ToString
        arrLinks.Add(strMatch)
        objMatch = objMatch.NextMatch()
    End While

    Return arrLinks
End Function

Now, as you probably guessed already, my problem is that I am simply not able
to write the same kind of function for extracting the URL TEXT and
DESCRIPTION TEXT.

Please help.

K.

Solution

Krakatioison,
What do you mean by URL, URL text and description text? Show me which
portions you need.
Jared


Hi Jared,
So when you look at this Google HTML:

<a class="y"
href="http://news.google.com/url?ntc=05SA0&q=http://www.canada.com/sports/story.html%3Fid%3D5FBD7D23-AA7A-4E4C-AE1D-01CEBC350782"> Everyone expected
Ullrich to show up - instead it was teammate Kloden</a><br><font size="-1"
style="font-family: arial,sans-serif"><b>
<font color="#6f6f6f" style="font-family: arial,sans-serif">Canada.com&nbsp;-</font>15&nbsp;minutes&nbsp;ago</b><br>
BESANCON, France (AP) - Look for a German cyclist to be on the Tour de
France podium Sunday - just not the one most people expected. <br>

This is what I need to get:

LINK:
http://news.google.com/url?ntc=05SA0...D-01CEBC350782
LINK TEXT: Everyone expected Ullrich to show up - instead it was teammate
Kloden
DESCRIPTION: BESANCON, France (AP) - Look for a German cyclist to be on the
Tour de France podium Sunday - just not the one most people expected.

That is what I need.
My parser gets the LINK.

But I also need to get the LINK TEXT and the DESCRIPTION.

God bless you, men, if you can extract them from the code.

K.


Guys,
I'll pay us
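
One possible approach, offered as a sketch rather than a tested answer: instead of three separate functions, a single pattern with named groups can capture the link, the link text and the description in one pass. The pattern below is written only against the sample block quoted in the question. It assumes that every headline is an `<a class="y" href=...>` anchor, that the source/age line ends with `</b><br>`, and that the description runs up to the next `<br>`. The names `Headline`, `HeadlineParser`, `ParseHeadlines` and `Clean` are made up for this illustration, and `WebUtility.HtmlDecode` needs .NET 4.0 or later.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Net
Imports System.Text.RegularExpressions

' Hypothetical container for one parsed headline.
Public Class Headline
    Public Url As String
    Public LinkText As String
    Public Description As String
End Class

Public Module HeadlineParser

    ' Assumed layout: <a class="y" href="URL">TEXT</a> ... </b><br> DESCRIPTION <br>
    ' Singleline lets "." match line breaks, because the markup spans several lines.
    Private ReadOnly HeadlinePattern As New Regex( _
        "<a\s+class\s*=\s*[""']?y[""']?\s+href\s*=\s*[""']?(?<url>[^""'\s>]+)[""']?\s*>" & _
        "(?<text>.*?)</a>" & _
        ".*?</b>\s*<br>" & _
        "(?<desc>.*?)<br>", _
        RegexOptions.IgnoreCase Or RegexOptions.Singleline Or RegexOptions.Compiled)

    ' Collapses whitespace runs (including line breaks) to one space,
    ' decodes HTML entities such as &nbsp; and trims the ends.
    Private Function Clean(ByVal raw As String) As String
        Return WebUtility.HtmlDecode(Regex.Replace(raw, "\s+", " ")).Trim()
    End Function

    Public Function ParseHeadlines(ByVal html As String) As List(Of Headline)
        Dim results As New List(Of Headline)
        Dim m As Match = HeadlinePattern.Match(html)

        While m.Success
            Dim item As New Headline()
            item.Url = m.Groups("url").Value
            item.LinkText = Clean(m.Groups("text").Value)
            item.Description = Clean(m.Groups("desc").Value)
            results.Add(item)
            m = m.NextMatch()
        End While

        Return results
    End Function

End Module
```

Calling `HeadlineParser.ParseHeadlines(html)` on the downloaded page source would return one `Headline` per matched block. A regex like this is tied to the exact markup, so it will stop matching as soon as the page layout changes; an HTML parser would be the sturdier choice if one is available.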

