Find and list specific links in a webpage using Python

Views: 169 Published: 2016/8/5 19:19:36 python beautifulsoup

Problem description

1.a From the links present in the source code of a webpage, I want to make a list of all links like "mypage.php?REF=1137988", that is, mypage.php?REF= followed by a number.

1.b. However, the source page also contains links like SuppForm.php?REF=1137988, which I want to avoid.

</TD></TR>
</TABLE>
<FONT CLASS=t><TABLE cellspacing=5><TR><TD bgcolor='#FFFFA0' style='border:5px ridge lightgray'><TABLE cellspacing=4><TR><TD VALIGN=top><FONT CLASS=t2><CENTER>2015-09-03<BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFFF' style='border:4px ridge lightgray'><CENTER><FONT CLASS=t9>1137988 <A HREF='SuppForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/supp.gif' width=12 height=12 border=0 TITLE='delete'></A> <A HREF='ModifForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/modif.gif' width=10 height=11 border=0 TITLE='modify'></A><BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFA0' style='border:4px ridge lightgray'><TABLE><TR><TD><IMG SRC='faces/F.gif' width=36 border=0></TD><TD><CENTER><FONT SIZE=1>Age<BR></FONT><FONT SIZE=5><B>35</TD></TR></TABLE></TD></TR></TABLE></TD></TR></TABLE></TD><TD WIDTH=50%><CENTER><FONT class=t><A HREF='mypage.php?REF=1137988' TARGET='_blank'><I>

Here is my code so far, which I have been trying to get working:

from bs4 import BeautifulSoup
import urllib2
url = "http://wwww.somewebsite.com"

headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.find_all("a")
for link in links:
    print "A HREF=mypage.php?REF=" %(link.get("a"), link.text)

print links

2. I also want to put just the number after REF into a list, which I will then use in the numbers part of this code:

This means the numbers I extract from the first list must all be separated by commas and placed inside replace = [ ].

template = """fjajflakjfakjfl;kj REF={}
sklkasalsjklas
klajsl;kdajs;djas
aksljl;askjflka
"""

replace = [1131062,
           1140921,
           1141326,
           1141355,
           1141426,
           1141430,
           1141461,
           1141473,
           1141477,
           1141502]

output = [template.format(r) for r in replace]
with open('output.txt', 'w') as f_output:
    f_output.write(''.join([template.format(r) for r in replace]))

So please help me with the two things I want to do here. Sorry if the formatting is a bit off.

Thank you very much.

As suggested by @wilbur, I modified my code. This is what I did:

from bs4 import BeautifulSoup
import urllib2
import re

url = "somewebsite"

headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)

links = soup.findAll('a', href=re.compile('.*mypage\.php\?REF=[0-9]*'))
template = """lasljasfkljaslkfj{}
slajfljasflk
aslkjfklasjflkasjf
alksjflkasjf;lk
"""

replace = [ link.split("=")[1] for link in links ]

output = [template.format(r) for r in replace]

print output
with open('output.txt', 'w') as f_output:
    f_output.write(''.join([template.format(r) for r in replace]))

Solution

The following grabs all the links that match your description, then extracts the REF parameter from each one and puts the values into replace.

from bs4 import BeautifulSoup
import re  # needed for re.compile below
import urllib2
url = "http://wwww.somewebsite.com"

headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
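# keep only anchors whose href looks like mypage.php?REF=<digits>
# (raw string so the backslashes reach the regex engine unchanged;
# [0-9]+ requires at least one digit after REF=)
links = soup.findAll('a', href=re.compile(r'.*mypage\.php\?REF=[0-9]+'))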

replace = [ link['href'].split("=")[1] for link in links ]
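
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the answerer's code: it swaps BeautifulSoup for the stdlib html.parser module, it runs on a trimmed copy of the markup from the question (the link texts are made up) instead of fetching a page, and the names MyPageLinkCollector, extract_refs, render, SAMPLE_HTML and TEMPLATE are hypothetical. It also reads the REF value with urllib.parse rather than split("="), which should hold up better if an href ever carries more than one parameter.

```python
import re
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit

# Matches hrefs of the form mypage.php?REF=<digits> (optionally preceded by a
# path); SuppForm.php and ModifForm.php links do not match.
MYPAGE_HREF = re.compile(r"(?:^|/)mypage\.php\?REF=[0-9]+$")


class MyPageLinkCollector(HTMLParser):
    """Hypothetical helper: collects the REF value of each matching <a> tag."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> works.
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if MYPAGE_HREF.search(href):
            query = parse_qs(urlsplit(href).query)
            self.refs.append(query["REF"][0])


def extract_refs(html):
    """Return the REF numbers (as strings) of all mypage.php links in html."""
    collector = MyPageLinkCollector()
    collector.feed(html)
    return collector.refs


def render(template, refs):
    """Fill the template once per REF value and concatenate the results."""
    return "".join(template.format(ref) for ref in refs)


# Assumption: a shortened stand-in for the page source shown in the question.
SAMPLE_HTML = (
    "<A HREF='SuppForm.php?REF=1137988' target='_blank'>delete</A> "
    "<A HREF='ModifForm.php?REF=1137988' target='_blank'>modify</A> "
    "<A HREF='mypage.php?REF=1137988' TARGET='_blank'>profile</A>"
)
# Assumption: a placeholder template with one {} slot, as in the question.
TEMPLATE = "some text REF={}\nmore text\n"

refs = extract_refs(SAMPLE_HTML)
print(refs)  # ['1137988']
print(render(TEMPLATE, refs))
```

To write the result to a file as in the question, the string returned by render(TEMPLATE, refs) can be passed to f_output.write inside the existing with open('output.txt', 'w') block.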
