Find and list specific links in a webpage using Python
Problem description
1.a From the links present in the source code of a webpage, I want to make a list of all links like "mypage.php?REF=1137988", i.e. "mypage.php?REF=" followed by a number.
1.b However, the source page also contains links like "SuppForm.php?REF=1137988", which I wish to avoid.
</TD></TR>
</TABLE>
<FONT CLASS=t><TABLE cellspacing=5><TR><TD bgcolor='#FFFFA0' style='border:5px ridge lightgray'><TABLE cellspacing=4><TR><TD VALIGN=top><FONT CLASS=t2><CENTER>2015-09-03<BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFFF' style='border:4px ridge lightgray'><CENTER><FONT CLASS=t9>1137988 <A HREF='SuppForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/supp.gif' width=12 height=12 border=0 TITLE='delete'></A> <A HREF='ModifForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/modif.gif' width=10 height=11 border=0 TITLE='modify'></A><BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFA0' style='border:4px ridge lightgray'><TABLE><TR><TD><IMG SRC='faces/F.gif' width=36 border=0></TD><TD><CENTER><FONT SIZE=1>Age<BR></FONT><FONT SIZE=5><B>35</TD></TR></TABLE></TD></TR></TABLE></TD></TR></TABLE></TD><TD WIDTH=50%><CENTER><FONT class=t><A HREF='mypage.php?REF=1137988' TARGET='_blank'><I>
Here is my code so far, which I have been trying to implement:
from bs4 import BeautifulSoup
import urllib2

url = "http://wwww.somewebsite.com"
headers = {'User-Agent': 'Mozilla/5.0'}
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.find_all("a")
for link in links:
    print link.get("href"), link.text
print links
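As a self-contained illustration of the filtering described in 1.a/1.b, a plain regex over the sample markup already separates the two link types. This sketch (written in Python 3 syntax) skips the network fetch; the html string here is a trimmed stand-in for the real page source:

```python
import re

# Trimmed stand-in for the page source shown above.
html = ("<A HREF='SuppForm.php?REF=1137988' target='_blank'></A> "
        "<A HREF='ModifForm.php?REF=1137988' target='_blank'></A> "
        "<A HREF='mypage.php?REF=1137988' TARGET='_blank'></A>")

# Capture only the digits of mypage.php?REF=<number>; the SuppForm.php
# and ModifForm.php links do not match the pattern.
refs = re.findall(r"mypage\.php\?REF=(\d+)", html)
print(refs)  # → ['1137988']
```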
2. I also want to put just the number after REF= into a list. The numbers extracted in the first step would then be placed, comma-separated, inside the replace = [ ] list of this code:
template = """fjajflakjfakjfl;kj REF={}
sklkasalsjklas
klajsl;kdajs;djas
aksljl;askjflka
"""
replace = [1131062, 1140921, 1141326, 1141355, 1141426,
           1141430, 1141461, 1141473, 1141477, 1141502]
output = [template.format(r) for r in replace]
with open('output.txt', 'w') as f_output:
    f_output.write(''.join(output))
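The templating step on its own can be sketched like this (Python 3 syntax; the one-line template text is a hypothetical stand-in for the real multi-line template):

```python
# Hypothetical template; "{}" is the slot that str.format fills.
template = "some text REF={} more text\n"
replace = [1131062, 1140921, 1141326]

# One rendered copy of the template per number.
output = [template.format(r) for r in replace]
print(''.join(output))
```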
So please help with the two things I am trying to do here. Sorry if the formatting is a bit off.
Thank you very much.
As suggested by @wilbur, I modified my code; this is what I did:
from bs4 import BeautifulSoup
import urllib2
import re
url = "somewebsite"
headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.findAll('a', href=re.compile('.*mypage\.php\?REF=[0-9]*'))
template = """lasljasfkljaslkfj{}
slajfljasflk
aslkjfklasjflkasjf
alksjflkasjf;lk
"""
replace = [link['href'].split("=")[1] for link in links]
output = [template.format(r) for r in replace]
print output
with open('output.txt', 'w') as f_output:
    f_output.write(''.join(output))
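A note on the extraction itself: splitting on "=" works for these hrefs, but parsing the query string explicitly keeps working if a URL ever gains more than one parameter. A minimal sketch with the standard library (urllib.parse in Python 3; the equivalent functions live in urlparse in Python 2):

```python
from urllib.parse import urlparse, parse_qs

href = "mypage.php?REF=1137988"  # example href as found in the page

# parse_qs returns {'REF': ['1137988']}; take the first value.
ref = parse_qs(urlparse(href).query)["REF"][0]
print(ref)  # → 1137988
```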
The following will grab all the links that match your description and then get the REF parameters from each and put them into replace.
from bs4 import BeautifulSoup
import urllib2
import re
url = "http://wwww.somewebsite.com"
headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.findAll('a', href=re.compile('.*mypage\.php\?REF=[0-9]*'))
replace = [ link['href'].split("=")[1] for link in links ]
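Putting the two halves together end to end (extraction plus templating), here is a runnable Python 3 sketch; it uses a stand-in html string instead of the network fetch and writes to a temporary file rather than output.txt:

```python
import os
import re
import tempfile

# Stand-in for the downloaded page source.
html = ("<A HREF='SuppForm.php?REF=1141326'></A>"
        "<A HREF='mypage.php?REF=1141326'></A>"
        "<A HREF='mypage.php?REF=1141355'></A>")

template = "text before REF={} text after\n"  # hypothetical template

# Keep only the numbers from mypage.php?REF=<number> links.
refs = re.findall(r"mypage\.php\?REF=(\d+)", html)

# Render the template once per number and write the result out.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(out_path, "w") as f_output:
    f_output.write(''.join(template.format(r) for r in refs))

with open(out_path) as f:
    result = f.read()
print(result)
```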