Find and list specific links in a webpage using Python


Problem description

    1.a From the links in a webpage's source code, I want to build a list of every link of the form "mypage.php?REF=1137988", that is, mypage.php?REF= followed by a number.

    1.b However, the source page also contains links such as SuppForm.php?REF=1137988, which I want to skip.

    </TD></TR>
    </TABLE>
    <FONT CLASS=t><TABLE cellspacing=5><TR><TD bgcolor='#FFFFA0' style='border:5px ridge lightgray'><TABLE cellspacing=4><TR><TD VALIGN=top><FONT CLASS=t2><CENTER>2015-09-03<BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFFF' style='border:4px ridge lightgray'><CENTER><FONT CLASS=t9>1137988 <A HREF='SuppForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/supp.gif' width=12 height=12 border=0 TITLE='delete'></A> <A HREF='ModifForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/modif.gif' width=10 height=11 border=0 TITLE='modify'></A><BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFA0' style='border:4px ridge lightgray'><TABLE><TR><TD><IMG SRC='faces/F.gif' width=36 border=0></TD><TD><CENTER><FONT SIZE=1>Age<BR></FONT><FONT SIZE=5><B>35</TD></TR></TABLE></TD></TR></TABLE></TD></TR></TABLE></TD><TD WIDTH=50%><CENTER><FONT class=t><A HREF='mypage.php?REF=1137988' TARGET='_blank'><I>
    

    Here is the code I have been trying so far:

    from bs4 import BeautifulSoup
    import urllib2
    url = "http://wwww.somewebsite.com"
    
    headers = { 'User-Agent' : 'Mozilla/5.0' }
    html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
    soup = BeautifulSoup(html)
    links = soup.find_all("a")
    for link in links:
        print "A HREF=mypage.php?REF=" %(link.get("a"), link.text)
    
    print links
    

    2. I also want to collect just the number after REF in a list, which I will use as the numbers part of the code below. In other words, the numbers extracted from the links in step 1 need to go, separated by commas, inside replace = [ ].

      template = """fjajflakjfakjfl;kj REF={}
      sklkasalsjklas
      klajsl;kdajs;djas
      aksljl;askjflka
      """
      
      replace = [1131062,
                 1140921,
                 1141326,
                 1141355,
                 1141426,
                 1141430,
                 1141461,
                 1141473,
                 1141477,
                 1141502]
      
      output = [template.format(r) for r in replace]
      with open('output.txt', 'w') as f_output:
          f_output.write(''.join([template.format(r) for r in replace]))
      

    Please help with these two things. Sorry if the formatting is a bit off.

    Thank you very much.

    As suggested by @wilbur, I modified my code. This is what I have now:

    from bs4 import BeautifulSoup
    import urllib2
    import re
    
    url = "somewebsite"
    
    headers = { 'User-Agent' : 'Mozilla/5.0' }
    html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
    soup = BeautifulSoup(html)
    
    links = soup.findAll('a', href=re.compile('.*mypage\.php\?REF=[0-9]*'))
    template = """lasljasfkljaslkfj{}
    slajfljasflk
    aslkjfklasjflkasjf
    alksjflkasjf;lk
    """
    
    replace = [ link.split("=")[1] for link in links ]
    
    output = [template.format(r) for r in replace]
    
    print output
    with open('output.txt', 'w') as f_output:
        f_output.write(''.join([template.format(r) for r in replace]))
    

    Solution

    The following will grab all the links that match your description and then get the REF parameters from each and put them into replace.

    from bs4 import BeautifulSoup
    import re
    import urllib2
    
    url = "http://wwww.somewebsite.com"
    
    headers = { 'User-Agent' : 'Mozilla/5.0' }
    html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
    soup = BeautifulSoup(html)
    # Keep only <a> tags whose href points at mypage.php?REF=<digits>
    links = soup.findAll('a', href=re.compile(r'.*mypage\.php\?REF=[0-9]*'))
    
    # Take the part of each href after the "=", i.e. the REF number
    replace = [ link['href'].split("=")[1] for link in links ]
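
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with the standard library alone. This is a minimal illustration, not the answerer's code: it swaps BeautifulSoup for `html.parser.HTMLParser`, and the names `RefCollector`, `extract_refs`, `render`, and `sample_html` are hypothetical. The sample HTML is a trimmed version of the snippet in the question, with link text and a second mypage.php link added for demonstration, and the one-line template stands in for the longer one above.

```python
import re
from html.parser import HTMLParser

# Assumption: the wanted hrefs end in mypage.php?REF=<digits>, optionally
# preceded by a directory path. SuppForm.php and ModifForm.php never match.
MYPAGE_RE = re.compile(r"(?:^|/)mypage\.php\?REF=(\d+)$")


class RefCollector(HTMLParser):
    """Collects the REF number of every <a> tag that links to mypage.php."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> works.
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = MYPAGE_RE.search(href)
        if match:
            self.refs.append(int(match.group(1)))


def extract_refs(html):
    """Return the REF numbers of all mypage.php links, in document order."""
    parser = RefCollector()
    parser.feed(html)
    return parser.refs


def render(template, refs):
    """Fill the template once per REF number and join the results."""
    return "".join(template.format(r) for r in refs)


# Illustrative input based on the question's HTML; the second mypage.php
# link is made up so the list has more than one entry.
sample_html = """
<A HREF='SuppForm.php?REF=1137988' target='_blank'>delete</A>
<A HREF='ModifForm.php?REF=1137988' target='_blank'>modify</A>
<A HREF='mypage.php?REF=1137988' TARGET='_blank'>profile</A>
<A HREF='mypage.php?REF=1140921' TARGET='_blank'>profile</A>
"""
template = "REF={}\n"

replace = extract_refs(sample_html)
output = render(template, replace)
print(replace)
print(output)
```

To run this against a live page, the HTML string would come from `urllib.request.urlopen` instead of `sample_html`, and `output` can be written to output.txt as in the question. Capturing the digits with a regex group avoids relying on `split("=")`, which would pick the wrong piece if the href ever carried more than one parameter.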
    
