Python: associating url ids and url titles in lists


Problem description

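This is a continuation of this question: Python beautifulsoup how to get the line after 'href' (http://stackoverflow.com/questions/23665526/python-beautifulsoup-how-to-get-the-line-after-href/23665984#23665984)

I have this HTML code: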

    <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html" class="ss-titre"> 
                            Monte le son         </a>
    <div class="rs-cell-details">
                            <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"  class="ss-titre">
                                    "Rubin_Steiner"                 </a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html" class="ss-titre"> 
                        Fare maohi              </a>

As you can see, "Monte le son" and "Rubin_Steiner" are associated with the same id (101973832), and "Fare maohi" is associated with the id 102103928.

So, actually I have these lists (example with one result, one id):

url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']      
titles = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']   #2 entries for id 101973832
                                                           #1 entry for id 102103928

A url could have 3 title entries, or 1, or none.

How can I associate the id in the address (101973832) with its titles, to get this result:

result = ['Monte le son Rubin_Steiner 101973832', 'Fare maohi 102103928']

The result will be displayed in my Gtk interface. Each entry needs to contain the id so that I can find the corresponding url, like this:

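choice = self.liste.get_active_text()     # choice = one entry of result
video_id = choice.split()[-1]             # assumes the id is the last word of the entry
for adress in url:
    if video_id in adress:
        adresse = adress                  # the url that contains the chosen id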

I hope my question is not too difficult to understand...

Edit: I get the titles and the urls like this:

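url = "http://pluzz.francetv.fr/recherche?recherche=" + mot  # mot is the word typed in my Gtk search
try:
    f = urllib.urlopen(url)  # Python 2 API; on Python 3 this is urllib.request.urlopen
    page = f.read()
    f.close()
except Exception:
    self.champ.set_text("La recherche a échoué")
    return  # this fragment runs inside a method; page is undefined if the download failed
soup = BeautifulSoup(page)
titres = []
list_url = []
for link in soup.findAll('a'):
    lien = link.get('href')
    if lien is None:
        lien = ""
    if "http://pluzz.francetv.fr/videos/" in lien:
        titre = link.text.strip()
        # Links that only say "watch this video" carry no real title
        if "Voir cette  vidéo" in titre:
            titre = ""
        if "Lire la vidéo" in titre:
            titre = ""
        # One title and one url are appended per link, so the two lists stay parallel
        titres.append(titre)
        list_url.append(lien)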

Solution

If I understand you correctly, all your urls and titles will be in lists like the ones in your example:

import re

In [111]: titles = ['Monte le son', 'Rubin_Steiner']

In [112]: url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html']

In [113]: get_id = re.findall(r'\d+', url[0]) # find runs of consecutive digits

In [114]: results = titles + get_id

In [115]: results
Out[115]: ['Monte le son', 'Rubin_Steiner', '101973832']
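
The id extraction above can also be wrapped in a small helper. This is only a sketch: `extract_id` and `ID_PATTERN` are hypothetical names, and the pattern assumes the id is the run of digits between a comma and the trailing `.html`, as in the example urls.

```python
import re

# Assumption: the id is the run of digits just before ".html",
# preceded by a comma, as in the example urls from the question.
ID_PATTERN = re.compile(r',(\d+)\.html$')


def extract_id(url):
    """Return the id embedded in a video url, or None if there is none."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None


url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html']
titles = ['Monte le son', 'Rubin_Steiner']

# Same result as the session above: the titles followed by the id
results = titles + [extract_id(url[0])]
print(results)
```

Anchoring the pattern to the end of the url avoids picking up digits that might appear elsewhere in a title slug, which a bare `\d+` would also match.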

As I said in my comments, when you add titles to your titles list, group the titles that belong to the same url in a sublist. With a flat list it is impossible to tell which title belongs to which url without some way of indexing the groupings. I have grouped them in sublists here to show you how it works.

In [3]: url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html',   'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']

In [4]: titles = [['Monte le son', 'Rubin_Steiner'], ['Fare maohi']]   # one sublist per url, in the same order as the url list

In [5]: get_ids = [re.findall(r'\d+', x) for x in url] # one list of ids per url; its position matches the sublist position in titles

In [6]: results = [t + i for t, i in zip(titles, get_ids)] # this is why sublists are useful: each sublist lines up with its url

In [7]: results

Out[7]: [['Monte le son', 'Rubin_Steiner', '101973832'], ['Fare maohi', '102103928']]

In [11]: final_results = [" ".join(y) for y in results] # join the strings in each sublist

In [12]: final_results

Out[12]: ['Monte le son Rubin_Steiner 101973832', 'Fare maohi 102103928']
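
Because the scraping loop in the question appends one title and one url per link, the two lists are parallel, so the grouping could also be built directly from the ids. The following is a minimal sketch under that assumption; `group_by_id` and `url_by_label` are hypothetical names, not part of the original code. Empty titles (the "Voir cette vidéo" links that the question blanks out) are skipped.

```python
import re
from collections import OrderedDict


def group_by_id(titres, list_url):
    """Group titles that share a url id.

    Returns (labels, url_by_label): the display strings, and a dict
    mapping each display string back to its url.
    """
    groups = OrderedDict()  # id -> (url, [titles]), in first-seen order
    for titre, lien in zip(titres, list_url):
        ids = re.findall(r'\d+', lien)
        if not ids:
            continue  # no id in this url, nothing to associate
        video_id = ids[-1]
        _, names = groups.setdefault(video_id, (lien, []))
        if titre and titre not in names:
            names.append(titre)  # skip empty and repeated titles

    url_by_label = OrderedDict()
    for video_id, (lien, names) in groups.items():
        label = " ".join(names + [video_id])
        url_by_label[label] = lien
    return list(url_by_label), url_by_label


titres = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']
list_url = [
    'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html',
    'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html',
    'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html',
]
labels, url_by_label = group_by_id(titres, list_url)
print(labels)
```

With the dict in hand, the lookup in the Gtk callback can be `adresse = url_by_label[choice]`, so there is no need to scan the url list for the id.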
