Web scraping using Beautiful Soup


Problem description

How could I get all the categories mentioned on each listing page of the website "https://www.sfma.org.sg/member/category"? For example, when I choose the Alcoholic Beverage category on that page, each listing shown there has category information like this:

Catergory: Alcoholic Beverage, Bottled Beverage, Spirit / Liquor / Hard Liquor, Wine, Distributor, Exporter, Importer, Supplier

How can I extract the categories mentioned there into the same variable?

The code I have written for this is:

category = soup_2.find_all('a', attrs={'class': 'plink'})
links = [link['href'] for link in category]

but it produces the output below, which is every link on the page rather than the category text I am after:

['http://www.sfma.org.sg/about/singapore-food-manufacturers-association',
 'http://www.sfma.org.sg/about/council-members',
 'http://www.sfma.org.sg/about/history-and-milestones',
 'http://www.sfma.org.sg/membership/',
 'http://www.sfma.org.sg/member/',
 'http://www.sfma.org.sg/member/alphabet/',
 'http://www.sfma.org.sg/member/category/',
 'http://www.sfma.org.sg/resources/sme-portal',
 'http://www.sfma.org.sg/resources/setting-up-food-establishments-in-singapore',
 'http://www.sfma.org.sg/resources/import-export-requirements-and-procedures',
 'http://www.sfma.org.sg/resources/labelling-guidelines',
 'http://www.sfma.org.sg/resources/wsq-continuing-education-modular-programmes',
 'http://www.sfma.org.sg/resources/holistic-industry-productivity-scorecard',
 'http://www.sfma.org.sg/resources/p-max',
 'http://www.sfma.org.sg/event/',
  .....]
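As an aside on the distinction the question runs into: `a['href']` returns the link target, while `a.get_text()` returns the visible link text. A minimal standalone sketch (the markup below is hypothetical, not taken from the SFMA site):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure of the listing links
html = """
<a class="plink" href="http://www.sfma.org.sg/member/">Members</a>
<a class="plink" href="http://www.sfma.org.sg/event/">Events</a>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", attrs={"class": "plink"})

hrefs = [a["href"] for a in links]                # the URL targets
texts = [a.get_text(strip=True) for a in links]   # the visible link text
print(hrefs)  # ['http://www.sfma.org.sg/member/', 'http://www.sfma.org.sg/event/']
print(texts)  # ['Members', 'Events']
```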

Please excuse me if the question seems novice, I am just very new to Python.

Thanks!!!

Recommended answer

If you just want the links out of the results you already posted, you can get them like this:

import requests 
from bs4 import BeautifulSoup

page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')
links = soup.find_all('a', attrs ={'class' :'plink'})
for link in links:
    print(link['href'])

Output:

../info/{{permalink}}
http://www.sfma.org.sg/about/singapore-food-manufacturers-association
http://www.sfma.org.sg/about/council-members
http://www.sfma.org.sg/about/history-and-milestones
http://www.sfma.org.sg/membership/
http://www.sfma.org.sg/member/
http://www.sfma.org.sg/member/alphabet/
http://www.sfma.org.sg/member/category/
http://www.sfma.org.sg/resources/sme-portal
http://www.sfma.org.sg/resources/setting-up-food-establishments-in-singapore
http://www.sfma.org.sg/resources/import-export-requirements-and-procedures
http://www.sfma.org.sg/resources/labelling-guidelines
http://www.sfma.org.sg/resources/wsq-continuing-education-modular-programmes
http://www.sfma.org.sg/resources/holistic-industry-productivity-scorecard
http://www.sfma.org.sg/resources/p-max
http://www.sfma.org.sg/event/
http://www.sfma.org.sg/news/
http://www.fipa.com.sg/
http://www.sfma.org.sg/stp
http://www.sgfoodgifts.sg/

However, if you want the links to each of the entries on the website, you need to join the permalink values with the base URL. I've extended nag's answer to help get the data you want from the site you are looking at. Some permalink values appear in a second list and don't work (they are food/beverage types rather than companies), so those repeats are removed.
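The joining step itself needs only the standard library. A minimal sketch of how `urljoin` resolves a relative `../info/` permalink against the listing page URL (the permalink value below is one of the real slugs from the output further down):

```python
from urllib.parse import urljoin

page = "https://www.sfma.org.sg/member/category/manufacturer"
href = "../info/" + "1a-catering-pte-ltd"  # relative base path + permalink value

# ".." steps up from /member/category/ to /member/, then appends info/<slug>
full_url = urljoin(page, href)
print(full_url)  # https://www.sfma.org.sg/member/info/1a-catering-pte-ltd
```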

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re


page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')

url_list = []
pattern = re.compile(r'permalink:\'(.*?)\'')

# The member data is embedded in inline <script> blocks, so search each one
for section in soup.find_all('script'):
    if not section.contents:
        continue
    permlinks = pattern.findall(section.contents[0])
    for permlink in permlinks:
        full_url = urljoin(page, '../info/' + permlink)
        if full_url in url_list:
            # permalinks that repeat are food/beverage types, not companies - drop them
            url_list.remove(full_url)
        else:
            url_list.append(full_url)

for url in url_list:
    print(url)

Output (truncated):

https://www.sfma.org.sg/member/info/1a-catering-pte-ltd
https://www.sfma.org.sg/member/info/a-linkz-marketing-pte-ltd
https://www.sfma.org.sg/member/info/aalst-chocolate-pte-ltd
https://www.sfma.org.sg/member/info/abb-pte-ltd
https://www.sfma.org.sg/member/info/ace-synergy-international-pte-ltd
https://www.sfma.org.sg/member/info/acez-instruments-pte-ltd
https://www.sfma.org.sg/member/info/acorn-investments-holding-pte-ltd
https://www.sfma.org.sg/member/info/ad-wright-communications-pte-ltd
https://www.sfma.org.sg/member/info/added-international-s-pte-ltd
https://www.sfma.org.sg/member/info/advance-carton-pte-ltd
https://www.sfma.org.sg/member/info/agroegg-pte-ltd
https://www.sfma.org.sg/member/info/airverclean-pte-ltd
...
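To get back to the original question, the categories on each member page, every URL above could be fetched in turn and the category line parsed out of its text. The exact markup of the member pages isn't shown here, so the following is only a sketch: the `parse_categories` helper and the sample string are hypothetical, assuming the page's visible text contains a line such as "Catergory: Alcoholic Beverage, ..." as quoted in the question.

```python
import re

def parse_categories(page_text):
    """Extract the comma-separated values after the 'Category:' label.

    Also tolerates the 'Catergory' spelling shown in the question.
    """
    match = re.search(r'Cater?gory\s*:\s*(.+)', page_text)
    if not match:
        return []
    return [c.strip() for c in match.group(1).split(',')]

# Hypothetical sample of a member page's visible text
sample = "Catergory: Alcoholic Beverage, Bottled Beverage, Wine, Importer"
print(parse_categories(sample))  # ['Alcoholic Beverage', 'Bottled Beverage', 'Wine', 'Importer']
```

Plugged into the loop above, each `full_url` could be fetched with `requests` and its `soup.get_text()` passed to this helper.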

