Is there a way to parse data from multiple pages from a parent webpage?


Question


So I have been going to a website to get NDC codes, https://ndclist.com/?s=Solifenacin, and I need to get 10-digit NDC codes, but the current webpage only shows 8-digit NDC codes.

So I click on an underlined 8-digit NDC code and get a second webpage that lists the matching 10-digit codes.

So I copy and paste these 2 NDC codes into an Excel sheet, and repeat the process for the rest of the codes on the first webpage. But this process takes a good bit of time, and I was wondering if there is a library in Python that could collect the 10-digit NDC codes for me, or store them in a list that I could print once I'm finished with all the 8-digit NDC codes on the first page. Would BeautifulSoup work, or is there a better library for this?

EDIT <<<< I actually need to go another level deep, and I've been trying to figure it out but keep failing. Apparently the last level of webpage is this dumb HTML table, and I only need one element of the table; it's the last webpage you reach after clicking on the 2nd-level codes.

Here is the code that I have, but it returns tr and None objects once I run it.

import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for b in soup2.select('#product-packages a'):
        link_url2 = b['href']
        print('Processing link {}... '.format(link_url2))
        soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
        # This iterates over the children of the second tr, so link is
        # either a td tag or a bare text node whose .name is None.
        for link in soup3.findAll('tr', limit=7)[1]:
            print(link.name)
            all_data.append(link.name)

print('Trospium')
print(all_data)
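For what it's worth, the tr/None output seems to come from the inner loop: iterating over a bs4 Tag walks its child nodes, and the bare text between cells is a NavigableString whose .name is None. A minimal snippet (standalone, with made-up HTML) that reproduces the symptom:

from bs4 import BeautifulSoup

# Iterating a Tag yields its children: td tags and bare whitespace text.
row = BeautifulSoup('<table><tr><td>A</td> <td>B</td></tr></table>',
                    'html.parser').tr
for child in row:
    print(child.name)   # prints: td, None, td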

Solution

Yes, BeautifulSoup is ideal in this case. This script will print all the 10-digit codes from the page:

import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for link in soup2.select('#product-packages a'):
        print(link.text)
        all_data.append(link.text)

# In all_data you have all the codes; uncomment to print them:
# print(all_data)

Prints:

Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56
0093-5263-98
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56
0093-5264-98
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19
Processing link https://ndclist.com/ndc/27241-037...
27241-037-03
27241-037-09

... and so on.
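Since the end goal was an Excel sheet, here is a small follow-up sketch (standard library only; the filename is arbitrary) that writes the all_data list collected by the script above into a CSV file that Excel can open:

import csv

# Continues the script above: all_data holds the scraped 10-digit codes.
with open('ndc_codes.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['NDC code'])                  # header row
    writer.writerows([code] for code in all_data)  # one code per row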

EDIT: (Version where I get the description too):

import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for code, desc in zip(soup2.select('a > h4'), soup2.select('a + p.gi-1x')):
        code = code.get_text(strip=True).split(maxsplit=1)[-1]
        desc = desc.get_text(strip=True).split(maxsplit=2)[-1]
        print(code, desc)
        all_data.append((code, desc))

# In all_data you have all the codes:
# print(all_data)

Prints:

Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5263-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5264-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19 90 TABLET, FILM COATED in 1 BOTTLE

...and so on.
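And for the edit's third level: the scripts above stop at the second-level pages, so here is a hedged sketch of the same pattern taken one page deeper. Which row and cell of the final table hold the wanted value is an assumption (the actual table layout isn't shown), so the indices below will likely need adjusting:

import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    soup2 = BeautifulSoup(requests.get(a['href']).content, 'html.parser')
    for b in soup2.select('#product-packages a'):
        soup3 = BeautifulSoup(requests.get(b['href']).content, 'html.parser')
        rows = soup3.find_all('tr')
        # Assumed position: first cell of the second row. Adjust to the
        # real table once you can inspect it.
        if len(rows) > 1:
            cells = rows[1].find_all(['td', 'th'])
            if cells:
                all_data.append(cells[0].get_text(strip=True))

print(all_data)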
