Extract specific columns from a given webpage


Problem description

I am trying to read a web page using Python and save the data in CSV format so it can be imported as a pandas DataFrame.

I have the following code, which extracts the links from all the pages; instead, I am trying to read certain column fields.

import urllib2
from bs4 import BeautifulSoup

for i in range(10):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page, 'html.parser')
        # Each workshop page lists its details in 'col-xs-8' divs; take the first 9.
        for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]:
            print i, anchor.text
    except:
        pass

Can I save these 9 columns as a pandas DataFrame?

df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']

Recommended answer

This returns the correct results for the first 10 pages, but it takes a lot of time for 100 pages. Any suggestions to make it faster?

import urllib2
from bs4 import BeautifulSoup
import pandas as pd

finallist = list()
for i in range(10):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page, 'html.parser')
        # Collect the first 9 'col-xs-8' fields on each workshop page as one row.
        mylist = list()
        for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]:
            mylist.append(anchor.text)
        finallist.append(mylist)
    except:
        pass

df = pd.DataFrame(finallist)
df.columns = ['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']

# Convert the Date and participants columns to proper dtypes.
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
df['participants'] = df['participants'].astype(int)
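
Since most of the time here is spent waiting on the network rather than parsing, one possible speed-up (not part of the original answer) is to download the pages concurrently. Below is a minimal sketch using a thread pool from the standard library's multiprocessing.dummy module; the pool size of 10 and the range of 100 pages are illustrative assumptions.

import urllib2
from multiprocessing.dummy import Pool  # thread-based pool from the standard library
from bs4 import BeautifulSoup
import pandas as pd

def scrape(i):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page, 'html.parser')
        # Same extraction as above: first 9 'col-xs-8' fields make one row.
        return [anchor.text for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]]
    except Exception:
        return None

pool = Pool(10)  # 10 concurrent downloads; the pool size is an arbitrary choice
rows = [row for row in pool.map(scrape, range(100)) if row]
pool.close()
pool.join()

df = pd.DataFrame(rows, columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level',
                                 'participants', 'Section', 'Status', 'Description'])

Threads are a reasonable fit here because each worker spends its time blocked on urlopen; the same dtype conversions from the answer above can then be applied to the resulting DataFrame.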
