Web-scraping: Empty dataset after collecting information

Question

I would like to create a dataset that includes information scraped from a website. I explain what I have done and the expected output below. I am getting empty arrays for rows and columns, then for the whole dataset, and I do not understand the reason. I hope you can help me.

1) Create an empty dataframe with only one column: this column should contain the list of urls to use.

import pandas as pd

data_to_use = pd.DataFrame([], columns=['URL'])

2) Select urls from a previous dataset.

select_urls=dataset.URL.tolist()

The list of urls looks like:

                             URL
0                     www.bbc.co.uk
1             www.stackoverflow.com           
2                       www.who.int
3                       www.cnn.com
4         www.cooptrasportiriolo.it
...                             ...

3) Populate the column with these urls:

data_to_use['URL'] = select_urls
data_to_use['URLcleaned'] = data_to_use['URL'].str.replace(r'^(www\.)', '', regex=True)

4) Select a random sample to test: the first 50 rows in column URL

data_to_use = data_to_use.loc[1:50, 'URL']
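
Note that selecting the single label 'URL' here returns a Series rather than a DataFrame, so the data_to_use['URLcleaned'] lookup in step 5 would raise a KeyError. A minimal sketch that keeps both columns while still restricting the test to the same 50 rows (assuming that is the intent) would be:

data_to_use = data_to_use.loc[1:50, ['URL', 'URLcleaned']]  # keep the DataFrame with both columns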

5) Try to scrape the information:

import requests
import time
from bs4 import BeautifulSoup

urls= data_to_use['URLcleaned'].tolist()

ares = []

for u in urls: # in the selection there should be an error. I am not sure that I am selecting the right one
    print(u)
    url = 'https://www.urlvoid.com/scan/'+ u
    r = requests.get(url)
    ares.append(r)   

rows = []
cols = []

for ar in ares:
    soup = BeautifulSoup(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")   
    try:
            dat = tab[0].select('tr')
            line= []
            header=[]
            for d in dat:
                row = d.select('td')
                line.append(row[1].text)
            new_header = row[0].text
            if not new_header in cols:
                cols.append(new_header)
            rows.append(line)
    except IndexError:
        continue

print(rows) # this works fine. It prints the rows. The issue comes from the next line

data_to_use = pd.DataFrame(rows,columns=cols)  

Unfortunately there is something wrong in the steps above as I am not getting any results, but only [] or __.

The error from data_to_use = pd.DataFrame(rows,columns=cols):

ValueError: 1 columns passed, passed data had 12 columns

My expected output would be:

URL                 Website Address      Last Analysis   Blacklist Status \
bbc.co.uk           Bbc.co.uk            9 days ago      0/35
stackoverflow.com   Stackoverflow.com    7 days ago      0/35

Domain Registration           IP Address      Server Location      ...
1996-08-01 | 24 years ago     151.101.64.81   (US) United States   ...
2003-12-26 | 17 years ago     ...

At the end I should save the created dataset in a csv file.

Answer

Putting aside the conversion to csv, let's try it this way:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['gov.ie', 'who.int', 'comune.staranzano.go.it', 'cooptrasportiriolo.it', 'laprovinciadicomo.it', 'asufc.sanita.fvg.it', 'canale7.tv', 'gradenigo.it', 'leggo.it', 'urbanpost.it', 'monitorimmobiliare.it', 'comune.villachiara.bs.it', 'ilcittadinomb.it', 'europamulticlub.com']
ares = []
for u in urls:
    url = 'https://www.urlvoid.com/scan/'+u
    r = requests.get(url)
    ares.append(r)

Note that 3 of the urls have no data, so you should have only 11 rows in the dataframe. Next:

rows = []
cols = []
for ar in ares:
    soup = bs(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")        
    if len(tab)>0:
        dat = tab[0].select('tr')
        line= []
        header=[]
        for d in dat:
            row = d.select('td')
            line.append(row[1].text)
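            # read the header cell inside the row loop; in the question's code this assignment sat outside the loop, so cols collected only one column name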
            new_header = row[0].text
            if not new_header in cols:
                cols.append(new_header)
        rows.append(line)

my_df = pd.DataFrame(rows,columns=cols)   
my_df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 12 columns):
Website Address        11 non-null object
Last Analysis          11 non-null object
Blacklist Status       11 non-null object
Domain Registration    11 non-null object
Domain Information     11 non-null object
IP Address             11 non-null object
Reverse DNS            11 non-null object
ASN                    11 non-null object
Server Location        11 non-null object
Latitude\Longitude     11 non-null object
City                   11 non-null object
Region                 11 non-null object
dtypes: object(12)
memory usage: 1.2+ KB
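
Putting the csv conversion back in, the finished dataframe can be written out with pandas' to_csv; a minimal sketch (the filename is only an example) would be:

my_df.to_csv('urlvoid_results.csv', index=False)  # write the scraped table to csv without the index column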
