Creating new columns by scraping information


Question

I am trying to add information scraped from a website into columns. I have a dataset that looks like:

COL1   COL2   COL3
...    ...    bbc.co.uk

and I would like to have a dataset which includes new columns:

COL1   COL2   COL3        Website Address   Last Analysis   Blacklist Status   IP Address   Server Location   City   Region
...    ...    bbc.co.uk

These new columns come from this website: https://www.urlvoid.com/scan/bbc.co.uk. I would need to fill each column with its related information.

For example:

COL1   COL2   COL3        Website Address   Last Analysis   Blacklist Status   Domain Registration         IP Address      Server Location      City      Region
...    ...    bbc.co.uk   Bbc.co.uk         9 days ago      0/35               1996-08-01 | 24 years ago   151.101.64.81   (US) United States   Unknown   Unknown

Unfortunately I am having some issues creating the new columns and filling them with the information scraped from the website. I might have more websites to check, not only bbc.co.uk. Please see the code I used below. I am sure there is a better (and less messy) approach to do that. I would be really grateful if you could help me figure it out. Thanks

As shown in the example above, to the already existing dataset with three columns (col1, col2 and col3) I should also add the fields that come from the scraping (Website Address, Last Analysis, Blacklist Status, ...). For each URL I should then have the information related to it (e.g. bbc.co.uk in the example).

COL1   COL2   COL3                Website Address     Last Analysis   Blacklist Status   Domain Registration         IP Address      Server Location      ...
...    ...    bbc.co.uk           Bbc.co.uk           9 days ago      0/35               1996-08-01 | 24 years ago   151.101.64.81   (US) United States   ...
...    ...    stackoverflow.com   Stackoverflow.com   7 days ago      0/35               2003-12-26 | 17 years ago   ...
...    ...    ...

(The format is not good, but I think it is enough to give you an idea of the expected output.)

Updated code:

import requests
from bs4 import BeautifulSoup

urls = ['bbc.co.uk', 'stackoverflow.com', ...]

for x in urls:
    print(x)
    r = requests.get('https://www.urlvoid.com/scan/' + x)
    soup = BeautifulSoup(r.content, 'lxml')
    # the summary table on the urlvoid report page
    tab = soup.select("table.table.table-custom.table-striped")
    dat = tab[0].select('tr')
    for d in dat:
        row = d.select('td')
        # first cell = field name, second cell = value
        original_dataset[row[0].text] = row[1].text

Unfortunately there is something I am doing wrong, as it copies only the information for the first URL checked (i.e. bbc.co.uk) into all the rows under each new column.
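
The symptom is consistent with how pandas handles scalar assignment: a statement like original_dataset[row[0].text] = row[1].text broadcasts a single value to every row of that column, so one site's data ends up repeated down the whole column and each later site simply overwrites it. A minimal sketch of that behaviour (toy DataFrame with made-up values, not the real dataset):

import pandas as pd

# Toy example: assigning a scalar to a column broadcasts it to every row.
df = pd.DataFrame({'COL3': ['bbc.co.uk', 'stackoverflow.com']})
df['Website Address'] = 'Bbc.co.uk'   # one scraped value, written into both rows
print(df)
#                 COL3 Website Address
# 0          bbc.co.uk       Bbc.co.uk
# 1  stackoverflow.com       Bbc.co.uk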

Answer

Let me know if this is what you are looking for:

import pandas as pd

cols = ['Col1', 'Col2']
rows = ['something', 'something else']
my_df = pd.DataFrame(rows, index=cols).transpose()
my_df

Picking up your existing code from this line:

dat = tab[0].select('tr')

add:

for d in dat:
    row = d.select('td')
    # field name becomes the column header, value becomes the cell content
    my_df[row[0].text] = row[1].text
my_df

Output (sorry about the formatting):

    Col1       Col2       Website Address   Last Analysis   Blacklist Status    Domain Registration     Domain Information  IP Address  Reverse DNS     ASN     Server Location     Latitude\Longitude  City    Region
0   something   something else  Bbc.com     11 days ago  |  Rescan  0/35    1989-07-15 | 31 years ago   WHOIS Lookup | DNS Records | Ping   151.101.192.81   Find Websites  |  IPVoid  |  ...   Unknown     AS54113 FASTLY  (US) United States  37.751 / -97.822   Google Map   Unknown     Unknown

To do it with multiple urls, try something like this:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['bbc.com', 'stackoverflow.com']

# fetch the urlvoid report page for each url
ares = []
for u in urls:
    url = 'https://www.urlvoid.com/scan/' + u
    r = requests.get(url)
    ares.append(r)

rows = []
cols = []
for ar in ares:
    soup = bs(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    dat = tab[0].select('tr')
    line = []
    for d in dat:
        row = d.select('td')
        line.append(row[1].text)      # cell value for this site
        new_header = row[0].text      # field name, e.g. "Website Address"
        if new_header not in cols:
            cols.append(new_header)
    rows.append(line)                 # one row of values per site

my_df = pd.DataFrame(rows, columns=cols)
my_df

Output:

Website Address     Last Analysis   Blacklist Status    Domain Registration     Domain Information  IP Address  Reverse DNS     ASN     Server Location     Latitude\Longitude  City    Region
0   Bbc.com     12 days ago  |  Rescan  0/35    1989-07-15 | 31 years ago   WHOIS Lookup | DNS Records | Ping   151.101.192.81   Find Websites  |  IPVoid  |  ...   Unknown     AS54113 FASTLY  (US) United States  37.751 / -97.822   Google Map   Unknown     Unknown
1   Stackoverflow.com   5 minutes ago  |  Rescan    0/35    2003-12-26 | 17 years ago   WHOIS Lookup | DNS Records | Ping   151.101.1.69   Find Websites  |  IPVoid  |  Whois   Unknown     AS54113 FASTLY  (US) United States  37.751 / -97.822   Google Map   Unknown     Unknown

Note that this doesn't have your two existing columns (since I don't know what they are), so you'll have to append them separately to the dataframe.
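
If the scrape is driven by the URLs taken from your existing dataset (so my_df ends up with one row per URL, in the same order), one way to attach the original columns is a column-wise concat. This is only a sketch, assuming your existing frame is called original_dataset as in the question:

import pandas as pd

# Sketch: original_dataset is the existing frame (COL1, COL2, COL3) and my_df is
# the scraped frame built above, with one row per URL in the same order as COL3.
combined = pd.concat(
    [original_dataset.reset_index(drop=True), my_df.reset_index(drop=True)],
    axis=1,  # side by side: row i of my_df belongs to row i of original_dataset
)
combined

Alternatively, since the scraped Website Address appears to be just the domain with a capitalised first letter, lower-casing it (my_df['Website Address'].str.lower()) and merging on COL3 would avoid relying on row order.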
