Extracting data properly with bs4?


Problem description


    This is my first question on this site, as I have tried many ways to get what I want without success. I am trying to extract two types of data from a French website similar to Craigslist. My need is simple, and I do manage to get the information, but my extract still contains tags and other stray characters. I also have encoding issues, even when using .encode('utf-8').

    # -*- coding: utf-8 -*-
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    import csv
    
    csvfile=open("test.csv", 'w+')
    
    html=urlopen("http://www.leboncoin.fr/annonces/offres/ile_de_france/")
    
    bsObj=BeautifulSoup(html)
    article= bsObj.findAll("h2",{"class":"title"})
    prix=bsObj.findAll("div",{"class":"price"})
    
    for art in article:
        art=art.text.encode('utf-8')
    
    print(article)
    
    for prix1 in prix:
        prix1=prix1.text.encode('utf-8')
        print(prix1)
    
    #Pour merger 2 listes (en deux colonnes, pas a la suite)
    table_2=list(zip(article,prix))
    
    try:
        writer=csv.writer(csvfile)
        writer.writerow(('Article', 'Prix'))
    
        for i in table_2:
            writer.writerow([i])
    
    finally:
        csvfile.close()
    

    When running this code:

    • My output still contains tags, etc., although I have run:

    for art in article: art=art.text.encode('utf-8')

    • Sometimes the encoding does not work due to "€" or "-" signs in the name of the product

    My questions are:

    • Why does ".text.encode()" not clean the tags from my article object?
    • Why do I still get encoding issues?

    I guess I am not using the function as expected, but despite my tests I cannot get the result I want.

    Thank you in advance for your insights.

    Cheers

    Jo

    Solution

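    You might have already realised your mistake. You were zipping and outputting the Tag elements returned by findAll and not the element texts. Rebinding the loop variable (art = art.text...) does not change the list itself, so article still held the original tags. I corrected your code below: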

    # -*- coding: utf-8 -*-
    from urllib.request import urlopen  # Python 3, as in the question
    from bs4 import BeautifulSoup
    import csv

    html = urlopen("http://www.leboncoin.fr/annonces/offres/ile_de_france/")

    bsObj = BeautifulSoup(html, "html.parser")
    article = bsObj.findAll("h2", {"class": "title"})
    prix = bsObj.findAll("div", {"class": "price"})

    # Collect the text of each tag in a new list
    articles = []
    for art in article:
        articles.append(art.text.strip())

    print(articles)

    prices = []
    for prix1 in prix:
        prices.append(prix1.text.strip())

    # Merge the two lists into two columns (side by side, not one after the other)
    table_2 = list(zip(articles, prices))

    # Let the file object handle the encoding, so "€" is written as UTF-8 text
    with open("test.csv", "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(("Article", "Prix"))

        for row in table_2:
            writer.writerow(row)  # one cell per column, not the whole tuple in one cell
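To see why the loop in the question left the tags in place, here is a minimal offline sketch. The HTML snippet and the names SAMPLE_HTML, title_tags, and price_tags are made up for illustration and only mimic the h2.title / div.price structure mentioned in the question; the real page may well look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real listing page
SAMPLE_HTML = """
<div><h2 class="title"> Vélo de course </h2><div class="price"> 120 € </div></div>
<div><h2 class="title"> Table - chêne </h2><div class="price"> 45 € </div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title_tags = soup.find_all("h2", {"class": "title"})
price_tags = soup.find_all("div", {"class": "price"})

# Rebinding the loop variable only changes the local name `tag`;
# the list still holds the original Tag objects afterwards.
for tag in title_tags:
    tag = tag.text

# Building new lists of strings is what actually drops the markup.
titles = [tag.get_text(strip=True) for tag in title_tags]
prices = [tag.get_text(strip=True) for tag in price_tags]
rows = list(zip(titles, prices))
print(rows)
```

Printing title_tags after the loop would still show the h2 markup, whereas rows contains plain strings only.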
    

    Also try not to leave comments in French next time ;)
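On the encoding part of the question: in Python 3, .encode('utf-8') returns a bytes object, and csv.writer stringifies non-string values with str(), so the cell ends up looking like b'...'. A small sketch of the alternative, assuming Python 3 and using made-up sample rows and a temporary file path, is to keep the values as str and let open() do the encoding:

```python
import csv
import os
import tempfile

# Made-up sample rows containing the characters mentioned in the question
rows = [("Vélo de course", "120 €"), ("Table - chêne", "45 €")]

# Temporary location, used here only so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), "test.csv")

# newline="" is what the csv module documentation recommends for csv files;
# encoding="utf-8" makes the file object encode the text on write.
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(("Article", "Prix"))
    writer.writerows(rows)

# Read the file back with the same encoding to check the round trip
with open(path, newline="", encoding="utf-8") as csvfile:
    read_back = list(csv.reader(csvfile))

print(read_back)
```

If the file is later opened in a spreadsheet that guesses the wrong encoding, the characters may still look wrong there even though the file itself is valid UTF-8.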
