Fix encoding error with loop in BeautifulSoup4?


Question

This is a follow-up to "Focusing on specific results while scraping Twitter with Python and Beautiful Soup 4?". I'm not using the Twitter API because it doesn't look at tweets by hashtag this far back.

The error described here only occurs on Windows 7. The code runs as intended on Linux (as reported by bernie; see comment below), and I am able to run it without encoding errors on OS X 10.10.2.

The encoding error occurs when I try to loop over the code that scrapes the content of the tweets.

This first snippet scrapes only the first tweet and gets everything in the <p> tags, as intended.

amessagetext = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})
amessage = amessagetext[0]

However, when I attempt to use a loop to scrape all the tweets with this second snippet,

messagetexts = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})  
messages = [messagetext for messagetext in messagetexts] 

I get this well-known cp437.py encoding error.

File "C:\Anaconda3\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 4052: character maps to <undefined>
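The failure can be reproduced outside BeautifulSoup entirely; it is the console's cp437 codec, not the parser, that rejects the em dash (U+2014). A minimal sketch:

```python
# Minimal reproduction of the cp437 failure, independent of BeautifulSoup.
text = "An em dash \u2014 breaks cp437 output"

try:
    text.encode("cp437")  # what print() effectively does on a cp437 console
except UnicodeEncodeError as exc:
    print("cp437 cannot encode:", repr(exc.object[exc.start:exc.end]))

# One workaround: substitute unencodable characters instead of raising.
safe = text.encode("cp437", errors="replace").decode("cp437")
print(safe)  # the em dash comes out as "?"
```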

So why is the first tweet scraped successfully while multiple tweets cause encoding problems? Is it just because the first tweet happens to contain no problematic characters? I've successfully scraped the first tweet on several different searches, so I'm not sure that is the cause.
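One quick way to check that hypothesis is to scan each scraped string for characters the console codec cannot represent. A small hypothetical helper (not part of the original code) might look like:

```python
def unencodable(text, codec="cp437"):
    """Return the characters in `text` that `codec` cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

print(unencodable("plain ASCII tweet"))          # []
print(unencodable("tweet with em dash \u2014"))  # ['\u2014']
```

If the first tweet returns an empty list while later tweets do not, that would confirm it is the characters, not the loop, causing the error.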

How do I go about fixing this? I've read a few posts and book sections about it, and I understand why it happens, but I'm not sure how to correct it within the BeautifulSoup code.
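Since the exception is raised when the scraped text is printed, not when it is parsed, one possible sketch of a fix (assuming Python 3.7+, where sys.stdout.reconfigure exists) is to switch stdout away from the console codec before printing:

```python
import sys

# On Python 3.7+, force stdout to UTF-8 regardless of the console codec;
# errors="replace" keeps printing from raising even if UTF-8 ever fails.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

print("now safe to print an em dash \u2014 on a cp437 console")
```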

Here is the complete code for reference.

from bs4 import BeautifulSoup
import requests
import sys
import csv #Will be exporting to csv

url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0'} # (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")

names = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
usernames = [name.contents[0] for name in names]

handles = soup('span', {'class': 'username js-action-profile-name'})
userhandles = [handle.contents[1].contents[0] for handle in handles]   
athandles = [('@')+abhandle for abhandle in userhandles]

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
fullurls = [('http://www.twitter.com')+permalink for permalink in urls] 

timestamps = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
datetime = [timestamp["title"] for timestamp in timestamps]

messagetexts = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})  
messages = [messagetext for messagetext in messagetexts] 

amessagetext = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})
amessage = amessagetext[0]

retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcounts = [retweet.contents[3].contents[1].contents[1].string for retweet in retweets]

favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcounts = [favorite.contents[3].contents[1].contents[1].string for favorite in favorites]

print (usernames, "\n", "\n", athandles, "\n", "\n", fullurls, "\n", "\n", datetime, "\n", "\n",retweetcounts, "\n", "\n", favcounts, "\n", "\n", amessage, "\n", "\n", messages)

Answer

I've solved this to my own satisfaction by eliminating the print statements I was using for error checking, and by specifying the encoding for both the HTML file being scraped and the csv output file, adding encoding="utf-8" to both with open commands.
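A minimal sketch of the difference (file path hypothetical): writing the same em-dash text fails through a cp437 file handle but round-trips cleanly once encoding="utf-8" is passed to open():

```python
import os
import tempfile

text = "tweet with em dash \u2014"
path = os.path.join(tempfile.gettempdir(), "tweet_demo.txt")

# Forcing cp437 here reproduces the Windows failure deterministically.
try:
    with open(path, "w", encoding="cp437") as f:
        f.write(text)
except UnicodeEncodeError:
    print("cp437 file handle rejects the em dash")

# With an explicit utf-8 encoding, the write and read round-trip cleanly.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    print(f.read() == text)  # True
```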

from bs4 import BeautifulSoup
import requests
import sys
import csv
import re
from datetime import datetime
from pytz import timezone

url = input("Enter the name of the file to be scraped:")
with open(url, encoding="utf-8") as infile:
    soup = BeautifulSoup(infile, "html.parser")

#url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
#headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#r = requests.get(url, headers=headers)
#data = r.text.encode('utf-8')
#soup = BeautifulSoup(data, "html.parser")

names = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
usernames = [name.contents for name in names]

handles = soup('span', {'class': 'username js-action-profile-name'})
userhandles = [handle.contents[1].contents[0] for handle in handles]  
athandles = [('@')+abhandle for abhandle in userhandles]

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
fullurls = [permalink for permalink in urls]

timestamps = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
datetime = [timestamp["title"] for timestamp in timestamps]

messagetexts = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'}) 
messages = [messagetext for messagetext in messagetexts]  

retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcounts = [retweet.contents[3].contents[1].contents[1].string for retweet in retweets]

favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcounts = [favorite.contents[3].contents[1].contents[1].string for favorite in favorites]

images = soup('div', {'class': 'content'})
imagelinks = [src.contents[5].img if len(src.contents) > 5 else "No image" for src in images]

#print (usernames, "\n", "\n", athandles, "\n", "\n", fullurls, "\n", "\n", datetime, "\n", "\n",retweetcounts, "\n", "\n", favcounts, "\n", "\n", messages, "\n", "\n", imagelinks)

rows = zip(usernames,athandles,fullurls,datetime,retweetcounts,favcounts,messages,imagelinks)

rownew = list(rows)

#print (rownew)

newfile = input("Enter a filename for the table:") + ".csv"

with open(newfile, 'w', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter=",")
    writer.writerow(['Usernames', 'Handles', 'Urls', 'Timestamp', 'Retweets', 'Favorites', 'Message', 'Image Link'])
    for row in rownew:
        writer.writerow(row)
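One caveat when adapting this on Windows: the file handed to csv.writer should be opened with newline='', otherwise every row in the output file is followed by a blank line. A sketch with a hypothetical path and rows:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "tweets_demo.csv")
rows = [["user1", "@user1", "message \u2014 with em dash"]]

# newline='' hands line-ending control to the csv module, which avoids
# the extra blank line after each row that csv output otherwise gets
# on Windows.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    writer.writerow(["Username", "Handle", "Message"])
    writer.writerows(rows)
```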

