BeautifulSoup - scraping a forum page

Problem description

I'm trying to scrape a forum discussion and export it as a csv file, with rows such as "thread title", "user", and "post", where the latter is the actual forum post from each individual.

I'm a complete beginner with Python and BeautifulSoup so I'm having a really hard time with this!

My current problem is that all the text is split into one character per row in the csv file. Is there anyone out there who can help me out? It would be fantastic if someone could give me a hand!

Here's the code I've been using:

from bs4 import BeautifulSoup
import csv
import urllib2

f = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")

soup = BeautifulSoup(f)

b = soup.get_text().encode("utf-8").strip() #the posts contain non-ascii words, so I had to do this

writer = csv.writer(open('silkroad.csv', 'w'))
writer.writerows(b)

Recommended answer

Ok here we go. Not quite sure what I'm helping you do here, but hopefully you have a good reason to be analyzing silk road posts.

You have a few issues here, the big one is that you aren't parsing the data at all. What you're essentially doing with .get_text() is going to the page, highlighting the whole thing, and then copying and pasting the whole thing to a csv file.
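
Incidentally, that is also where the one-character-per-row symptom comes from: the csv writer's writerows() method expects an iterable of rows, and iterating over a plain string yields one character at a time, so every character gets written out as its own row. Here is a minimal sketch of the difference (demo.csv is just a throwaway file name for illustration):

import csv

text = "hello"

csvfile = open('demo.csv', 'wb')
writer = csv.writer(csvfile)

# writerows() expects an iterable of rows; a plain string iterates
# character by character, so each character becomes its own row
writer.writerows(text)

# wrapping the string in a list writes one row with a single field
writer.writerow([text])

csvfile.close()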

So here is what you should be trying to do:

  1. Read the page source
  2. Use soup to break it into the sections you want
  3. Save the sections in parallel arrays for author, date, time, post, etc.
  4. Write the data to the csv file row by row

I wrote some code to show you what that looks like; it should do the job:

from bs4 import BeautifulSoup
import csv
import urllib2

# get page source and create a BeautifulSoup object based on it
print "Reading page..."
page = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")
soup = BeautifulSoup(page)

# if you look at the HTML all the titles, dates, 
# and authors are stored inside of <dt ...> tags
metaData = soup.find_all("dt")

# likewise the post data is stored
# under <dd ...>
postData = soup.find_all("dd")

# define where we will store info
titles = []
authors = []
times = []
posts = []

# now we iterate through the metaData and parse it
# into titles, authors, and dates
print "Parsing data..."
for html in metaData:
    text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
    titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
    authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
    times.append(text.split(" on ")[1].strip()) # get date

# now we go through the actual post data and extract it
for post in postData:
    posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())

# now we write data to csv file
# ***csv files MUST be opened with the 'b' flag***
csvfile = open('silkroad.csv', 'wb')
writer = csv.writer(csvfile)

# create template
writer.writerow(["Time", "Author", "Title", "Post"])

# iterate through and write all the data
for time, author, title, post in zip(times, authors, titles, posts):
    writer.writerow([time, author, title, post])


# close file
csvfile.close()

# done
print "Operation completed successfully."

EDIT: Included a solution that can read files from a directory and use the data from them

Okay, so you have your HTML files in a directory. You need to get a list of files in the directory, iterate through them, and append to your csv file for each file in the directory.

Here is the basic logic of our new program.

If we had a function called processData() that took a file path as an argument and appended data from the file to your csv file, here is what it would look like:

# the directory where we have all our HTML files
dir = "myDir"

# our csv file
csvFile = "silkroad.csv"

# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()

# get a list of files in the directory
fileList = os.listdir(dir)

# define variables we need for status text
totalLen = len(fileList)
count = 1

# iterate through files and read all of them into the csv file
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
    count = count + 1 # increment counter

As it happens, our processData() function is more or less what we did before, with a few changes.

So this is very similar to our last program, with a few small changes:

  1. We write the column titles first
  2. Next, we open the csv with the 'ab' flag so we can append
  3. We import os to get the list of files

Here is what that looks like:

from bs4 import BeautifulSoup
import csv
import urllib2
import os # added this import to process files/dirs

# ** define our data processing function
def processData( pageFile ):
    ''' take the data from an html file and append to our csv file '''
    f = open(pageFile, "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page)

    # if you look at the HTML all the titles, dates, 
    # and authors are stored inside of <dt ...> tags
    metaData = soup.find_all("dt")

    # likewise the post data is stored
    # under <dd ...>
    postData = soup.find_all("dd")

    # define where we will store info
    titles = []
    authors = []
    times = []
    posts = []

    # now we iterate through the metaData and parse it
    # into titles, authors, and dates
    for html in metaData:
        text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
        titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
        authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
        times.append(text.split(" on ")[1].strip()) # get date

    # now we go through the actual post data and extract it
    for post in postData:
        posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())

    # now we write data to csv file
    # ***csv files MUST be opened with the 'b' flag***
    csvfile = open('silkroad.csv', 'ab')
    writer = csv.writer(csvfile)

    # iterate through and write all the data
    for time, author, title, post in zip(times, authors, titles, posts):
        writer.writerow([time, author, title, post])

    # close file
    csvfile.close()
# ** start our process of going through files

# the directory where we have all our HTML files
dir = "myDir"

# our csv file
csvFile = "silkroad.csv"

# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()

# get a list of files in the directory
fileList = os.listdir(dir)

# define variables we need for status text
totalLen = len(fileList)
count = 1

# iterate through files and read all of them into the csv file
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
    count = count + 1 # increment counter
