Extracting text from Microsoft Word files in Python with Scrapy

Question

Here is my sample Scrapy spider in Python, which tries to extract text from the .doc and .docx files linked from a website.

import StringIO
import urlparse

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider
from scrapy.item import Item, Field
from miette import DocReader


class wordSpiderItem(Item):

    link = Field()
    title = Field()
    Description = Field()


class wordSpider(CrawlSpider):

    name = "penyrheol"

    # Stay within these domains when crawling
    allowed_domains = ["penyrheol-comp.net"]
    start_urls = ["http://penyrheol-comp.net/vacancy"]

    def parse(self, response):
        listings = response.xpath('//div[@class="entry-content"]')
        links = []

        # Scrape the listings page to collect the document links
        for listing in listings:
            link = listing.xpath('//div[@class="afi-document-link"]/a/@href').extract()
            links.extend(link)

        # Request each document link and parse its content
        for link in links:
            item = wordSpiderItem()
            item['link'] = link
            if "doc" in link:
                yield Request(urlparse.urljoin(response.url, link),
                              meta={'item': item}, callback=self.parse_data)

    def parse_data(self, response):
        job = wordSpiderItem()
        job['link'] = response.url
        stream = StringIO.StringIO(response.body)
        reader = DocReader(stream)  # this line raises the TypeError below
        for page in reader.pages:
            job['Description'] = page.extractText()
            return job

I get the following error. Please check it and let me know how to make this code work...

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
C:\Documents and Settings\sureshp>cd D:\Final\penyrheolcomp
C:\Documents and Settings\sureshp>d:
D:\Final\penyrheolcomp>scrapy crawl penyrheol -o testd.json -t json
2014-09-05 17:49:55+0530 [scrapy] INFO: Scrapy 0.24.2 started (bot: penyrheolcomp)
2014-09-05 17:49:55+0530 [scrapy] INFO: Optional features available: ssl, http11
2014-09-05 17:49:55+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'penyrheolcomp.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['penyrheolcomp.spiders'], 'FEED_URI': 'testd.json', 'BOT_NAME': 'penyrheolcomp'}
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled item pipelines:
2014-09-05 17:49:56+0530 [penyrheol] INFO: Spider opened
2014-09-05 17:49:56+0530 [penyrheol] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-09-05 17:49:56+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-09-05 17:49:56+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-09-05 17:49:57+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/vacancy> (referer: None)
2014-09-05 17:49:59+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Application-form-Teaching.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:49:59+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Application-form-Teaching.doc>
        Traceback (most recent call last):
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
            self.runUntilCurrent()
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
            self._startRunCallbacks(result)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "penyrheolcomp\spiders\penyrheolcompnet.py", line 58, in parse_data
            reader = DocReader(stream)
          File "build\bdist.win32\egg\miette\doc.py", line 23, in __init__
          File "build\bdist.win32\egg\cfb\__init__.py", line 23, in __init__
        exceptions.TypeError: coercing to Unicode: need string or buffer, instance found
2014-09-05 17:50:00+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Information-pack-Teacher-of-English-0.6-Temporary.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:50:00+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Information-pack-Teacher-of-English-0.6-Temporary.doc>
        [identical TypeError traceback as above]
2014-09-05 17:50:00+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Advert-Teacher-of-English-Temp-0.6.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:50:00+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Advert-Teacher-of-English-Temp-0.6.doc>
        [identical TypeError traceback as above]
2014-09-05 17:50:01+0530 [penyrheol] INFO: Closing spider (finished)
2014-09-05 17:50:01+0530 [penyrheol] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 1208,
         'downloader/request_count': 4,
         'downloader/request_method_count/GET': 4,
         'downloader/response_bytes': 677942,
         'downloader/response_count': 4,
         'downloader/response_status_count/200': 4,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 9, 5, 12, 20, 1, 140000),
         'log_count/DEBUG': 6,
         'log_count/ERROR': 3,
         'log_count/INFO': 7,
         'request_depth_max': 1,
         'response_received_count': 4,
         'scheduler/dequeued': 4,
         'scheduler/dequeued/memory': 4,
         'scheduler/enqueued': 4,
         'scheduler/enqueued/memory': 4,
         'spider_exceptions/TypeError': 3,
         'start_time': datetime.datetime(2014, 9, 5, 12, 19, 56, 234000)}
2014-09-05 17:50:01+0530 [penyrheol] INFO: Spider closed (finished)
D:\Final\penyrheolcomp>

Answer

Your error is hiding in your stacktrace:

reader = DocReader(stream)
File "build\bdist.win32\egg\miette\doc.py", line 23, in __init__
File "build\bdist.win32\egg\cfb\__init__.py", line 23, in __init__
exceptions.TypeError: coercing to Unicode: need string or buffer, instance found

According to https://github.com/rembish/Miette/blob/master/miette/doc.py, the DocReader __init__ takes the filename of the document you want, not its body.

To get around this, you could write response.body to a temporary file and then point your DocReader at that temporary file.
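
For example, here is a minimal sketch of parse_data along those lines, reusing wordSpiderItem and DocReader from your spider. The tempfile/os handling is standard library; the to_text() extraction call is an assumption, so substitute whatever text-extraction method your miette version actually exposes:

import os
import tempfile

def parse_data(self, response):
    job = wordSpiderItem()
    job['link'] = response.url

    # DocReader wants a filename, so persist the downloaded bytes first
    fd, path = tempfile.mkstemp(suffix='.doc')
    try:
        with os.fdopen(fd, 'wb') as tmp:
            tmp.write(response.body)
        reader = DocReader(path)  # a real file path, not a StringIO object
        # Assumed extraction call -- adjust to your miette version's API
        job['Description'] = reader.to_text()
    finally:
        os.remove(path)

    return job

tempfile.mkstemp is used rather than NamedTemporaryFile because on Windows (where you are running) a file held open by NamedTemporaryFile generally cannot be reopened by name until it is closed.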
