Extracting text from Microsoft Word files in Python with Scrapy

Question

Here is my sample Scrapy spider in Python, which tries to extract text from the .doc and .docx files linked from a website.

import StringIO
import urlparse

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider
from scrapy.item import Item, Field
from miette import DocReader


class wordSpiderItem(Item):

    link = Field()
    title = Field()
    Description = Field()


class wordSpider(CrawlSpider):

    name = "penyrheol"

    # Stay within these domains when crawling
    allowed_domains = ["penyrheol-comp.net"]
    start_urls = ["http://penyrheol-comp.net/vacancy"]

    def parse(self, response):
        listings = response.xpath('//div[@class="entry-content"]')
        links = []

        # Scrape the listings page to collect the document links
        for listing in listings:
            link = listing.xpath('//div[@class="afi-document-link"]/a/@href').extract()
            links.extend(link)

        # Request each document link and parse its content
        for link in links:
            item = wordSpiderItem()
            item['link'] = link
            if "doc" in link:
                yield Request(urlparse.urljoin(response.url, link),
                              meta={'item': item}, callback=self.parse_data)

    def parse_data(self, response):
        job = wordSpiderItem()
        job['link'] = response.url
        stream = StringIO.StringIO(response.body)
        reader = DocReader(stream)  # this line raises the TypeError below
        for page in reader.pages:
            job['Description'] = page.extractText()
            return job

I get the following error. Please check it and let me know how to make this code work...

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
C:\Documents and Settings\sureshp>cd D:\Final\penyrheolcomp
C:\Documents and Settings\sureshp>d:
D:\Final\penyrheolcomp>scrapy crawl penyrheol -o testd.json -t json
2014-09-05 17:49:55+0530 [scrapy] INFO: Scrapy 0.24.2 started (bot: penyrheolcomp)
2014-09-05 17:49:55+0530 [scrapy] INFO: Optional features available: ssl, http11
2014-09-05 17:49:55+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'penyrheolcomp.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['penyrheolcomp.spiders'], 'FEED_URI': 'testd.json', 'BOT_NAME': 'penyrheolcomp'}
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled item pipelines:
2014-09-05 17:49:56+0530 [penyrheol] INFO: Spider opened
2014-09-05 17:49:56+0530 [penyrheol] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-09-05 17:49:56+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-09-05 17:49:56+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-09-05 17:49:57+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/vacancy> (referer: None)
2014-09-05 17:49:59+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Application-form-Teaching.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:49:59+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Application-form-Teaching.doc>
        Traceback (most recent call last):
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
            self.runUntilCurrent()
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
            self._startRunCallbacks(result)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "penyrheolcomp\spiders\penyrheolcompnet.py", line 58, in parse_data
            reader = DocReader(stream)
          File "build\bdist.win32\egg\miette\doc.py", line 23, in __init__
          File "build\bdist.win32\egg\cfb\__init__.py", line 23, in __init__
        exceptions.TypeError: coercing to Unicode: need string or buffer, instance found
2014-09-05 17:50:00+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Information-pack-Teacher-of-English-0.6-Temporary.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:50:00+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Information-pack-Teacher-of-English-0.6-Temporary.doc>
        [identical TypeError traceback as above]
2014-09-05 17:50:00+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Advert-Teacher-of-English-Temp-0.6.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:50:00+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Advert-Teacher-of-English-Temp-0.6.doc>
        [identical TypeError traceback as above]
2014-09-05 17:50:01+0530 [penyrheol] INFO: Closing spider (finished)
2014-09-05 17:50:01+0530 [penyrheol] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 1208,
         'downloader/request_count': 4,
         'downloader/request_method_count/GET': 4,
         'downloader/response_bytes': 677942,
         'downloader/response_count': 4,
         'downloader/response_status_count/200': 4,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 9, 5, 12, 20, 1, 140000),
         'log_count/DEBUG': 6,
         'log_count/ERROR': 3,
         'log_count/INFO': 7,
         'request_depth_max': 1,
         'response_received_count': 4,
         'scheduler/dequeued': 4,
         'scheduler/dequeued/memory': 4,
         'scheduler/enqueued': 4,
         'scheduler/enqueued/memory': 4,
         'spider_exceptions/TypeError': 3,
         'start_time': datetime.datetime(2014, 9, 5, 12, 19, 56, 234000)}
2014-09-05 17:50:01+0530 [penyrheol] INFO: Spider closed (finished)
D:\Final\penyrheolcomp>

Answer

Your error is hiding in your stacktrace:

reader = DocReader(stream)
File "build\bdist.win32\egg\miette\doc.py", line 23, in __init__
File "build\bdist.win32\egg\cfb\__init__.py", line 23, in __init__
exceptions.TypeError: coercing to Unicode: need string or buffer, instance found

According to https://github.com/rembish/Miette/blob/master/miette/doc.py, the DocReader __init__ takes the filename of the document you want, not its body.

To get around this, you could write response.body to a temporary file and then point your DocReader at that temporary file.
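
For example, here is a minimal sketch of parse_data along those lines, reusing wordSpiderItem and DocReader from your spider. The tempfile/os handling is standard library; the to_text() extraction call is an assumption, so substitute whatever text-extraction method your miette version actually exposes:

import os
import tempfile

def parse_data(self, response):
    job = wordSpiderItem()
    job['link'] = response.url

    # DocReader wants a filename, so persist the downloaded bytes first
    fd, path = tempfile.mkstemp(suffix='.doc')
    try:
        with os.fdopen(fd, 'wb') as tmp:
            tmp.write(response.body)
        reader = DocReader(path)  # a real file path, not a StringIO object
        # Assumed extraction call -- adjust to your miette version's API
        job['Description'] = reader.to_text()
    finally:
        os.remove(path)

    return job

tempfile.mkstemp is used rather than NamedTemporaryFile because on Windows (where you are running) a file held open by NamedTemporaryFile generally cannot be reopened by name until it is closed.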
