Nonblocking Scrapy pipeline to database

Question
I have a web scraper in Scrapy that gets data items. I want to asynchronously insert them into a database as well.
For example, I have a transaction that inserts some items into my db using SQLAlchemy Core:
def process_item(self, item, spider):
    with self.connection.begin() as conn:
        conn.execute(insert(table1).values(item['part1']))
        conn.execute(insert(table2).values(item['part2']))
I understand that it's possible to use SQLAlchemy Core asynchronously with Twisted using alchimia. The documentation code example for alchimia is below.
What I don't understand is how I can use my above code in the alchimia framework. How can I set up process_item to use a reactor?
Can I do something like this?
@inlineCallbacks
def process_item(self, item, spider):
    with self.connection.begin() as conn:
        yield conn.execute(insert(table1).values(item['part1']))
        yield conn.execute(insert(table2).values(item['part2']))
How do I write the reactor part? Or is there a simpler way to do non-blocking database inserts in a Scrapy pipeline?
For reference, here is the code example from alchimia's documentation:
from alchimia import TWISTED_STRATEGY
from sqlalchemy import (
    create_engine, MetaData, Table, Column, Integer, String
)
from sqlalchemy.schema import CreateTable
from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react


@inlineCallbacks
def main(reactor):
    engine = create_engine(
        "sqlite://", reactor=reactor, strategy=TWISTED_STRATEGY
    )

    metadata = MetaData()
    users = Table("users", metadata,
        Column("id", Integer(), primary_key=True),
        Column("name", String()),
    )

    # Create the table
    yield engine.execute(CreateTable(users))

    # Insert some users
    yield engine.execute(users.insert().values(name="Jeremy Goodwin"))
    yield engine.execute(users.insert().values(name="Natalie Hurley"))
    yield engine.execute(users.insert().values(name="Dan Rydell"))
    yield engine.execute(users.insert().values(name="Casey McCall"))
    yield engine.execute(users.insert().values(name="Dana Whitaker"))

    result = yield engine.execute(users.select(users.c.name.startswith("D")))
    d_users = yield result.fetchall()

    # Print out the users
    for user in d_users:
        print "Username: %s" % user[users.c.name]

if __name__ == "__main__":
    react(main, [])
Answer
How can I set up process_item to use a reactor?
You don't need to manage another reactor in your pipeline. Instead, you can do asynchronous database interactions within an item pipeline by returning a Deferred from process_item; Scrapy already runs on the Twisted reactor and will wait for that Deferred to fire without blocking the crawl.
See also the Scrapy documentation on item pipelines.