Can't read/write to files using multithreading in python

Problem description

I have an input file which contains a long list of URLs. Let's assume this is in mylines.txt:

https://yahoo.com
https://google.com
https://facebook.com
https://twitter.com

What I need to do is:

1. Read a line from the input file mylines.txt.

2. Execute the myFun function, which will perform some tasks and produce an output consisting of a single line. It is more complex in my real code, but something like this in concept.

3. Write the output to the results.txt file.

Since I have a large input, I need to leverage Python multithreading. I looked at this good post here. But unfortunately, it assumes the input is in a simple list, and does not assume I want to write the output of the function to a file.

I need to ensure that each input's output is written on a single line (the danger being that if multiple threads write to the same line, I get incorrect data).

I tried to mess around, but with no success. I have not used Python's multithreading before, but it is time to learn, as it is unavoidable in my case. I have a very long list which cannot finish in a reasonable time without multithreading. My function will not do this simple task, but more operations that are not necessary for the concept.

Here is my attempt. Please correct me (in the code itself):

import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import Queue

def myFunc(url):
        response = requests.get(url, verify=False ,timeout=(2, 5))
        results = open("myresults","a") # "a" to append results
        results.write("url is:",url, ", response is:", response.url)
        results.close()

worker_data = open("mylines.txt","r") # open my input file.

#load up a queue with your data, this will handle locking
q = Queue.Queue()

for url in worker_data:
    q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

# close the pool and wait for the work to finish
pool.close()
pool.join()

Q: How can I fix the above code (please be concise and help me in the code itself) so that it reads a line from the input file, executes the function, and writes the result associated with the input on its own line, using Python multithreading to execute the requests concurrently, so I can finish my list in a reasonable time?

Update:

Based on the answer, the code became:

import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import queue
from multiprocessing import Queue

def myFunc(url):
    response = requests.get(url, verify=False ,timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

worker_data = open("mylines.txt","r") # open my input file.

#load up a queue with your data, this will handle locking
q = queue.Queue(4)
with open("mylines.txt","r") as f: # open my input file.
    for url in f:
        q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

with open("myresults","w") as f:
    for line in results:
        f.write(line + '\n')

The mylines.txt file contains:

https://yahoo.com
https://www.google.com
https://facebook.com
https://twitter.com

Note that at first I was using:

import Queue

And: q = Queue.Queue(4)

But I got this error:

Traceback (most recent call last):
  File "test3.py", line 4, in <module>
    import Queue
ModuleNotFoundError: No module named 'Queue'

Based on some search I changed to:

import queue

And changed the relevant line to: q = queue.Queue(4)
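
(For reference: the Queue module was indeed renamed to queue in Python 3. A common compatibility import that works on both versions, not part of the original code, is:)

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2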

I also added:

from multiprocessing import Queue

But nothing works. Can any expert in python multithreading help?

Answer

Rather than having the worker pool threads print the result out, which is not guaranteed to buffer the output correctly, create one more thread that reads results from a second Queue and prints them.

I've modified your solution so it builds its own thread pool of workers. There's little point in giving the queue infinite length, since the main thread will block when the queue reaches its maximum size: you only need it long enough to make sure there's always work available for the worker threads; the main thread will block and unblock as the queue size rises and falls.
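
For example, the input queue could be bounded like this (the maxsize value here is an illustrative choice, not taken from the answer):

import queue

POOL_SIZE = 4
# put() blocks once 2 * POOL_SIZE URLs are waiting, which throttles
# the main thread to roughly the pace of the worker threads.
inq = queue.Queue(maxsize=2 * POOL_SIZE)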

It also identifies the thread responsible for each item on the output queue, which should give you some confidence that the multithreading approach is working, and prints the response code from the server. I found I had to strip the newlines from the URLs.

Since now only one thread is writing to the file, writes are always perfectly in sync and there is no chance of them interfering with each other.

import threading
import requests
import queue
POOL_SIZE = 4

def myFunc(inq, outq):  # worker thread deals only with queues
    while True:
        url = inq.get()  # Blocks until something is available
        if url is None:  # None is the sentinel telling this worker to quit
            break
        response = requests.get(url.strip(), timeout=(2, 5))
        outq.put((url, response, threading.current_thread().name))


class Writer(threading.Thread):
    def __init__(self, q):
        super().__init__()
        self.results = open("myresults","a") # "a" to append results
        self.queue = q
    def run(self):
        while True:
            url, response, threadname = self.queue.get()
            if response is None:  # sentinel tuple: no more results are coming
                self.results.close()
                break
            print("****url is:",url, ", response is:", response.status_code, response.url, "thread", threadname, file=self.results)

#load up a queue with your data, this will handle locking
inq = queue.Queue()  # could usefully limit queue size here
outq = queue.Queue()

# start the Writer
writer = Writer(outq)
writer.start()

# make the Pool of workers
threads = []
for i in range(POOL_SIZE):
    thread = threading.Thread(target=myFunc, name=f"worker{i}", args=(inq, outq))
    thread.start()
    threads.append(thread)

# push the work onto the queues
with open("mylines.txt","r") as worker_data: # open my input file.
    for url in worker_data:
        inq.put(url.strip())
for thread in threads:
    inq.put(None)

# close the pool and wait for the workers to finish
for thread in threads:
    thread.join()

# Terminate the writer
outq.put((None, None, None))
writer.join()

Using the data given in mylines.txt I see the following output:

****url is: https://www.google.com , response is: 200 https://www.google.com/ thread worker1
****url is: https://twitter.com , response is: 200 https://twitter.com/ thread worker2
****url is: https://facebook.com , response is: 200 https://www.facebook.com/ thread worker0
****url is: https://www.censys.io , response is: 200 https://censys.io/ thread worker1
****url is: https://yahoo.com , response is: 200 https://uk.yahoo.com/?p=us thread worker3
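
For comparison, the same fetch-then-write pattern can be expressed more compactly with the standard library's concurrent.futures module. This is an alternative sketch under the same assumptions (mylines.txt in, myresults out), not part of the original answer:

import concurrent.futures
import requests

def fetch(url):
    response = requests.get(url, timeout=(2, 5))
    return "url is: {} , response is: {}".format(url, response.url)

with open("mylines.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# executor.map runs fetch concurrently but yields results in input
# order, and only the main thread writes to the file, so no locking
# or writer thread is needed.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    with open("myresults", "a") as results:
        for line in executor.map(fetch, urls):
            print(line, file=results)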
