ThreadPoolExecutor vs threading.Thread

I have a performance question about ThreadPoolExecutor versus the Thread class, and it seems I am missing some basic understanding.

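I have a web scraper with two functions. The first parses the home page of a website for image links, and the second downloads the images from the parsed links: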

import os
import threading
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup as bs

path = r'C:\Users\MyDocuments\Pythom\Networking\bbc_images_scraper_test'
url = 'https://www.bbc.co.uk'

# Function to parse image links from the page
def img_links_parser(url, links_list):
    res = urllib.request.urlopen(url)
    soup = bs(res, 'lxml')
    content = soup.find_all('div', {'class': 'top-story__image'})
    for i in content:
        try:
            link = i.attrs['style']
            # Pull the link out of the parentheses
            link = link[link.find('(') + 1 : link.find(')')]
            # Put the link in the list of links
            links_list.append(link)
        except KeyError:
            # Links might be under the 'data-lazy' attribute, without parentheses
            links_list.append(i.attrs['data-lazy'])

# Function to load images from links
def img_loader(base_url, links_list, path_location):
    for link in links_list:
        try:
            # The last element of the link is the file name (name.jpg)
            file_name = link.split('/')[-1]
            # Follow the link and save the content in the given directory
            urllib.request.urlretrieve(urllib.parse.urljoin(base_url, link),
                                       os.path.join(path_location, file_name))
        except Exception:
            print('Error on {}'.format(urllib.parse.urljoin(base_url, link)))

The code below is split into two cases:

Case 1: I use multiple threads:

# links is assumed to have been filled beforehand via img_links_parser(url, links)
threads = []
t1 = threading.Thread(target=img_loader, args=(url, links[:10], path))
t2 = threading.Thread(target=img_loader, args=(url, links[10:20], path))
t3 = threading.Thread(target=img_loader, args=(url, links[20:30], path))
t4 = threading.Thread(target=img_loader, args=(url, links[30:40], path))
t5 = threading.Thread(target=img_loader, args=(url, links[40:50], path))
t6 = threading.Thread(target=img_loader, args=(url, links[50:], path))
threads.extend([t1, t2, t3, t4, t5, t6])

for t in threads:
    t.start()

for t in threads:
    t.join()

The code above completes the job in 10 seconds on my machine.
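
The six hand-written threads in case 1 could also be generated in a loop. The following is only a minimal sketch: `chunked` and `worker` are hypothetical names that do not appear in the original code, `worker` stands in for img_loader, and the link list is placeholder data. Note one difference: the original sixth thread takes everything from index 50 onwards, whereas this sketch starts one thread per ten links.

```python
import threading


def chunked(seq, size):
    # Hypothetical helper: split seq into consecutive slices of at most `size` items.
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def worker(chunk, results):
    # Stand-in for img_loader: it only records the chunk it was given.
    results.append(list(chunk))


links = ['img{}.jpg'.format(n) for n in range(55)]  # placeholder data
results = []

# One thread per chunk of ten links, instead of t1 ... t6 written out by hand.
threads = [threading.Thread(target=worker, args=(chunk, results))
           for chunk in chunked(links, 10)]

for t in threads:
    t.start()

for t in threads:
    t.join()

print('{} threads handled {} links'.format(len(threads),
                                           sum(len(r) for r in results)))
```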

Case 2: I use ThreadPoolExecutor:

with ThreadPoolExecutor(50) as exec:
    results = exec.submit(img_loader, url, links, path)

The code above takes 18 seconds.

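My understanding is that ThreadPoolExecutor creates one thread per worker. So, assuming I set max_workers to 50, this should result in 50 threads and therefore finish the job faster.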
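
One way to check this assumption is to record which thread actually runs the work. The sketch below is illustrative only: `record_thread` is a hypothetical stand-in for img_loader that sleeps instead of downloading, and `items` is placeholder data. It compares a single submit call for the whole list with one submit call per item.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def record_thread(items, seen):
    # Hypothetical stand-in for img_loader: notes which thread handles each item.
    for _ in items:
        seen.add(threading.current_thread().name)
        time.sleep(0.05)  # placeholder for a download


items = list(range(20))  # placeholder data

# One submit call: the whole list is a single task, so a single thread runs it.
seen_single = set()
with ThreadPoolExecutor(max_workers=50) as executor:
    executor.submit(record_thread, items, seen_single)

# One submit call per item: each task can be picked up by a different thread.
seen_many = set()
with ThreadPoolExecutor(max_workers=50) as executor:
    for item in items:
        executor.submit(record_thread, [item], seen_many)

print('single submit used {} thread(s)'.format(len(seen_single)))
print('one submit per item used {} thread(s)'.format(len(seen_many)))
```

max_workers is an upper bound on the number of threads. A task passed to submit runs on one worker thread, so the pool only spreads work across threads when it is given several tasks.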

Can someone please explain what I am missing here? I admit I am probably making a silly mistake, but I just don't see it.

Many thanks!

Answer:

In case 2 you are sending all the links to a single worker. Instead of

exec.submit(img_loader, url, links, path) 

you need:

for link in links:
    exec.submit(img_loader, url, [link], path)

I did not try this myself; it is just from reading the documentation of ThreadPoolExecutor.
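
The effect of the change can be estimated without touching the network. The following is a rough sketch, not a benchmark of the real scraper: `fake_loader` is a hypothetical stand-in for img_loader that sleeps for a fixed time per link, and the link list is placeholder data. Real timings depend on the server, bandwidth, and number of images.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_loader(links_list):
    # Hypothetical stand-in for img_loader: sleeps instead of downloading.
    for _ in links_list:
        time.sleep(0.05)
    return len(links_list)


links = ['img{}.jpg'.format(n) for n in range(10)]  # placeholder data

# Case 2 as written: one task that holds every link.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=50) as executor:
    executor.submit(fake_loader, links)
single_task_time = time.perf_counter() - start

# Suggested fix: one task per link. Leaving the with block waits for all tasks.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(fake_loader, [link]) for link in links]
per_link_time = time.perf_counter() - start

# result() returns the task's return value and re-raises any exception it hit.
done = sum(f.result() for f in futures)

print('one task for all links: {:.2f}s'.format(single_task_time))
print('one task per link:      {:.2f}s ({} links)'.format(per_link_time, done))
```

Keeping the Future objects returned by submit is optional, but calling result() on them is a simple way to surface errors that would otherwise pass silently inside the worker threads.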

