Python multithreaded GET requests raise urllib3.connectionpool "Failed to parse headers"
Requirement
Use multiple threads to send GET requests to a given list of URLs in batch (all URLs are unique).
Problem description
The requests call already sets timeout=3. After starting, the program prints results normally at first, then stops producing output for a while; inspecting the process shows a large number of threads that never exit. After running for some time, the program raises the error in the title (the full error message is pasted below). Similar cases I found suggest it may be a keep-alive problem, so I set req.keep_alive = False, but that made no difference.
Could anyone help me figure out what causes this? Many thanks, and apologies if the description is unclear.
The program code is at the end.
Full error message (URLs replaced with XXX; the URL in the error was accessible at the time):
2020-01-18 02:33:55,363 urllib3.connectionpool [WARNING] - Failed to parse headers (url=https://XXX/XXX.conf): [MissingHeaderBodySeparatorDefect()], unparsed data: "Mí\x99\x81M\x8fMIû+Pµ!ó:aç\x96\x90QÔIhvNOÄÍùS\x16.\x03UiqØÉó\x0c\x9b®'Oj\x15þ\x06\x1b\x93\x18\x8dçøÈþjw\x89è\\\x0bõ\x7f\x10Q*¢\xa0\x06ÿm/\x02^(aÐ\x12\x9b˯ÈkfÙSÉ\x81\x9a8§\xa0\\\x9938g\x88Âdñ=ÊaÑuv®\x8e^õ2\x9a»»\x1cÎê¾ásóÆðAÅ:÷ú¯·2®\x1fyä{¼ãÀ¢¦,ÃR7L\x9ff!`\x15\x81<©*»{ï(+.ÐW½Ñ»ß\x8dÅ.\x1c¨·¢\x91àr´cÙÆ=-ÄÜ¡;HttpOnly;Path=/;Secure\r\nSet-Cookie: NSC_AAAC=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT;Secure\r\nSet-Cookie: NSC_EPAC=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT;Secure\r\nSet-Cookie: NSC_USER=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT;Secure\r\nSet-Cookie: NSC_TEMP=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT;Secure\r\nSet-Cookie: NSC_PERS=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT;Secure\r\nSet-Cookie: NSC_BASEURL=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT;Secure\r\nSet-Cookie: CsrfToken=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT;Secure\r\nSet-Cookie: CtxsAuthId=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT;Secure\r\nSet-Cookie: ASP.NET_SessionId=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT;Secure\r\nSet-Cookie: NSC_TMAA=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT\r\nSet-Cookie: NSC_TMAS=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT;Secure\r\nSet-Cookie: NSC_TEMP=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT\r\nSet-Cookie: NSC_PERS=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT\r\nSet-Cookie: NSC_AAAC=xyz;Path=/;expires=Wednesday, 09-Nov-1999 23:12:40 GMT\r\nConnection: close\r\nContent-Length: 551\r\nCache-control: no-cache, no-store, must-revalidate\r\nPragma: no-cache\r\nContent-Type: text/html\r\n\r\n"Traceback (most recent call last):
  File "D:\soft\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 441, in _make_request
    assert_header_parsing(httplib_response.msg)
  File "D:\soft\Python\Python37\lib\site-packages\urllib3\util\response.py", line 71, in assert_header_parsing
    raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
urllib3.exceptions.HeaderParsingError: [MissingHeaderBodySeparatorDefect()], unparsed data: "Mí\x99\x81M\x8fMIû+Pµ!ó:aç\x96\x90QÔIhvNOÄÍùS
Program code
def checking(url):
    # business logic
    try:
        url_new = '%s/xxx.html' % url
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
        }
        req = requests.session()
        req.keep_alive = False  # attempt to disable keep-alive in urllib3
        res = req.get(url_new, headers=header, timeout=3, verify=False, allow_redirects=False)
        if 'target_text' in str(res.content):
            logger.info('[+] task %s is SUCC' % (url))
        else:
            logger.info('[-] task %s is FAIL' % (url))
    except:
        pass
def get_url_list(filename):
    url_list = []
    with open(filename, 'r', encoding='utf-8') as file:
        while True:
            url = file.readline().strip()
            if not url:
                break
            else:
                if url != '': url_list.append(url)
            print('\rRead %s' % len(url_list), end='', flush=True)
    print('')
    return url_list
if __name__ == '__main__':
    # read the URL list
    url_list = get_url_list('data/host.txt')
    thread_list = []
    for url in url_list:
        thread = Thread(target=checking, args=(url,))
        thread.start()  # start() returns None, so keep the Thread object itself
        thread_list.append(thread)
        time.sleep(0.05)
    for th in thread_list:
        th.join()
    print('All threads finished')
Answer:
Try this pattern, so the session is always closed when you are done with it:
with requests.session() as req:
    pass  # do the request inside the with-block
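Expanding that one-liner into a runnable sketch of the question's checking() function (the /xxx.html suffix and 'target_text' marker are the question's own placeholders; this version returns a bool instead of logging, for brevity). Using requests.Session as a context manager guarantees session.close() runs, releasing the pooled sockets even when the request raises, so half-closed keep-alive connections do not accumulate across threads.

```python
import requests

def checking(url):
    """Probe url + '/xxx.html' and report whether the marker text appears.

    A sketch of the question's checking(), rewritten so the Session is
    always closed via a context manager.
    """
    headers = {'User-Agent': 'Mozilla/5.0'}
    url_new = '%s/xxx.html' % url
    try:
        # The with-block closes the session (and its connection pool)
        # on exit, whether the request succeeds or raises.
        with requests.Session() as session:
            res = session.get(url_new, headers=headers, timeout=3,
                              verify=False, allow_redirects=False)
            return 'target_text' in res.text
    except requests.RequestException:
        # Network errors (DNS failure, timeout, reset) count as a miss.
        return False
```

Beyond closing sessions, it may also help to bound concurrency (e.g. a fixed-size thread pool) rather than starting one thread per URL, since the question reports a large number of threads that never exit.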