Using a Tor proxy with Scrapy

I need help setting up Tor on Ubuntu and using it with the Scrapy framework.

I did some research and found this guide:

```python
import telnetlib
import time

from scrapy import log
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware


class RetryChangeProxyMiddleware(RetryMiddleware):

    def _retry(self, request, reason, spider):
        log.msg('Changing proxy')
        # Connect to Tor's control port and request a new circuit (NEWNYM).
        tn = telnetlib.Telnet('127.0.0.1', 9051)
        tn.read_until("Escape character is '^]'.", 2)
        tn.write('AUTHENTICATE "267765"\r\n')
        tn.read_until("250 OK", 2)
        tn.write("signal NEWNYM\r\n")
        tn.read_until("250 OK", 2)
        tn.write("quit\r\n")
        tn.close()
        time.sleep(3)
        log.msg('Proxy changed')
        return RetryMiddleware._retry(self, request, reason, spider)
```
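For reference, the same control-port exchange can be written with the standard `socket` module instead of `telnetlib` (which is deprecated and removed in recent Python versions). This is a minimal sketch, assuming Tor runs locally with `ControlPort 9051` and the password shown above; the helper names `newnym_commands` and `request_new_identity` are my own, not Scrapy or Tor API:

```python
import socket
import time


def newnym_commands(password):
    """Build the control-port command sequence that asks Tor for a new circuit."""
    return [
        b'AUTHENTICATE "' + password.encode('ascii') + b'"\r\n',
        b'SIGNAL NEWNYM\r\n',
        b'QUIT\r\n',
    ]


def request_new_identity(password, host='127.0.0.1', port=9051, wait=3):
    """Send the commands to Tor's control port, then wait for a fresh circuit."""
    with socket.create_connection((host, port), timeout=5) as conn:
        for cmd in newnym_commands(password):
            conn.sendall(cmd)
            conn.recv(1024)  # read the "250 OK" acknowledgement
    time.sleep(wait)  # give Tor a moment to build the new circuit
```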

Then enable it in settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    'spider.middlewares.RetryChangeProxyMiddleware': 600,
}
```

Then, if you just want to send requests through the local Tor proxy (via Polipo), you can do it like this:

```shell
tsocks scrapy crawl spider
```
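Instead of wrapping the whole process with tsocks, you can also set the proxy per request from a downloader middleware; Scrapy's built-in HttpProxyMiddleware honours `request.meta['proxy']`. A minimal sketch, assuming Polipo listens on its default port 8123 and forwards to Tor's SOCKS port (the class name is mine):

```python
class TorHttpProxyMiddleware:
    """Route every request through a local HTTP proxy that forwards to Tor."""

    def __init__(self, proxy_url='http://127.0.0.1:8123'):  # Polipo's default port
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware picks this value up downstream.
        request.meta['proxy'] = self.proxy_url
```

You would enable this in DOWNLOADER_MIDDLEWARES with a priority below 750, where HttpProxyMiddleware runs.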

Can anyone confirm that this approach works and that you actually get a different IP?
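One way to check this yourself is to ask an IP-echo service for your apparent address before and after sending the NEWNYM signal. A hedged sketch, assuming Polipo is on 127.0.0.1:8123 and using httpbin.org/ip as the echo endpoint; both helper names are mine:

```python
import json
import urllib.request


def fetch_exit_ip(proxy_url='http://127.0.0.1:8123',
                  echo_url='https://httpbin.org/ip'):
    """Fetch the public IP the target site sees, going through the proxy."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url}))
    with opener.open(echo_url, timeout=30) as resp:
        return json.loads(resp.read().decode())['origin']


def identity_changed(ip_before, ip_after):
    """True if Tor handed out a different exit node after NEWNYM."""
    return ip_before != ip_after
```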

Answer:

You can use this middleware to have a random user agent for every request the spider makes.

```python
# You can define a USER_AGENT_LIST in your settings and the spider will choose
# a random user agent from that list every time.
#
# You will have to disable the default user agent middleware and add this to
# your settings file:
#
# DOWNLOADER_MIDDLEWARES = {
#     'scraper.random_user_agent.RandomUserAgentMiddleware': 400,
#     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
# }

import random

from scraper.settings import USER_AGENT_LIST
from scrapy import log


class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)
        # log.msg('>>>> UA %s' % request.headers)

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: dushyant
# date  : Sep 16, 2011
```
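For completeness, `USER_AGENT_LIST` is just a plain list in settings.py; the values below are illustrative placeholders, not from the original post:

```python
# settings.py -- hypothetical example values
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
]
```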

That is the full content of "Using a Tor proxy with Scrapy". Source: utcz.com/qa/430024.html
