Setting Up a Proxy Pool for Crawlers
Date: 2019/05/09 Categories: Work Tags: Crawler
Because of problems with the Sogou crawler, I didn't have enough IPs on hand, so after some research I found Scylla, a proxy pool project written in Python.
Here is how to run it:
docker pull docker.io/wildcat/scylla
docker run -it --net=host --entrypoint /usr/local/bin/python --name scylla docker.oa.com/andyfei/scylla_proxy_pool -m scylla --web-port 8080 --web-host 127.0.0.1 --no-forward-proxy-server
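To confirm the pool actually came up, the usual Docker commands can be used to check the container and tail its output (this check is not part of the original notes, just a quick sanity step):
docker ps --filter name=scylla
docker logs -f scylla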
To fetch the proxy list (it may span multiple pages):
curl 127.0.0.1:8080/api/v1/proxies\?page=1
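The endpoint returns JSON. Judging only from the fields consumed by the script below, one page of results looks roughly like this (the concrete values are made up for illustration):
{
    "proxies": [
        {"ip": "1.2.3.4", "port": 3128, "is_https": false, "is_anonymous": true}
    ],
    "total_page": 3
}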
All the proxies can be collected with the code below:
import requests
def get_proxies(page=1):
    # Query one page of the Scylla proxy API.
    r = requests.get('http://10.175.130.91:8080/api/v1/proxies',
                     params={'page': page})
    obj = r.json()
    # Keep only anonymous proxies, formatted as http(s)://ip:port.
    proxies = ['http{}://{}:{}'.format('s' if i['is_https'] else '',
                                       i['ip'], i['port'])
               for i in obj['proxies'] if i['is_anonymous']]
    return proxies, int(obj['total_page'])

def get_all_proxies():
    # Page 1 tells us the total page count; then fetch the remaining pages.
    proxies, total_page = get_proxies()
    for page in range(2, total_page + 1):
        p, _ = get_proxies(page)
        proxies += p
    return proxies

proxies = get_all_proxies()
with open('proxy.txt', 'w') as f:
    for p in proxies:
        f.write(p)
        f.write('\n')
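As a quick sanity check, a collected proxy can be handed to requests through its proxies argument. The sketch below assumes proxy.txt was written by the script above; httpbin.org/ip is just an example target, not something from the original notes:
import random
import requests

# Read back the proxy list written above and test one proxy at random.
with open('proxy.txt') as f:
    proxy_list = [line.strip() for line in f if line.strip()]

proxy = random.choice(proxy_list)
try:
    r = requests.get('http://httpbin.org/ip',
                     proxies={'http': proxy, 'https': proxy},
                     timeout=10)
    print(proxy, '->', r.text)
except requests.RequestException as e:
    print(proxy, 'failed:', e)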