【Problem】
When accessing the network with Python's urllib2 and similar libraries, I found that some URLs are very slow to load, for example:
http://www.wheelbynet.com/docs/auto/view_ad2.php3?ad_ref=auto58XXKHTS7098
But when going through a proxy (here a GAE-based one), access became much faster.
So I wanted to add proxy support to my Python network access code.
【Process】
1. Referred to:
http://docs.python.org/2/library/urllib2.html
http://docs.python.org/2/library/urllib2.html#urllib2.ProxyHandler
urllib2.proxyhandler in python 2.5
and tried the following code:
import urllib2

def initProxy(singleProxyDict = {}):
    """Add proxy support so that later urllib2 calls automatically use this proxy
    Note:
    1. temporarily does not support username and password
    2. after this init, later urllib2.urlopen will automatically use this proxy
    """
    proxyHandler = urllib2.ProxyHandler(singleProxyDict);
    print "proxyHandler=",proxyHandler;
    proxyOpener = urllib2.build_opener(proxyHandler);
    print "proxyOpener=",proxyOpener;
    urllib2.install_opener(proxyOpener);
    urllib2.urlopen("http://www.baidu.com");
    return;

Then you can see that the corresponding GAE proxy gets used:
INFO – [Jul 02 12:59:02] 127.0.0.1:52880 "GAE GET http://www.baidu.com HTTP/1.1" 200 10407
【Summary】
The following function:
def initProxy(singleProxyDict = {}):
    """Add proxy support so that later urllib2 calls automatically use this proxy
    Note:
    1. temporarily does not support username and password
    2. after this init, later urllib2.urlopen will automatically use this proxy
    """
    proxyHandler = urllib2.ProxyHandler(singleProxyDict);
    print "proxyHandler=",proxyHandler;
    proxyOpener = urllib2.build_opener(proxyHandler);
    print "proxyOpener=",proxyOpener;
    urllib2.install_opener(proxyOpener);
    return;

How to call it:
First initialize:
crifanLib.initProxy({'http':"http://127.0.0.1:8087"});
Then use urllib2 as usual: any subsequent urllib2 network access already goes through this proxy. For example:
urllib2.urlopen("http://www.baidu.com");
That is all it takes.
【Postscript】
1. Later I found a problem here:
the html I got back was all garbled.
I initially suspected the cause was using cookies together with the proxy:
#init
crifanLib.initAutoHandleCookies();
#here use gae 127.0.0.1:8087
crifanLib.initProxy({'http':"http://127.0.0.1:8087"});
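As an aside, there is a concrete reason the two separate init calls cannot simply be stacked: urllib2.install_opener replaces the global opener each time it is called, so the second call silently discards the cookie handler installed by the first. A minimal sketch of that behavior:

import urllib2
import cookielib

# first install an opener that records cookies
cj = cookielib.CookieJar();
urllib2.install_opener(urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)));

# installing a second opener REPLACES the first one entirely:
# from here on, urlopen goes through the proxy but no longer records cookies
urllib2.install_opener(urllib2.build_opener(urllib2.ProxyHandler({'http': "http://127.0.0.1:8087"})));

That is why the combined initProxyAndCookie below builds a single opener containing both handlers.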
Later, after rereading the explanation in the official documentation,
I changed the call to:
crifanLib.initProxyAndCookie({'http':"http://127.0.0.1:8087"});
together with the new function:
# note: gVal is a crifanLib module-level dict holding shared state;
# urllib2 and cookielib are imported at the top of crifanLib
def initProxyAndCookie(singleProxyDict = {}, localCookieFileName=None):
    """Init proxy and cookie
    Note:
    1. after this init, later urllib2.urlopen will automatically use the proxy and handle cookies
    2. for proxy, temporarily does not support username and password
    """
    proxyHandler = urllib2.ProxyHandler(singleProxyDict);
    print "proxyHandler=",proxyHandler;
    if(localCookieFileName):
        gVal['cookieUseFile'] = True;
        #print "use cookie file";
        #gVal['cj'] = cookielib.FileCookieJar(localCookieFileName); #NotImplementedError
        gVal['cj'] = cookielib.LWPCookieJar(localCookieFileName); # preferred
        #gVal['cj'] = cookielib.MozillaCookieJar(localCookieFileName); # second choice
        #create cookie file
        gVal['cj'].save();
    else:
        #print "not use cookie file";
        gVal['cookieUseFile'] = False;
        gVal['cj'] = cookielib.CookieJar();
    proxyAndCookieOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(gVal['cj']), proxyHandler);
    print "proxyAndCookieOpener=",proxyAndCookieOpener;
    urllib2.install_opener(proxyAndCookieOpener);
    return;

But the returned html was still garbled.
2. It felt like the html decompression was at fault.
After some more fiddling and consulting a few references, I wrote code that could finally handle html of type
Content-Encoding: deflate
as well.
(Previously only html of type
Content-Encoding: gzip
could be handled.)
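To make the difference concrete, here is a small standalone sketch (not part of crifanLib) of the two zlib modes used below: wbits = 16+zlib.MAX_WBITS tells zlib to expect a gzip header and trailer, while a negative wbits means headerless raw deflate data:

import zlib
import gzip
import StringIO

original = "some html content " * 10;

# build a gzip-wrapped body, like a server answering Content-Encoding: gzip
buf = StringIO.StringIO();
gzipFile = gzip.GzipFile(fileobj=buf, mode="wb");
gzipFile.write(original);
gzipFile.close();
gzipBody = buf.getvalue();
assert zlib.decompress(gzipBody, 16 + zlib.MAX_WBITS) == original;

# build a raw deflate body, like Content-Encoding: deflate;
# zlib.compress adds a 2-byte zlib header and a 4-byte checksum, so strip them
rawDeflateBody = zlib.compress(original)[2:-4];
assert zlib.decompress(rawDeflateBody, -zlib.MAX_WBITS) == original;

One caveat worth knowing: some servers send zlib-wrapped data rather than raw deflate under Content-Encoding: deflate, so robust code may need to fall back to plain zlib.decompress(respHtml) when the raw-deflate attempt fails.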
3. The final conclusion:
The garbled html above was not caused by the earlier urllib2 install_opener and related calls at all, but by the returned html being compressed, i.e. gzip or deflate encoded. It was finally handled by the following code:
import zlib

def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True, postDataDelimiter="&") :
    # getUrlResponse is another crifanLib helper that wraps urllib2.urlopen
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip, postDataDelimiter);
    respHtml = resp.read();
    if(useGzip) :
        #print "---before unzip, len(respHtml)=",len(respHtml);
        respInfo = resp.info();

        # response headers typically look like:
        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # sometimes the request asks for gzip,deflate but the returned html is actually uncompressed
        # -> the response info then does not include the above "Content-Encoding: gzip"
        # eg: http://blog.sina.com.cn/s/comment_730793bf010144j7_3.html
        # -> so only decode when the data is indeed compressed
        #Content-Encoding: deflate
        if("Content-Encoding" in respInfo):
            if("gzip" in respInfo['Content-Encoding']):
                respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);
            if("deflate" in respInfo['Content-Encoding']):
                respHtml = zlib.decompress(respHtml, -zlib.MAX_WBITS);
    return respHtml;

With this, both gzip and deflate are supported.
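For completeness, a hedged usage sketch tying the pieces together, assuming crifanLib is importable and its getUrlResponse helper sends Accept-Encoding: gzip, deflate when useGzip is True:

import crifanLib

# route everything through the local GAE proxy, then fetch and auto-decompress
crifanLib.initProxy({'http': "http://127.0.0.1:8087"});
respHtml = crifanLib.getUrlRespHtml("http://www.baidu.com");
print "len(respHtml)=", len(respHtml);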
Note: for more about crifanLib.py, see:
http://code.google.com/p/crifanlib/source/browse/trunk/python/crifanLib.py