【Solved】Using a proxy for network access in Python

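【Problem】

When accessing the network from Python with libraries such as urllib2, some URLs are very slow to load, for example:

http://www.wheelbynet.com/docs/auto/view_ad2.php3?ad_ref=auto58XXKHTS7098

When the same URL is fetched through a proxy (GAE in this case), access is much faster.

So the goal is to add proxy support to Python's network access.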

【Process】

1. References:

How to use proxies with Python urllib/urllib2

http://docs.python.org/2/library/urllib2.html

http://docs.python.org/2/library/urllib2.html#urllib2.ProxyHandler

urllib2.proxyhandler in python 2.5

Then I tried this code:

import urllib2

def initProxy(singleProxyDict=None):
    """Install a global urllib2 opener that uses the given proxy

    Note:
    1. proxy username and password are not supported for now
    2. after this init, later urllib2.urlopen calls automatically use this proxy
    """
    if singleProxyDict is None:
        singleProxyDict = {}  # an empty dict means "use no proxy"

    proxyHandler = urllib2.ProxyHandler(singleProxyDict)
    print "proxyHandler=", proxyHandler
    proxyOpener = urllib2.build_opener(proxyHandler)
    print "proxyOpener=", proxyOpener
    urllib2.install_opener(proxyOpener)
    # test request: it should show up in the proxy's log
    urllib2.urlopen("http://www.baidu.com")

The GAE proxy log then shows that the request went through the proxy:

INFO – [Jul 02 12:59:02] 127.0.0.1:52880 "GAE GET http://www.baidu.com HTTP/1.1" 200 10407

 

【Summary】

Use the following function:

import urllib2

def initProxy(singleProxyDict=None):
    """Install a global urllib2 opener that uses the given proxy

    Note:
    1. proxy username and password are not supported for now
    2. after this init, later urllib2.urlopen calls automatically use this proxy
    """
    if singleProxyDict is None:
        singleProxyDict = {}  # an empty dict means "use no proxy"

    proxyHandler = urllib2.ProxyHandler(singleProxyDict)
    print "proxyHandler=", proxyHandler
    proxyOpener = urllib2.build_opener(proxyHandler)
    print "proxyOpener=", proxyOpener
    urllib2.install_opener(proxyOpener)

How to call it:

First initialize:

crifanLib.initProxy({'http': "http://127.0.0.1:8087"})

Then use urllib2 as usual:

Every later urllib2 network access goes through this proxy. For example:

urllib2.urlopen("http://www.baidu.com")

That is all.
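Because install_opener changes global state, every later urllib2.urlopen call in the process is affected. If only some requests should use the proxy, one option is to keep the opener local and call its open method instead. The sketch below is only an illustration, not part of crifanLib: the helper names build_proxy_opener and proxies_of are made up here, and the import fallback to urllib.request (the Python 3 home of the same classes) is an assumption added so the snippet runs on either version. It sends no request; it only inspects the opener.

```python
try:
    import urllib2  # Python 2, as used in this post
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent (assumption)


def build_proxy_opener(proxy_dict):
    """Hypothetical helper: return an opener that uses proxy_dict,
    without installing it globally."""
    proxy_handler = urllib2.ProxyHandler(proxy_dict)
    return urllib2.build_opener(proxy_handler)


def proxies_of(opener):
    """Hypothetical helper: return the proxy mapping of the first
    ProxyHandler registered in the opener, or {} if there is none."""
    for handler in opener.handlers:
        if isinstance(handler, urllib2.ProxyHandler):
            return handler.proxies
    return {}


# Same local GAE proxy address as in the post
opener = build_proxy_opener({"http": "http://127.0.0.1:8087"})
print(proxies_of(opener))
```

A request through this opener would then be opener.open("http://www.baidu.com"), while plain urllib2.urlopen keeps its default behavior.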

 


【Postscript】

1. Later I found a problem:

The returned HTML was garbled.

At first I thought the cause was using cookies together with the proxy:

    # init
    crifanLib.initAutoHandleCookies()
    # here use gae 127.0.0.1:8087
    crifanLib.initProxy({'http': "http://127.0.0.1:8087"})

 

Then I consulted the explanation in the official documentation:

urllib2.build_opener([handler, ...])

Return an OpenerDirector instance, which chains the handlers in the order given. handlers can be either instances of BaseHandler, or subclasses of BaseHandler (in which case it must be possible to call the constructor without any parameters). Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

If the Python installation has SSL support (i.e., if the ssl module can be imported), HTTPSHandler will also be added.

Beginning in Python 2.3, a BaseHandler subclass may also change its handler_order attribute to modify its position in the handlers list.

The following exceptions are raised as appropriate:

So I changed the initialization to a single call:

crifanLib.initProxyAndCookie({'http': "http://127.0.0.1:8087"})

    
import cookielib
import urllib2

def initProxyAndCookie(singleProxyDict=None, localCookieFileName=None):
    """Init proxy and cookie

    Note:
    1. after this init, later urllib2.urlopen calls use the proxy and handle cookies automatically
    2. for the proxy, username and password are not supported for now
    """
    if singleProxyDict is None:
        singleProxyDict = {}  # an empty dict means "use no proxy"

    proxyHandler = urllib2.ProxyHandler(singleProxyDict)
    print "proxyHandler=", proxyHandler

    # gVal is the module-level dict of global values in crifanLib
    if localCookieFileName:
        gVal['cookieUseFile'] = True
        #gVal['cj'] = cookielib.FileCookieJar(localCookieFileName) # NotImplementedError
        gVal['cj'] = cookielib.LWPCookieJar(localCookieFileName) # preferred
        #gVal['cj'] = cookielib.MozillaCookieJar(localCookieFileName) # second choice
        # create the cookie file
        gVal['cj'].save()
    else:
        gVal['cookieUseFile'] = False
        gVal['cj'] = cookielib.CookieJar()

    # one opener that carries both the cookie processor and the proxy handler
    proxyAndCookieOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(gVal['cj']), proxyHandler)
    print "proxyAndCookieOpener=", proxyAndCookieOpener
    urllib2.install_opener(proxyAndCookieOpener)
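To see what such a combined opener actually contains, its handler list can be inspected without sending any request. The sketch below is an illustration under assumptions, not crifanLib code: the helper names build_proxy_cookie_opener and handler_names are made up, and the import fallbacks to urllib.request and http.cookiejar (the Python 3 names of the same modules) are added so it runs on either version. It shows that passing your own ProxyHandler replaces the default one rather than adding a second, and that the cookie processor sits in the same handler chain.

```python
try:
    import urllib2    # Python 2, as used in this post
    import cookielib
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalents (assumption)
    import http.cookiejar as cookielib


def build_proxy_cookie_opener(proxy_dict):
    """Hypothetical helper: build one opener with both a cookie
    processor and a proxy handler; nothing is installed globally."""
    cookie_jar = cookielib.CookieJar()
    opener = urllib2.build_opener(
        urllib2.HTTPCookieProcessor(cookie_jar),
        urllib2.ProxyHandler(proxy_dict),
    )
    return opener, cookie_jar


def handler_names(opener):
    """Hypothetical helper: class names of the opener's handlers,
    in the order the opener keeps them (sorted by handler_order)."""
    return [h.__class__.__name__ for h in opener.handlers]


opener, cookie_jar = build_proxy_cookie_opener({"http": "http://127.0.0.1:8087"})
names = handler_names(opener)
print(names)
```

The order of the arguments to build_opener matters less than it may seem, because the opener sorts handlers by their handler_order attribute, as the quoted documentation notes.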

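With this change the returned HTML was still garbled.

2. It looked like a problem with decompressing the HTML.

After some more trial and error, and with reference to:

A summary of tips for scraping sites with Python crawlers: advanced part

I wrote code that can finally handle HTML returned with:

Content-Encoding: deflate

(Previously only HTML returned with:

Content-Encoding: gzip

could be handled.)

3. The final conclusion:

The garbled HTML above was not caused by urllib2's install_opener or anything related to it. It was caused by the response being compressed HTML, either gzip or deflate. The fix was the following code: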

import zlib

def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True, postDataDelimiter="&"):
    # getUrlResponse is another crifanLib function that sends the request
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip, postDataDelimiter)
    respHtml = resp.read()
    if useGzip:
        respInfo = resp.info()

        # Example response headers:
        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # Sometimes the request asks for gzip,deflate but the returned html is not compressed
        # -> the response headers then do not include "Content-Encoding: gzip"
        # eg: http://blog.sina.com.cn/s/comment_730793bf010144j7_3.html
        # -> so only decompress when the data really is compressed

        if "Content-Encoding" in respInfo:
            if "gzip" in respInfo['Content-Encoding']:
                # 16+MAX_WBITS tells zlib to expect a gzip header and trailer
                respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS)

            if "deflate" in respInfo['Content-Encoding']:
                # negative wbits tells zlib to expect a raw deflate stream
                respHtml = zlib.decompress(respHtml, -zlib.MAX_WBITS)

    return respHtml

With this, both gzip and deflate responses are supported.
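The decompression step can also be tried on its own, without any network access. The sketch below is a standalone illustration, not the crifanLib implementation: the function name decode_body is made up here. It uses the same zlib calls as the code above and adds one assumption of its own: some servers send zlib-wrapped data under "deflate" instead of a raw deflate stream, so it falls back to a plain zlib.decompress when the raw attempt fails.

```python
import zlib


def decode_body(raw, content_encoding):
    """Hypothetical helper: decompress a response body according to
    the value of its Content-Encoding header (None if header is absent)."""
    encoding = (content_encoding or "").lower()
    if "gzip" in encoding:
        # 16 + MAX_WBITS: expect a gzip header and trailer
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    if "deflate" in encoding:
        try:
            # negative wbits: raw deflate stream, as handled in this post
            return zlib.decompress(raw, -zlib.MAX_WBITS)
        except zlib.error:
            # assumption: fall back to zlib-wrapped deflate data
            return zlib.decompress(raw)
    # no known compression: return the body unchanged
    return raw


# Round trip with locally compressed data instead of a real response
payload = b"<html>hello</html>" * 10
compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
gzipped = compressor.compress(payload) + compressor.flush()
print(decode_body(gzipped, "gzip") == payload)
```

Keeping the decompression separate from the request code makes it easy to check against locally compressed test data.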

 

Note: for more about crifanLib.py, see:

http://code.google.com/p/crifanlib/source/browse/trunk/python/crifanLib.py


