
[Solved] Adding gzip compression/decompression support when fetching web pages via urllib2.urlopen in Python


[Background]

I had previously implemented fetching web page content in Python; the existing code was:

import urllib
import urllib2

#------------------------------------------------------------------------------
# get response from url
# note: if you have already used a cookiejar, then it will automatically be used
# while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}) :
    # make sure url is str, not unicode, otherwise urllib2.urlopen will error
    url = str(url);

    if (postDict) :
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else :
        req = urllib2.Request(url);

    if(headerDict) :
        print "added header:",headerDict;
        for key in headerDict.keys() :
            req.add_header(key, headerDict[key]);

    # gConst['userAgentIE9'] is a user agent string defined elsewhere in the script
    req.add_header('User-Agent', gConst['userAgentIE9']);
    req.add_header('Cache-Control', 'no-cache');
    req.add_header('Accept', '*/*');
    #req.add_header('Accept-Encoding', 'gzip, deflate');
    req.add_header('Connection', 'Keep-Alive');
    resp = urllib2.urlopen(req);

    return resp;

#------------------------------------------------------------------------------
# get response html (== body) from url
def getUrlRespHtml(url, postDict={}, headerDict={}) :
    resp = getUrlResponse(url, postDict, headerDict);
    respHtml = resp.read();
    return respHtml;

This code does not support compression and decompression of the HTML.

Now I want to add support for that.

I had already implemented the corresponding functionality in C# and understood the logic involved, so the question here is only how to implement it in Python; the internal mechanism is already mostly understood.

[Solution Process]

1. I had briefly looked for related posts before, but did not get around to solving it at the time.

Now I know that you first add the gzip header to the HTTP request; the corresponding Python code is:

req.add_header('Accept-Encoding', 'gzip, deflate');

Then, in the returned HTTP response, the data obtained via read() is the gzip-compressed data.

The next step is to figure out how to decompress it.
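As a minimal sketch of this step (the URL is just a placeholder; the server must actually honor the header for the body to come back compressed):

import urllib2

req = urllib2.Request("http://www.example.com");  # placeholder URL
req.add_header('Accept-Encoding', 'gzip, deflate');
resp = urllib2.urlopen(req);
compressedHtml = resp.read();
# if the server compressed the body, the data starts with the
# gzip magic bytes '\x1f\x8b'
print len(compressedHtml), repr(compressedHtml[:2]);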

2. I first looked up gzip and found that the official Python documentation says this:

12.2. gzip — Support for gzip files

This module provides a simple interface to compress and decompress files just like the GNU programs gzip and gunzip would.

The data compression is provided by the zlib module.

That is, the gzip module compresses and decompresses files, while for (in-memory) data compression and decompression you use zlib.
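A small sketch of that distinction (test.html.gz is just a scratch file name):

import gzip
import zlib

# gzip module: file oriented, like the gzip/gunzip programs
f = gzip.open("test.html.gz", "wb");  # hypothetical scratch file
f.write("<html>test</html>");
f.close();
print gzip.open("test.html.gz", "rb").read();

# zlib module: operates on in-memory data directly
compressed = zlib.compress("<html>test</html>");
print zlib.decompress(compressed);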

So I then went to look at zlib:

zlib.decompress(string[, wbits[, bufsize]])

Decompresses the data in string, returning a string containing the uncompressed data. The wbits parameter controls the size of the window buffer, and is discussed further below. If bufsize is given, it is used as the initial size of the output buffer. Raises the error exception if any error occurs.

The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. When decompressing a stream, wbits must not be smaller than the size originally used to compress the stream; using a too-small value will result in an exception. The default value is therefore the highest value, 15. When wbits is negative, the standard gzip header is suppressed.

bufsize is the initial size of the buffer used to hold decompressed data. If more space is required, the buffer size will be increased as needed, so you don’t have to get this value exactly right; tuning it will only save a few calls to malloc(). The default size is 16384.
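A quick sketch of what those wbits values mean in practice (zlib.MAX_WBITS is 15):

import zlib

original = "some html content " * 100;

# a standard zlib stream, as produced by zlib.compress():
zlibData = zlib.compress(original);
print zlib.decompress(zlibData, zlib.MAX_WBITS) == original;   # default wbits = 15

# a raw deflate stream (no header/trailer): negative wbits
compressor = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS);
rawDeflateData = compressor.compress(original) + compressor.flush();
print zlib.decompress(rawDeflateData, -zlib.MAX_WBITS) == original;

# for gzip-wrapped data (the "Content-Encoding: gzip" case), newer zlib
# versions also accept wbits = 16 + zlib.MAX_WBITS, as used further below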

Then I used zlib.decompress directly in the program and got an error, which was later solved; for the details see:

[Solved] Python zlib.decompress error: error: Error -3 while decompressing data: incorrect header check

With that, the returned HTML could be decompressed.
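The gist of that error and its fix, as a runnable sketch (fabricating gzip data in memory to stand in for resp.read()):

import gzip
import zlib
from StringIO import StringIO

# fabricate some gzip-compressed data, standing in for resp.read()
buf = StringIO();
gzFile = gzip.GzipFile(fileobj=buf, mode="wb");
gzFile.write("<html>test</html>");
gzFile.close();
gzippedHtml = buf.getvalue();

try :
    zlib.decompress(gzippedHtml);  # default wbits expects a zlib header, not a gzip one
except zlib.error, err :
    print "plain decompress failed:", err;  # Error -3 ...: incorrect header check

# telling zlib to expect a gzip header makes it work:
print zlib.decompress(gzippedHtml, 16 + zlib.MAX_WBITS);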

3. Referring to this post:

http://flyash.itcao.com/post_1117.html

I learned that you can check whether the returned HTTP response contains Content-Encoding: gzip, and only then decide whether to call zlib to decompress; the full code below does exactly that.

4. Finally, the complete code was implemented, as follows:

import urllib
import urllib2
import zlib

#------------------------------------------------------------------------------
# get response from url
# note: if you have already used a cookiejar, then it will automatically be used
# while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False) :
    # make sure url is str, not unicode, otherwise urllib2.urlopen will error
    url = str(url);

    if (postDict) :
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else :
        req = urllib2.Request(url);

    # gConst['userAgentIE9'] is a user agent string defined elsewhere in the script
    defHeaderDict = {
        'User-Agent'    : gConst['userAgentIE9'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    };

    # add default headers first
    for eachDefHd in defHeaderDict.keys() :
        #print "add default header: %s=%s"%(eachDefHd,defHeaderDict[eachDefHd]);
        req.add_header(eachDefHd, defHeaderDict[eachDefHd]);

    if(useGzip) :
        #print "use gzip for",url;
        req.add_header('Accept-Encoding', 'gzip, deflate');

    # add customized headers later -> allow overwriting the default headers
    if(headerDict) :
        #print "added header:",headerDict;
        for key in headerDict.keys() :
            req.add_header(key, headerDict[key]);

    if(timeout > 0) :
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout);
    else :
        resp = urllib2.urlopen(req);

    return resp;

#------------------------------------------------------------------------------
# get response html (== body) from url
#def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=False) :
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True) :
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip);
    respHtml = resp.read();
    if(useGzip) :
        #print "---before unzip, len(respHtml)=",len(respHtml);
        respInfo = resp.info();

        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # sometimes the request allows gzip,deflate, but the returned html
        # is actually not gzipped -> the response info does not include the
        # above "Content-Encoding: gzip"
        # eg: http://blog.sina.com.cn/s/comment_730793bf010144j7_3.html
        # -> so only decode when the data is indeed gzipped
        if( ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip")) :
            respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);
            #print "+++ after unzip, len(respHtml)=",len(respHtml);

    return respHtml;
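For reference, a hypothetical call of the above (the URLs and form values are just examples):

# plain GET, gzip enabled by default
respHtml = getUrlRespHtml("http://www.example.com");

# POST with form data, a custom header, a 10 second timeout, and no gzip
respHtml = getUrlRespHtml(
    "http://www.example.com/login",
    postDict = {"user" : "me", "pwd" : "secret"},
    headerDict = {"Referer" : "http://www.example.com/"},
    timeout = 10,
    useGzip = False);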

 

[Summary]

To add gzip support to urllib2.urlopen in Python, the main logic is:

1. Add the corresponding gzip header to the request:

req.add_header('Accept-Encoding', 'gzip, deflate');

2. Then, after obtaining the returned HTML, decompress it with zlib:

respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);

Before decompressing, check whether the returned content really is gzip-compressed data, i.e. that the response contains "Content-Encoding: gzip", because the request may declare gzip support while the server still returns the original, uncompressed HTML.
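Putting the two steps together in a minimal standalone sketch (the URL is just a placeholder):

import urllib2
import zlib

req = urllib2.Request("http://www.example.com");  # placeholder URL
req.add_header('Accept-Encoding', 'gzip, deflate');
resp = urllib2.urlopen(req);
respHtml = resp.read();

# only decompress when the server really returned gzipped data
if(resp.info().get('Content-Encoding') == 'gzip') :
    respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS);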


Latest reader comments (2)

  1. Thanks a lot, the summary is very thorough and saved me many detours. Btw, to add a note on why the gzip module cannot be used directly here: during decompression, gzip needs to read back and forth at different positions in the file, so it cannot be used to decompress a data stream directly.
    soulstone (2014-11-05)