
How to analyze the HTML source of Baidu Space to find out how it fetches the comments of a post v2011-12-19

Crawl_EmulateLogin crifan


 


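For the steps that come before the HTML source analysis, see:

 [Implemented] Using a Python script to fetch posts, comments, and pictures from Baidu Space v2011-12-19

http://againinput4.blog.163.com/blog/static/1727994912011111153833643/


This analysis uses the following Baidu Space blog post as its example:

http://hi.baidu.com/recommend_music/blog/item/5fe2e923cee1f55e93580718.html

(Note: for now, view the page without logging in.)

The goal is to work out how Baidu Space fetches the comments of a post through AJAX/JavaScript.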

 

1      Overall page framework

 

<script>

        /*******Start Working*******/
        var _pageEndTime=(new Date().getTime()) - g_pageTimerStart;

        baidu.on(window, 'load', function(){
            var _pageLoadTime=(new Date().getTime()) - g_pageTimerStart;
            BdUtil.hi_trackerLink('m_20101109_blogloadtime', _pageLoadTime+'|'+_pageEndTime);
        });

        /**********Step2***********/
        _Space.pageDone.ok();

        baidu.G('share_user_list') && _Space.SectionLoad.add('share_user_list', _Space.BlogSharedUser.getData);
        baidu.G('in_related_tmp') && _Space.RelatedDoc && _Space.SectionLoad.add('in_related_tmp', _Space.RelatedDoc.show);
        baidu.G('in_reader') && _Space.SectionLoad.add('in_reader', _Space.latestReader.show);
        baidu.G('blogOpt') && _Space.SectionLoad.add('blogOpt', _Space.commentOperate.view);

</script>

 

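This script first loads the page; once loading has finished (pageDone), it starts loading the individual sections:

share_user_list: the users who have shared this page

in_related_tmp: the posts related to this page

in_reader: the readers of this post

blogOpt: presumably the blog's optional content; here it covers the remaining optional parts, namely the "Comments" and "Post a comment" sections.

 

blogOpt is the part we need to analyze.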

 

2      blogOpt

baidu.G('blogOpt') && _Space.SectionLoad.add('blogOpt', _Space.commentOperate.view);

This statement evaluates baidu.G('blogOpt') first and, only if that element exists, goes on to call _Space.SectionLoad.add('blogOpt', _Space.commentOperate.view).

 

The _Space.SectionLoad.add function is:

                    function add(mod, loadCallback){
                        var mod = baidu.G(mod);
                        sectioinList.push({"mod": mod, "callback":loadCallback});
                    }

So the call amounts to the following (strictly speaking, "mod" holds the element returned by baidu.G('blogOpt') rather than the string itself):

sectioinList.push({"mod": 'blogOpt', "callback": _Space.commentOperate.view});

 

The overall effect is therefore:

baidu.G('blogOpt') && sectioinList.push({"mod": 'blogOpt', "callback": _Space.commentOperate.view});
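The registration pattern above can be modelled in a few lines of Python. This is only an illustrative sketch, not Baidu's code: `dom`, `find_element`, `add_section`, and `section_list` are hypothetical names, and a plain dict stands in for the page's DOM.

```python
# Hypothetical stand-in for the DOM: element id -> element object.
dom = {"blogOpt": {"id": "blogOpt"}}

section_list = []  # plays the role of sectioinList


def find_element(element_id):
    """Rough analogue of baidu.G: return the element, or None if absent."""
    return dom.get(element_id)


def add_section(element_id, load_callback):
    """Rough analogue of _Space.SectionLoad.add: store element and callback."""
    mod = find_element(element_id)
    section_list.append({"mod": mod, "callback": load_callback})


def view(page=0):
    """Placeholder for _Space.commentOperate.view."""
    return "view page %d" % page


# `a and b` short-circuits like JavaScript's `a && b`:
# add_section runs only when the element exists.
find_element("blogOpt") and add_section("blogOpt", view)
find_element("no_such_id") and add_section("no_such_id", view)
```

Only the first call registers anything, because the second lookup returns None and the right-hand side is never evaluated.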

 

The _Space.commentOperate.view function is:

var  view = function(page){
   var  threadId ="5fe2e923cee1f55e93580718";
   page = page || 0;
   var obj={"thread_id_enc":threadId,"callback":"_Space.commentOperate.viewCallBack","start":page*cmtNumPerPage,"count":cmtNumPerPage,"orderby_type":0};
   nowPage = page;
   comment.view(obj);
};
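The way view() derives the paging parameters can be restated in Python. This is a sketch under assumptions: `build_view_params` is a hypothetical helper, and the default of 50 for `per_page` is the cmtNumPerPage value observed on this particular page.

```python
def build_view_params(thread_id, page=0, per_page=50):
    """Mirror the obj literal built by view(); per_page plays cmtNumPerPage."""
    page = page or 0  # same idea as `page = page || 0`
    return {
        "thread_id_enc": thread_id,
        "callback": "_Space.commentOperate.viewCallBack",
        "start": page * per_page,  # offset of the first comment on this page
        "count": per_page,         # number of comments requested
        "orderby_type": 0,
    }


params = build_view_params("5fe2e923cee1f55e93580718")
second_page = build_view_params("5fe2e923cee1f55e93580718", page=1)
```

With these defaults the first page starts at offset 0 and the second at offset 50.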

 

The page source references this script:

http://hi.bdimg.com/static/base/js/conf/comment.js?v=2ccd3457.js

In it you can find the corresponding Space.Pubs.Comment.view function, as follows:

(
    function(c)
    {
        function b(d)
        {
            var e=d||(App.Domain.space+"/cmt/share/");
            this.viewUrl=e+"get_thread?asyn=1&";
            this.addUrl=e+"add_cmt";
            this.delUrl=e+"delete_cmt";
            this.getListContUrl=e+"batch_count?asyn=1&"
        }

        var a;

        b.prototype=
        {
            view        : function(f){
                var e = baidu.url.jsonToQuery(f);
                var d = this.viewUrl+e+"&t="+Math.random();
                baidu.sio.callByBrowser(d)
                },

            addComment  : function(d){a=a?a:new Space.Libs.IframeLoader();a.request(this.addUrl,d)},
            delComment  : function(d){a=a?a:new Space.Libs.IframeLoader();a.request(this.delUrl,d)},
            getListCont : function(f){
                var e = baidu.url.jsonToQuery(f);
                var d = this.getListContUrl+e+"&t="+Math.random();
                baidu.sio.callByBrowser(d)}
        };

        c.Comment=b
    }
)(Space.Pubs);

For what the baidu.url.jsonToQuery function does, see:

http://tangram.baidu.com/api.html#baidu.url.jsonToQuery
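For readers working in Python, the standard library's `urllib.parse.urlencode` does a comparable job of turning a mapping into a query string. The sketch below assumes Python 3 and the values used in this post (page 0, 50 comments per page); whether jsonToQuery escapes special characters exactly the way urlencode does is not verified here.

```python
from urllib.parse import urlencode

# The same key/value pairs that view() passes to comment.view(obj).
obj = {
    "thread_id_enc": "5fe2e923cee1f55e93580718",
    "callback": "_Space.commentOperate.viewCallBack",
    "start": 0,
    "count": 50,
    "orderby_type": 0,
}

# Dicts keep insertion order, so the parameters come out in this order.
query = urlencode(obj)
print(query)
```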

 

So, in the source above:

var  threadId ="5fe2e923cee1f55e93580718";

obj={"thread_id_enc":threadId,"callback":"_Space.commentOperate.viewCallBack","start":page*cmtNumPerPage,"count":cmtNumPerPage,"orderby_type":0};

   ...

   comment.view(obj);

where

page=0

cmtNumPerPage=50

 

baidu.url.jsonToQuery(obj) yields:

thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0

->

e = thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0

And:

d = this.viewUrl+e+"&t="+Math.random()

->

d =

this.viewUrl

+thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0

+"&t="

+Math.random()

viewUrl is defined in the comment.js listed above (note that the e in the constructor is the base URL, not the query string e inside view), so

->

d =

e+"get_thread?asyn=1&"

+thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0

+"&t="

+Math.random()

->

d =

(d||(App.Domain.space+"/cmt/share/")) +"get_thread?asyn=1&"

+thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0

+"&t="

+Math.random()

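The d on the right-hand side is the argument passed to the constructor.

The expression

d||(App.Domain.space+"/cmt/share/")

means: if d is truthy, the result is d; if d is falsy (empty), the result is

(App.Domain.space+"/cmt/share/")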
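Python's `or` operator behaves the same way for this purpose, which is handy when porting the logic. In the sketch below, `comment_base` is a hypothetical helper and `APP_DOMAIN_SPACE` is a placeholder, since the actual value of App.Domain.space is not shown in the source.

```python
# Placeholder only: the real value of App.Domain.space is not known here.
APP_DOMAIN_SPACE = "http://example.invalid"


def comment_base(d=None):
    """Return d when it is truthy, else the default base, like d||(...)."""
    return d or (APP_DOMAIN_SPACE + "/cmt/share/")


explicit = comment_base("http://hi.baidu.com/cmt/spcmt/")
fallback = comment_base("")
```

An empty string or None falls through to the default, while any non-empty base URL is used as given.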

The d passed in here can be seen in the HTML source, and it does have a value:

d

=Session.spSpaceDomain+"/cmt/spcmt/"

="http://hi.baidu.com" + "/cmt/spcmt/"

= "http://hi.baidu.com/cmt/spcmt/"

So here the result is

d="http://hi.baidu.com/cmt/spcmt/"

->

d =

"http://hi.baidu.com/cmt/spcmt/" +"get_thread?asyn=1&"

+thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0

+"&t="

+Math.random()

->

d =

"http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1&"

+thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0

+"&t="

+Math.random()

= http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1&thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0&t=Math.random()

 

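 For Math.random(), see:

http://www.dreamdu.com/javascript/Math.random/

It returns a pseudo-random number between 0 and 1 (0 inclusive, 1 exclusive), for example:

0.8628305946960314

0.23076181972948556

0.009787440411622494

 

The purpose here is to defeat caching: after the first response, a repeated request to an identical URL could be served from cache. Because the random value differs on every call, each request has a unique URL and fetches the latest data from the server.

->

http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1&thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0&t=0.8628305946960314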
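Putting the pieces together, the URL construction can be sketched in Python 3. `build_comment_url` is a hypothetical helper, not part of any library; the base URL and parameter names are the ones derived above, and `random.random()` stands in for Math.random() as the cache-busting value.

```python
import random
from urllib.parse import urlencode

BASE = "http://hi.baidu.com/cmt/spcmt/"  # value of d found in the HTML source


def build_comment_url(thread_id, page=0, per_page=50, callback=None, base=BASE):
    """Assemble the get_thread URL the same way Comment.view does."""
    params = {"thread_id_enc": thread_id}
    if callback:  # omit the parameter entirely to get plain JSON
        params["callback"] = callback
    params.update({
        "start": page * per_page,
        "count": per_page,
        "orderby_type": 0,
    })
    # A fresh random t makes every URL unique, so no cached copy is reused.
    return base + "get_thread?asyn=1&" + urlencode(params) + "&t=" + str(random.random())


url = build_comment_url(
    "5fe2e923cee1f55e93580718",
    callback="_Space.commentOperate.viewCallBack",
)
```

Leaving `callback` unset produces the variant without the callback parameter, which is the form a script would normally request.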

 

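Opening the address above in a browser yields a get_thread.js file containing the following data:

(The original data was unformatted; I have tidied it by hand to make its structure easier to read.)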

_Space.commentOperate.viewCallBack
(
{
"err_no":0,
"err_msg":"success",
"total_count":"3",
"response_count":3,
"err_desc":"success",
"body":
{
    "total_count":3,
    "real_ret_count":3,
    "data":
    [
    {
        "reply_count":0,
        "score":0,
        "favor":0,
        "is_top":0,
        "like_count":0,
        "dislike_count":0,
        "create_time":"1309433893",
        "user_id":"240697217",
        "user_name":"cyansala",
        "user_ip":"112.227.93.191",
        "area":"",
        "title":"",
        "content":"<span >= =抱歉,你已经太久没更新博客了……</span>",
        "reserved1":0,
        "reserved2":0,
        "mdatetime":"1309433893",
        "cdatetime":1309433893,
        "un":"cyansala",
        "reply_id_enc":"0fb30f242e9e5f3cc895592c",
        "thread_id_enc":"6c81800a0a6d1f2eb1351da1",
        "parent_id_enc":"ab64034f78f0f736afc3ab64",
        "portrait":"81bf6379616e73616c61580e",
        "sexy_time":"6-30 19:38"
    },

    {
        "reply_count":0,
        "score":0,
        "favor":0,
        "is_top":0,
        "like_count":0,
        "dislike_count":0,
        "create_time":"1323067492",
        "user_id":"39390080",
        "user_name":"againinput6",
        "user_ip":"58.240.236.19",
        "area":"",
        "title":"",
        "content":"回郜yansala:新博客:http://blog.163.com/fun_everyday/blog/#m=0,中“music”分类下,有最新推荐的一些歌。",
        "reserved1":0,
        "reserved2":0,
        "mdatetime":"1323067492",
        "cdatetime":1323067492,
        "un":"againinput6",
        "reply_id_enc":"82025aaf538f4febfbed50f4",
        "thread_id_enc":"6c81800a0a6d1f2eb1351da1",
        "parent_id_enc":"ab64034f78f0f736afc3ab64",
        "portrait":"800b616761696e696e707574365902",
        "sexy_time":"12-5 14:44"
    },

    {
        "reply_count":0,
        "score":0,
        "favor":0,
        "is_top":0,
        "like_count":0,
        "dislike_count":0,
        "create_time":"1323150747",
        "user_id":"240697217",
        "user_name":"cyansala",
        "user_ip":"112.227.85.184",
        "area":"",
        "title":"",
        "content":"搬?去了呀<img src=\"http://img.baidu.com/hi/jx2/j_0002.gif\">163挺好的<br>",
        "reserved1":0,
        "reserved2":0,
        "mdatetime":"1323150747",
        "cdatetime":1323150747,
        "un":"cyansala",
        "reply_id_enc":"f8dcd1007383c801728b6556",
        "thread_id_enc":"6c81800a0a6d1f2eb1351da1",
        "parent_id_enc":"ab64034f78f0f736afc3ab64",
        "portrait":"81bf6379616e73616c61580e",
        "sexy_time":"12-6 13:52"
    }
    ]
}

}
);
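The response above is JSON wrapped in a call to the callback function (the JSONP pattern). If a script does request the URL with the callback parameter, the wrapper has to be removed before parsing. The following is a minimal Python 3 sketch; `parse_jsonp` is a hypothetical helper, and the sample string is a shortened stand-in for the real response.

```python
import json
import re

# callback name, "(", payload, ")", optional ";" and surrounding whitespace
_JSONP_RE = re.compile(r"^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(text):
    """Strip a callback(...) wrapper and parse the JSON payload inside it."""
    match = _JSONP_RE.match(text)
    if not match:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Shortened sample, not the full server response.
sample = '_Space.commentOperate.viewCallBack({"err_no":0,"total_count":"3","body":{"data":[]}});'
data = parse_jsonp(sample)
```

The greedy `(.*)` runs to the last closing parenthesis, so parentheses inside comment text do not cut the payload short.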

Correspondingly, after dropping the callback parameter, run the following code in Python:

# Python 2; assumes the BeautifulSoup 3 package
import logging
import urllib2
from BeautifulSoup import BeautifulSoup

# URL with the callback parameter, kept for reference:
#cmt_url = "http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1&thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0&t=0.2307618197294"
cmt_url = "http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1&thread_id_enc=5fe2e923cee1f55e93580718&start=0&count=50&orderby_type=0&t=0.2307618197294"
cmt_req = urllib2.Request(cmt_url)
cmt_page = urllib2.build_opener().open(cmt_req).read()
cmt_soup = BeautifulSoup(cmt_page)
logging.debug("Got comment\n---------------\n%s", cmt_soup)
logging.debug("---------------\n")

 

The printed data is as follows:

{"err_no":0,"err_msg":"success","total_count":"3","response_count":3,"err_desc":"success","body":{"total_count":3,"real_ret_count":3,"data":[{"reply_count":0,"score":0,"favor":0,"is_top":0,"like_count":0,"dislike_count":0,"create_time":"1309433893","user_id":"240697217","user_name":"cyansala","user_ip":"112.227.93.191","area":"","title":"","content":"<span>= =抱歉,你已经太久没更新博客了……</span>","reserved1":0,"reserved2":0,"mdatetime":"1309433893","cdatetime":1309433893,"un":"cyansala","reply_id_enc":"0fb30f242e9e5f3cc895592c","thread_id_enc":"6c81800a0a6d1f2eb1351da1","parent_id_enc":"ab64034f78f0f736afc3ab64","portrait":"81bf6379616e73616c61580e","sexy_time":"6-30 19:38"},{"reply_count":0,"score":0,"favor":0,"is_top":0,"like_count":0,"dislike_count":0,"create_time":"1323067492","user_id":"39390080","user_name":"againinput6","user_ip":"58.240.236.19","area":"","title":"","content":"回郜yansala:新博客:http://blog.163.com/fun_everyday/blog/#m=0,中“music”分类下,有最新推荐的一些歌。","reserved1":0,"reserved2":0,"mdatetime":"1323067492","cdatetime":1323067492,"un":"againinput6","reply_id_enc":"82025aaf538f4febfbed50f4","thread_id_enc":"6c81800a0a6d1f2eb1351da1","parent_id_enc":"ab64034f78f0f736afc3ab64","portrait":"800b616761696e696e707574365902","sexy_time":"12-5 14:44"},{"reply_count":0,"score":0,"favor":0,"is_top":0,"like_count":0,"dislike_count":0,"create_time":"1323150747","user_id":"240697217","user_name":"cyansala","user_ip":"112.227.85.184","area":"","title":"","content":"搬?去了呀<img src="" />163挺好的<br />","reserved1":0,"reserved2":0,"mdatetime":"1323150747","cdatetime":1323150747,"un":"cyansala","reply_id_enc":"f8dcd1007383c801728b6556","thread_id_enc":"6c81800a0a6d1f2eb1351da1","parent_id_enc":"ab64034f78f0f736afc3ab64","portrait":"81bf6379616e73616c61580e","sexy_time":"12-6 13:52"}]}}

</span>

 

To make the data structure easier to see, here it is after manual formatting:

{
    "err_no":0,
    "err_msg":"success",
    "total_count":"3",
    "response_count":3,
    "err_desc":"success",
    "body":
    {
        "total_count":3,
        "real_ret_count":3,
        "data":
        [
            {
            "reply_count":0,
            "score":0,
            "favor":0,
            "is_top":0,
            "like_count":0,
            "dislike_count":0,
            "create_time":"1309433893",
            "user_id":"240697217",
            "user_name":"cyansala",
            "user_ip":"112.227.93.191",
            "area":"",
            "title":"",
            "content":"<span>= =抱歉,你已经太久没更新博客了……</span>",
            "reserved1":0,
            "reserved2":0,
            "mdatetime":"1309433893",
            "cdatetime":1309433893,
            "un":"cyansala",
            "reply_id_enc":"0fb30f242e9e5f3cc895592c",
            "thread_id_enc":"6c81800a0a6d1f2eb1351da1",
            "parent_id_enc":"ab64034f78f0f736afc3ab64",
            "portrait":"81bf6379616e73616c61580e",
            "sexy_time":"6-30 19:38"
            },

            {
            "reply_count":0,
            "score":0,
            "favor":0,
            "is_top":0,
            "like_count":0,
            "dislike_count":0,
            "create_time":"1323067492",
            "user_id":"39390080",
            "user_name":"againinput6",
            "user_ip":"58.240.236.19",
            "area":"",
            "title":"",
            "content":"回郜yansala:新博客:http://blog.163.com/fun_everyday/blog/#m=0,中“music”分类下,有最新推荐的一些歌。",
            "reserved1":0,
            "reserved2":0,
            "mdatetime":"1323067492",
            "cdatetime":1323067492,
            "un":"againinput6",
            "reply_id_enc":"82025aaf538f4febfbed50f4",
            "thread_id_enc":"6c81800a0a6d1f2eb1351da1",
            "parent_id_enc":"ab64034f78f0f736afc3ab64",
            "portrait":"800b616761696e696e707574365902",
            "sexy_time":"12-5 14:44"
            },

            {
            "reply_count":0,
            "score":0,
            "favor":0,
            "is_top":0,
            "like_count":0,
            "dislike_count":0,
            "create_time":"1323150747",
            "user_id":"240697217",
            "user_name":"cyansala",
            "user_ip":"112.227.85.184",
            "area":"",
            "title":"",
            "content":"搬?去了呀<img src="" />163挺好的<br />",
            "reserved1":0,
            "reserved2":0,
            "mdatetime":"1323150747",
            "cdatetime":1323150747,
            "un":"cyansala",
            "reply_id_enc":"f8dcd1007383c801728b6556",
            "thread_id_enc":"6c81800a0a6d1f2eb1351da1",
            "parent_id_enc":"ab64034f78f0f736afc3ab64",
            "portrait":"81bf6379616e73616c61580e",
            "sexy_time":"12-6 13:52"
            }

        ]
    }
}

</span>

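As you can see, BeautifulSoup misjudges the content: it treats the JSON text as HTML and appends a stray </span> at the end. It also rewrites the HTML embedded in the comments (the <img> tag in the third comment loses its src value, and <br> becomes <br />).

So there is no need to call BeautifulSoup here. The response body itself can be kept for later processing to extract the details of each comment, including the important fields such as author, IP, and content.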
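As a sketch of that later processing, the response text can be handed straight to Python's `json` module. The example assumes Python 3; `extract_comments` is a hypothetical helper, the sample is a trimmed stand-in for the real response, and treating create_time as Unix epoch seconds is an assumption (it is consistent with the sexy_time values if those are shown in UTC+8).

```python
import json
from datetime import datetime, timezone

# Trimmed sample with only the fields used below; not the full response.
raw = """{"err_no":0,"body":{"total_count":2,"data":[
 {"user_name":"cyansala","user_ip":"112.227.93.191",
  "create_time":"1309433893","content":"<span>= =</span>"},
 {"user_name":"againinput6","user_ip":"58.240.236.19",
  "create_time":"1323067492","content":"..."}
]}}"""


def extract_comments(response_text):
    """Return author, IP, content, and creation time for each comment."""
    reply = json.loads(response_text)
    if reply.get("err_no") != 0:
        raise RuntimeError("server reported an error: %r" % reply.get("err_msg"))
    comments = []
    for item in reply["body"]["data"]:
        # Assumption: create_time is seconds since the Unix epoch.
        created = datetime.fromtimestamp(int(item["create_time"]), tz=timezone.utc)
        comments.append({
            "author": item["user_name"],
            "ip": item["user_ip"],
            "content": item["content"],
            "created_utc": created,
        })
    return comments


comments = extract_comments(raw)
for c in comments:
    print(c["author"], c["ip"], c["created_utc"].isoformat())
```

Parsing with `json.loads` leaves the HTML inside each content field untouched, so it can be cleaned up separately if needed.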

 
