How to analyze the HTML source of Baidu Space to find out how it fetches the comments of a blog post v2011-12-19
For the steps that came before analyzing the HTML source, see:
[Implemented] Fetching posts, comments and pictures from Baidu Space with a Python script v2011-12-19
http://againinput4.blog.163.com/blog/static/1727994912011111153833643/
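This analysis uses the following Baidu Space blog post as its example:
http://hi.baidu.com/recommend_music/blog/item/5fe2e923cee1f55e93580718.html
(Note: for now the page is viewed without logging in.)
The goal is to work out how Baidu Space uses AJAX and JavaScript to fetch the comments of a post.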
1 Overall page framework
<script>
…
/*******Start Working!*******/
var _pageEndTime = (new Date().getTime()) - g_pageTimerStart;
baidu.on(window, 'load', function(){
    var _pageLoadTime = (new Date().getTime()) - g_pageTimerStart;
    BdUtil.hi_trackerLink('m_20101109_blogloadtime', _pageLoadTime + '|' + _pageEndTime);
});
/**********Step2***********/
_Space.pageDone.ok();
baidu.G('share_user_list') && _Space.SectionLoad.add('share_user_list', _Space.BlogSharedUser.getData);
baidu.G('in_related_tmp') && _Space.RelatedDoc && _Space.SectionLoad.add('in_related_tmp', _Space.RelatedDoc.show);
baidu.G('in_reader') && _Space.SectionLoad.add('in_reader', _Space.latestReader.show);
baidu.G('blogOpt') && _Space.SectionLoad.add('blogOpt', _Space.commentOperate.view);
</script>
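This script first loads the page; once loading has finished (pageDone), it starts loading the individual sections:
share_user_list: the users who have shared this page
in_related_tmp: the articles related to this page
in_reader: the readers of this post
blogOpt: presumably the blog's "optional" content; here it covers the remaining optional parts, namely the "Comments" (网友评论) and "Post a comment" (发表评论) areas.
blogOpt is the part we want to analyze.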
2 blogOpt
baidu.G('blogOpt') && _Space.SectionLoad.add('blogOpt', _Space.commentOperate.view);
This line evaluates baidu.G('blogOpt') and, because of the &&, calls _Space.SectionLoad.add('blogOpt', _Space.commentOperate.view) only when that lookup returns an element.
The SectionLoad.add function of _Space is:
function add(mod, loadCallback){
    var mod = baidu.G(mod);
    sectioinList.push({"mod": mod, "callback": loadCallback});
}
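So what gets executed is (note that mod is the element returned by baidu.G, not the id string):
sectioinList.push({"mod": baidu.G('blogOpt'), "callback": _Space.commentOperate.view});
So the overall result is:
baidu.G('blogOpt') && sectioinList.push({"mod": baidu.G('blogOpt'), "callback": _Space.commentOperate.view});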
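The registration pattern above can be modelled in a few lines of Python. This is only a sketch of the idea, not Baidu's code: page_elements, get_element, section_list and the view stand-in are all hypothetical names, and the point at which the real loader invokes the stored callbacks is not shown in the source, so the last line just calls them directly.

```python
# Hypothetical model of the SectionLoad registration pattern.
# None of these names exist in Baidu's code.
page_elements = {"blogOpt": "<div id='blogOpt'>"}  # pretend DOM: id -> element
section_list = []


def get_element(element_id):
    """Stand-in for baidu.G: return the element, or None if the id is missing."""
    return page_elements.get(element_id)


def add(mod, load_callback):
    """Stand-in for SectionLoad.add: store the element and its loader callback."""
    section_list.append({"mod": get_element(mod), "callback": load_callback})


def view(page=0):
    """Stand-in for _Space.commentOperate.view."""
    return "load comments, page %d" % page


# Python's `a and b` short-circuits like JavaScript's `a && b`.
get_element("blogOpt") and add("blogOpt", view)
get_element("in_reader") and add("in_reader", view)  # id missing, so add() is skipped

# Assumed later step: the loader calls every registered callback.
results = [entry["callback"]() for entry in section_list]
```

Only the section whose element exists on the page ends up in the list, which is why the same script can be served for pages with different layouts.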
The view function of _Space.commentOperate is:
var view = function(page){
    var threadId = "5fe2e923cee1f55e93580718";
    page = page || 0;
    var obj = {"thread_id_enc": threadId, "callback": "_Space.commentOperate.viewCallBack", "start": page*cmtNumPerPage, "count": cmtNumPerPage, "orderby_type": 0};
    nowPage = page;
    comment.view(obj);
};
The part that matters here is the parameter object:
var obj = {"thread_id_enc": threadId, "callback": "_Space.commentOperate.viewCallBack", "start": page*cmtNumPerPage, "count": cmtNumPerPage, "orderby_type": 0};
The page source references this script:
http://hi.bdimg.com/static/base/js/conf/comment.js?v=2ccd3457.js
It contains the corresponding view function of Space.Pubs.Comment:
(function(c) {
    function b(d) {
        var e = d || (App.Domain.space + "/cmt/share/");
        this.viewUrl = e + "get_thread?asyn=1&";
        this.addUrl = e + "add_cmt";
        this.delUrl = e + "delete_cmt";
        this.getListContUrl = e + "batch_count?asyn=1&"
    }
    var a;
    b.prototype = {
        view: function(f) {
            var e = baidu.url.jsonToQuery(f);
            var d = this.viewUrl + e + "&t=" + Math.random();
            baidu.sio.callByBrowser(d)
        },
        addComment: function(d) { a = a ? a : new Space.Libs.IframeLoader(); a.request(this.addUrl, d) },
        delComment: function(d) { a = a ? a : new Space.Libs.IframeLoader(); a.request(this.delUrl, d) },
        getListCont: function(f) {
            var e = baidu.url.jsonToQuery(f);
            var d = this.getListContUrl + e + "&t=" + Math.random();
            baidu.sio.callByBrowser(d)
        }
    };
    c.Comment = b
})(Space.Pubs);
For what baidu.url.jsonToQuery does, see:
http://tangram.baidu.com/api.html#baidu.url.jsonToQuery
So, in the source above:
var threadId = "5fe2e923cee1f55e93580718";
…
obj = {"thread_id_enc": threadId, "callback": "_Space.commentOperate.viewCallBack", "start": page*cmtNumPerPage, "count": cmtNumPerPage, "orderby_type": 0};
…
comment.view(obj);
with
page=0
cmtNumPerPage=50
jsonToQuery(obj) produces:
thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0
->
e = thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0
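The same query string can be reproduced in Python. This is a minimal sketch in Python 3, using urllib.parse.urlencode as a stand-in for baidu.url.jsonToQuery; the two may escape special characters differently, but for the plain values used here the output matches. comment_params is a hypothetical helper name, and the default of 50 per page is the cmtNumPerPage value observed above.

```python
from urllib.parse import urlencode


def comment_params(thread_id, page=0, per_page=50):
    """Mirror the obj built in view(): start is the offset of the first comment."""
    return {
        "thread_id_enc": thread_id,
        "callback": "_Space.commentOperate.viewCallBack",
        "start": page * per_page,
        "count": per_page,
        "orderby_type": 0,
    }


# First page of comments for the example post.
query = urlencode(comment_params("5fe2e923cee1f55e93580718"))
print(query)

# The second page simply shifts the start offset by one page size.
second_page = comment_params("5fe2e923cee1f55e93580718", page=1)
```

Paging is therefore just a matter of changing start, so a script can walk through all comments by increasing page until it has collected total_count items.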
and:
d = this.viewUrl+e+"&t="+Math.random()
->
d =
this.viewUrl
+thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0
+"&t="
+Math.random()
viewUrl is defined in the comment.js listed above. Note that the e in the next step is the base-URL variable e inside the constructor b, not the query string e inside view:
->
d =
e+"get_thread?asyn=1&"
+thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0
+"&t="
+Math.random()
->
d =
(d||(App.Domain.space+"/cmt/share/")) +"get_thread?asyn=1&"
+thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0
+"&t="
+Math.random()
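The d on the right-hand side is the argument passed to the constructor b (not the URL variable d inside view).
The expression
d||(App.Domain.space+"/cmt/share/")
means: if d is truthy, the result is d; if d is falsy (empty), the result is
(App.Domain.space+"/cmt/share/")
Here, the d argument that is passed in can be seen in the HTML source, and it does have a value:
d
= Session.spSpaceDomain+"/cmt/spcmt/"
= "http://hi.baidu.com" + "/cmt/spcmt/"
= "http://hi.baidu.com/cmt/spcmt/"
So in this case the result is
d="http://hi.baidu.com/cmt/spcmt/"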
->
d =
"http://hi.baidu.com/cmt/spcmt/" +"get_thread?asyn=1&"
+thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0
+"&t="
+Math.random()
->
d =
"http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1&"
+thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0
+"&t="
+Math.random()
= http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1&thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0&t=Math.random()
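The fallback logic and the string concatenation traced above can be sketched in Python as follows. make_view_url is a hypothetical helper, and the default base uses a placeholder host because the actual value of App.Domain.space does not appear in the source; only the explicit base http://hi.baidu.com/cmt/spcmt/ is taken from the page.

```python
def make_view_url(base=None, default_base="http://example.invalid/cmt/share/"):
    """Build viewUrl the way the constructor b does.

    `base or default_base` behaves like JavaScript's `d || (...)` for the
    values that matter here: None and "" both fall back to the default.
    The default host is a placeholder, not Baidu's real App.Domain.space.
    """
    prefix = base or default_base
    return prefix + "get_thread?asyn=1&"


query = ("thread_id_enc=5fe2e923cee1f55e93580718"
         "&callback=_Space.commentOperate.viewCallBack"
         "&start=0&count=50&orderby_type=0")

# The page passes an explicit base, so the default is never used.
url = make_view_url("http://hi.baidu.com/cmt/spcmt/") + query

# With an empty argument the default base would be used instead.
fallback = make_view_url("")
```

Only the random t parameter is still missing from url at this point.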
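As for Math.random(), see:
http://www.dreamdu.com/javascript/Math.random/
It returns a pseudo-random number in the range [0, 1), that is, from 0 up to but not including 1, for example:
0.8628305946960314
0.23076181972948556
0.009787440411622494
Its purpose here is to defeat caching. Without it, a repeated call after the first response could be answered from the cache. Because random() returns a different value each time, every request URL is different, so the data always comes fresh from the server.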
->
http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1&thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0&t=0.8628305946960314
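A script that fetches comments repeatedly can apply the same cache-busting trick. The sketch below is an illustration in Python 3, with add_cache_buster as a hypothetical helper name; random.random() returns a float in [0, 1), like Math.random() in JavaScript.

```python
import random
from urllib.parse import parse_qs, urlsplit


def add_cache_buster(url):
    """Append t=<random float in [0, 1)> so that each request URL differs."""
    return url + "&t=" + str(random.random())


base = ("http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1"
        "&thread_id_enc=5fe2e923cee1f55e93580718"
        "&start=0&count=50&orderby_type=0")

first = add_cache_buster(base)
second = add_cache_buster(base)

# Read the t parameter back out of the finished URL.
t_value = float(parse_qs(urlsplit(first).query)["t"][0])
```

The server does not appear to use the value of t for anything; as far as this analysis shows, it only has to change from one request to the next.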
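Opening this URL in a browser returns a file named get_thread.js that contains the following data:
(The original data was unformatted; I reformatted it by hand so that the structure is easier to read.)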
_Space.commentOperate.viewCallBack (
{
  "err_no":0, "err_msg":"success", "total_count":"3", "response_count":3, "err_desc":"success",
  "body": {
    "total_count":3, "real_ret_count":3,
    "data": [
      { "reply_count":0, "score":0, "favor":0, "is_top":0, "like_count":0, "dislike_count":0,
        "create_time":"1309433893", "user_id":"240697217", "user_name":"cyansala", "user_ip":"112.227.93.191",
        "area":"", "title":"",
        "content":"<span >= =抱歉,你已经太久没更新博客了……</span>",
        "reserved1":0, "reserved2":0, "mdatetime":"1309433893", "cdatetime":1309433893, "un":"cyansala",
        "reply_id_enc":"0fb30f242e9e5f3cc895592c", "thread_id_enc":"6c81800a0a6d1f2eb1351da1", "parent_id_enc":"ab64034f78f0f736afc3ab64",
        "portrait":"81bf6379616e73616c61580e", "sexy_time":"6-30 19:38" },
      { "reply_count":0, "score":0, "favor":0, "is_top":0, "like_count":0, "dislike_count":0,
        "create_time":"1323067492", "user_id":"39390080", "user_name":"againinput6", "user_ip":"58.240.236.19",
        "area":"", "title":"",
        "content":"回郜yansala:新博客:http://blog.163.com/fun_everyday/blog/#m=0,中“music”分类下,有最新推荐的一些歌。",
        "reserved1":0, "reserved2":0, "mdatetime":"1323067492", "cdatetime":1323067492, "un":"againinput6",
        "reply_id_enc":"82025aaf538f4febfbed50f4", "thread_id_enc":"6c81800a0a6d1f2eb1351da1", "parent_id_enc":"ab64034f78f0f736afc3ab64",
        "portrait":"800b616761696e696e707574365902", "sexy_time":"12-5 14:44" },
      { "reply_count":0, "score":0, "favor":0, "is_top":0, "like_count":0, "dislike_count":0,
        "create_time":"1323150747", "user_id":"240697217", "user_name":"cyansala", "user_ip":"112.227.85.184",
        "area":"", "title":"",
        "content":"搬?去了呀<img src=\"http://img.baidu.com/hi/jx2/j_0002.gif\">163挺好的<br>",
        "reserved1":0, "reserved2":0, "mdatetime":"1323150747", "cdatetime":1323150747, "un":"cyansala",
        "reply_id_enc":"f8dcd1007383c801728b6556", "thread_id_enc":"6c81800a0a6d1f2eb1351da1", "parent_id_enc":"ab64034f78f0f736afc3ab64",
        "portrait":"81bf6379616e73616c61580e", "sexy_time":"12-6 13:52" }
    ]
  }
}
);
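If the callback parameter is left in the URL, the response is wrapped in a function call as shown above, and the wrapper has to be removed before the data can be parsed as JSON. The following is a minimal sketch in Python 3; strip_jsonp is a hypothetical helper, and sample is a heavily trimmed copy of the response above, not the full data.

```python
import json


def strip_jsonp(text):
    """Return the JSON text between the first '(' and the last ')'."""
    start = text.index("(")
    end = text.rindex(")")
    return text[start + 1:end]


# Trimmed sample in the same shape as the real response.
sample = ('_Space.commentOperate.viewCallBack ('
          '{"err_no":0,"err_msg":"success","total_count":"3",'
          '"body":{"total_count":3,"data":['
          '{"user_name":"cyansala","sexy_time":"6-30 19:38"}]}}'
          ');')

data = json.loads(strip_jsonp(sample))
first_author = data["body"]["data"][0]["user_name"]
```

One detail worth noticing in the response: the top-level total_count is a string ("3") while body.total_count is a number (3), so code that compares counts should convert one of them first.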
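Correspondingly, after removing the callback parameter from the URL, run the following code in Python:
import logging
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

# URL with the callback parameter, kept for comparison:
#cmt_url = "http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1&thread_id_enc=5fe2e923cee1f55e93580718&callback=_Space.commentOperate.viewCallBack&start=0&count=50&orderby_type=0&t=0.2307618197294"
# URL without the callback parameter:
cmt_url = "http://hi.baidu.com/cmt/spcmt/get_thread?asyn=1&thread_id_enc=5fe2e923cee1f55e93580718&start=0&count=50&orderby_type=0&t=0.2307618197294"
cmt_req = urllib2.Request(cmt_url)
cmt_page = urllib2.build_opener().open(cmt_req).read()
cmt_soup = BeautifulSoup(cmt_page)
logging.debug("Got comment\n---------------\n%s", cmt_soup)
logging.debug("---------------\n")
The printed data is: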
{"err_no":0,"err_msg":"success","total_count":"3","response_count":3,"err_desc":"success","body":{"total_count":3,"real_ret_count":3,"data":[{"reply_count":0,"score":0,"favor":0,"is_top":0,"like_count":0,"dislike_count":0,"create_time":"1309433893","user_id":"240697217","user_name":"cyansala","user_ip":"112.227.93.191","area":"","title":"","content":"<span>= =抱歉,你已经太久没更新博客了……</span>","reserved1":0,"reserved2":0,"mdatetime":"1309433893","cdatetime":1309433893,"un":"cyansala","reply_id_enc":"0fb30f242e9e5f3cc895592c","thread_id_enc":"6c81800a0a6d1f2eb1351da1","parent_id_enc":"ab64034f78f0f736afc3ab64","portrait":"81bf6379616e73616c61580e","sexy_time":"6-30 19:38"},{"reply_count":0,"score":0,"favor":0,"is_top":0,"like_count":0,"dislike_count":0,"create_time":"1323067492","user_id":"39390080","user_name":"againinput6","user_ip":"58.240.236.19","area":"","title":"","content":"回郜yansala:新博客:http://blog.163.com/fun_everyday/blog/#m=0,中“music”分类下,有最新推荐的一些歌。","reserved1":0,"reserved2":0,"mdatetime":"1323067492","cdatetime":1323067492,"un":"againinput6","reply_id_enc":"82025aaf538f4febfbed50f4","thread_id_enc":"6c81800a0a6d1f2eb1351da1","parent_id_enc":"ab64034f78f0f736afc3ab64","portrait":"800b616761696e696e707574365902","sexy_time":"12-5 14:44"},{"reply_count":0,"score":0,"favor":0,"is_top":0,"like_count":0,"dislike_count":0,"create_time":"1323150747","user_id":"240697217","user_name":"cyansala","user_ip":"112.227.85.184","area":"","title":"","content":"搬?去了呀<img src="" />163挺好的<br />","reserved1":0,"reserved2":0,"mdatetime":"1323150747","cdatetime":1323150747,"un":"cyansala","reply_id_enc":"f8dcd1007383c801728b6556","thread_id_enc":"6c81800a0a6d1f2eb1351da1","parent_id_enc":"ab64034f78f0f736afc3ab64","portrait":"81bf6379616e73616c61580e","sexy_time":"12-6 13:52"}]}} </span>
To make the data structure easier to see, here it is after manual formatting:
{
  "err_no":0, "err_msg":"success", "total_count":"3", "response_count":3, "err_desc":"success",
  "body": {
    "total_count":3, "real_ret_count":3,
    "data": [
      { "reply_count":0, "score":0, "favor":0, "is_top":0, "like_count":0, "dislike_count":0,
        "create_time":"1309433893", "user_id":"240697217", "user_name":"cyansala", "user_ip":"112.227.93.191",
        "area":"", "title":"",
        "content":"<span>= =抱歉,你已经太久没更新博客了……</span>",
        "reserved1":0, "reserved2":0, "mdatetime":"1309433893", "cdatetime":1309433893, "un":"cyansala",
        "reply_id_enc":"0fb30f242e9e5f3cc895592c", "thread_id_enc":"6c81800a0a6d1f2eb1351da1", "parent_id_enc":"ab64034f78f0f736afc3ab64",
        "portrait":"81bf6379616e73616c61580e", "sexy_time":"6-30 19:38" },
      { "reply_count":0, "score":0, "favor":0, "is_top":0, "like_count":0, "dislike_count":0,
        "create_time":"1323067492", "user_id":"39390080", "user_name":"againinput6", "user_ip":"58.240.236.19",
        "area":"", "title":"",
        "content":"回郜yansala:新博客:http://blog.163.com/fun_everyday/blog/#m=0,中“music”分类下,有最新推荐的一些歌。",
        "reserved1":0, "reserved2":0, "mdatetime":"1323067492", "cdatetime":1323067492, "un":"againinput6",
        "reply_id_enc":"82025aaf538f4febfbed50f4", "thread_id_enc":"6c81800a0a6d1f2eb1351da1", "parent_id_enc":"ab64034f78f0f736afc3ab64",
        "portrait":"800b616761696e696e707574365902", "sexy_time":"12-5 14:44" },
      { "reply_count":0, "score":0, "favor":0, "is_top":0, "like_count":0, "dislike_count":0,
        "create_time":"1323150747", "user_id":"240697217", "user_name":"cyansala", "user_ip":"112.227.85.184",
        "area":"", "title":"",
        "content":"搬?去了呀<img src="" />163挺好的<br />",
        "reserved1":0, "reserved2":0, "mdatetime":"1323150747", "cdatetime":1323150747, "un":"cyansala",
        "reply_id_enc":"f8dcd1007383c801728b6556", "thread_id_enc":"6c81800a0a6d1f2eb1351da1", "parent_id_enc":"ab64034f78f0f736afc3ab64",
        "portrait":"81bf6379616e73616c61580e", "sexy_time":"12-6 13:52" }
    ]
  }
} </span>
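As you can see, BeautifulSoup misjudged the content and appended a </span> to the end of the returned data. It also rewrote the HTML inside the comments (the src of the <img> tag is now empty and <br> became <br />).
So BeautifulSoup should not be called here. The content returned by the server can be kept as it is for later processing, to extract the details of each comment, including the important fields such as author, IP and content.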
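One way to do that later processing is to parse the raw response with the standard json module. The sketch below is written in Python 3 and runs on a trimmed sample built from two of the comments above rather than on a live response. extract_comments is a hypothetical helper, and the UTC+8 offset is an assumption that happens to agree with the sexy_time values in the data (6-30 19:38 and 12-6 13:52).

```python
import json
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumption: timestamps are shown in UTC+8

# Trimmed sample in the shape of the real response (only the fields used below).
sample = {
    "err_no": 0,
    "err_msg": "success",
    "body": {
        "total_count": 2,
        "data": [
            {"create_time": "1309433893", "user_name": "cyansala",
             "user_ip": "112.227.93.191",
             "content": "<span >= =抱歉,你已经太久没更新博客了……</span>"},
            {"create_time": "1323150747", "user_name": "cyansala",
             "user_ip": "112.227.85.184",
             "content": "搬?去了呀<img src=\"http://img.baidu.com/hi/jx2/j_0002.gif\">163挺好的<br>"},
        ],
    },
}
raw = json.dumps(sample)  # stands in for the text returned by the server


def extract_comments(raw_json):
    """Parse the response text and keep author, ip, content and creation time."""
    payload = json.loads(raw_json)
    if payload["err_no"] != 0:
        raise ValueError(payload.get("err_msg", "unknown error"))
    comments = []
    for item in payload["body"]["data"]:
        # create_time is a Unix timestamp stored as a string.
        created = datetime.fromtimestamp(int(item["create_time"]), tz=CST)
        comments.append({
            "author": item["user_name"],
            "ip": item["user_ip"],
            "content": item["content"],
            "created": created.strftime("%Y-%m-%d %H:%M"),
        })
    return comments


comments = extract_comments(raw)
```

Because json.loads treats the comment HTML as an ordinary string, the <img> and <br> tags inside content come through unchanged, which avoids the rewriting seen with BeautifulSoup.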