[Temporarily solved] Using a circumvention proxy in PySpider to open pages blocked by the firewall

Author: crifan
Background:
[Record] Crawling Scholastic picture-book data with PySpider
While working on that, loading the page occasionally failed and returned no data:
[I 181010 15:45:25 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 181010 15:46:22 tornado_fetcher:188] [200] ScholasticStorybook:data:,on_start data:,on_start 0s
[I 181010 15:46:25 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
console: Error, missing Report Suite ID in AppMeasurement initialization
Error: Unexpected token '}'
Function@[native code]
compile@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:213:122
parse@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:238:288
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:117:343
$watch@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:127:350
link@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:168:435
ea@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:73:294
D@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:62:192
g@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:55:106
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:54:250
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:56:80
k@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:60:378
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:254:336
$digest@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:131:151
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
$apply@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:134:85
g@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:87:450
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
T@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:92:51
onload@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:93:79
 <a href="#" class="s-page back-to-top-link pagination-arrow-right ng-scope" ng-if="!SearchResults.finalPage" ng-click="!SearchResults.loading &amp;&amp; SearchResults.loadNextPage();pageChange();" ng-class="{'disabled':SearchResults.finalPage, 'disabled-link':SearchResults.finalPage,'loading':SearchResults.loading,'disabled-link':SearchResults.actualPage==SearchResults.lastPage}" target="_self">

https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:107
console: [object Object]
Request error: #65 [5=Operation canceled]
https://scholasticinc.tt.omtrdc.net/m2/scholasticinc/mbox/json?mbox=target-global-mbox&mboxSession=b119c34814ce4b5f8b3a2795c8c09526&mboxPC=&mboxPage=5b873fda78cd40208a278c1e41c26ac9&mboxVersion=1.1.0&mboxCount=1&mboxTime=1539186388962&mboxHost=www.scholastic.com&mboxURL=https%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&mboxReferrer=&browserHeight=2304&browserWidth=1024&browserTimeOffset=480&screenHeight=768&screenWidth=1024&colorDepth=32&vst.trk=stats.scholastic.com&vst.trks=sstats.scholastic.com&mboxMCSDID=6EE7248D918315FD-6DFE09D5B6547ECF&teachersbetaCutover=false&SPS_ID=not+logged+in
console: AT: [getOffer()] request failed [object Object]
console: AT: Rendering mbox failed target-global-mbox error timeout
[304] https://www.scholastic.com/teachers/bookwizard/ 20.146
[I 181010 15:46:44 tornado_fetcher:520] [304] ScholasticStorybook:34b1c45f09fa84805dd1697c1809e8c9 https://www.scholastic.com/teachers/bookwizard/ 20.15s
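Occasionally the page did load.
A browser with the circumvention proxy enabled opens the page without any problem.
So I wanted to add a proxy and see whether that makes the page open reliably every time.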
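pyspider 添加代理 (pyspider add proxy)
Problems with using a proxy in pyspider – SegmentFault 思否
It looks like a proxy can be set directly on crawl?
Setting a proxy server in pyspider's crawl_config has no effect – SegmentFault 思否
Or should it go into the global crawl_config?
Help: using multiple proxies with pyspider – V2EX
How to set a proxy, details in the email – Google Groups
Configuring an authenticated squid proxy pool for pyspider – 简书
self.crawl – pyspider Chinese documentation – pyspider中文网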
PySpider proxy
self.crawl – pyspider
“proxy
proxy server of username:password@hostname:port to use, only http proxy is supported currently.
class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }
Handler.crawl_config can be used with proxy to set a proxy for whole project.”
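Before putting a value into crawl_config, it can help to check that the string really has the documented username:password@hostname:port shape. The following is a minimal sketch using only the Python standard library; the helper name split_proxy is hypothetical and is not part of PySpider.

```python
from urllib.parse import urlsplit


def split_proxy(proxy):
    """Split 'username:password@hostname:port' into its parts.

    Hypothetical helper for illustration; PySpider does not ship it.
    """
    # Prefixing '//' makes urlsplit treat the string as a network location.
    parts = urlsplit("//" + proxy)
    return {
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,
    }


if __name__ == "__main__":
    print(split_proxy("localhost:8080"))
    print(split_proxy("user:secret@127.0.0.1:1087"))
```

A missing or non-numeric port shows up immediately this way, instead of as a confusing fetch failure later.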
Let's try it with the local ss (Shadowsocks) proxy:
crawl_config = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "proxy": "127.0.0.1:1087",
}
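Before blaming PySpider, it is worth confirming that something is actually listening on the proxy address in this config. A small sketch, assuming the local proxy is at 127.0.0.1:1087 as above; port_is_open is a hypothetical helper name, and an open port only shows that some service accepts connections there, not that it is a working HTTP proxy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1:1087 is the proxy address used in crawl_config above.
    print(port_is_open("127.0.0.1", 1087))
```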
Result:
the page still fails to load:
[304] https://www.scholastic.com/teachers/bookwizard/ 13.279
[I 181010 15:53:32 tornado_fetcher:520] [304] ScholasticStorybook:34b1c45f09fa84805dd1697c1809e8c9 https://www.scholastic.com/teachers/bookwizard/ 13.28s
[I 181010 15:53:47 tornado_fetcher:188] [200] ScholasticStorybook:data:,on_start data:,on_start 0s
console: Error, missing Report Suite ID in AppMeasurement initialization
Error: Unexpected token '}'
Function@[native code]
compile@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:213:122
parse@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:238:288
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:117:343
$watch@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:127:350
link@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:168:435
ea@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:73:294
D@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:62:192
g@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:55:106
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:54:250
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:56:80
k@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:60:378
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:254:336
$digest@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:131:151
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
$apply@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:134:85
g@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:87:450
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
T@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:92:51
onload@
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:93:79
 <a href="#" class="s-page back-to-top-link pagination-arrow-right ng-scope" ng-if="!SearchResults.finalPage" ng-click="!SearchResults.loading &amp;&amp; SearchResults.loadNextPage();pageChange();" ng-class="{'disabled':SearchResults.finalPage, 'disabled-link':SearchResults.finalPage,'loading':SearchResults.loading,'disabled-link':SearchResults.actualPage==SearchResults.lastPage}" target="_self">

https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:107
console: [object Object]
Request error: #80 [202=Error downloading https://shop.pe/widget/main/init/params?siteid=59d3b490d559308d854e75a8&product=Book Wizard%3A Teachers%2C Find and Level Books for Your Classroom %7C Scholastic&product_url=http%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&image=http%3A%2F%2Fwww.scholastic.com%2F%2F&price=&currency=undefined&rating=0&rating_count=0&review_count=0&stock_status=&description=Level your classroom library or find books at just the right level for students with Book Wizard%2C the book finder from Scholastic with Guided Reading%2C Lexile® Measure%2C an&update_product=true&subcategory=&url=https%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&callback=AddShoppersWidget.load_widget&no_cookie_callback=AddShoppersWidget.load_no_cookie&rand=85958&cookie=&referer= - server replied: Forbidden]
https://shop.pe/widget/main/init/params?siteid=59d3b490d559308d854e75a8&product=Book%20Wizard%3A%20Teachers%2C%20Find%20and%20Level%20Books%20for%20Your%20Classroom%20%7C%20Scholastic&product_url=http%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&image=http%3A%2F%2Fwww.scholastic.com%2F%2F&price=&currency=undefined&rating=0&rating_count=0&review_count=0&stock_status=&description=Level%20your%20classroom%20library%20or%20find%20books%20at%20just%20the%20right%20level%20for%20students%20with%20Book%20Wizard%2C%20the%20book%20finder%20from%20Scholastic%20with%20Guided%20Reading%2C%20Lexile%C2%AE%20Measure%2C%20an&update_product=true&subcategory=&url=https%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&callback=AddShoppersWidget.load_widget&no_cookie_callback=AddShoppersWidget.load_no_cookie&rand=85958&cookie=&referer=
[304] https://www.scholastic.com/teachers/bookwizard/ 15.054
[I 181010 15:54:04 tornado_fetcher:520] [304] ScholasticStorybook:34b1c45f09fa84805dd1697c1809e8c9 https://www.scholastic.com/teachers/bookwizard/ 15.05s
Tried switching the proxy to global mode:
it failed outright with an error.
Gave up on global mode.
Then I noticed:
http://docs.pyspider.org/en/latest/apis/self.crawl/#validate_cert
“validate_cert
For HTTPS requests, validate the server’s certificate? default: True”
Could the problem here be related to HTTPS certificate validation?
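One way to test that hypothesis would be to switch certificate validation off for the whole project and see whether the failures change. This was not tried in the investigation here; the fragment below is only a sketch and assumes validate_cert is accepted in crawl_config in the same way as proxy. Disabling validation weakens HTTPS and should stay a temporary diagnostic.

```python
# Diagnostic only: an assumption to test, not a confirmed fix.
crawl_config = {
    "proxy": "127.0.0.1:1087",  # local ss HTTP proxy address used in this post
    "validate_cert": False,     # documented self.crawl option; default is True
}
```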
Also searched for:
PySpider Error, missing Report Suite ID in AppMeasurement initialization
“Error, missing Report Suite ID in AppMeasurement initialization” is displayed in Browser Console
Error, missing Report Suite ID in AppMeasuremen… | Adobe Community
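Nothing relevant turned up.
But first, see:
[Mostly solved] PySpider gets a 304 when opening a page
To confirm whether the proxy setting above was actually taking effect, I deliberately changed the port to an arbitrary wrong value, and found:
the page could still be opened (although the original problem remained)
-> which proves that
the proxy set earlier had no effect.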
pyspider proxy not work
Proxy settings · Issue #524 · binux/pyspider
Changed it to:
# "proxy": "127.0.0.1:10870",
"proxy": "localhost:1087",
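The success rate seemed much higher.
Later testing showed:
[Summary]
In PySpider, network requests appear to go through the network settings of the current (local Mac) system:
  • With the ss proxy enabled on the Mac itself, PySpider can open youtube and other sites that require getting past the firewall
    • even when PySpider itself has no proxy configured:
    crawl_config = {
        # "proxy": "127.0.0.1:10870",
        # "proxy": "127.0.0.1:1087",
        # "proxy": "localhost:1087",
    }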
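A rough way to see which proxy settings the operating system or environment exposes to a Python process is urllib.request.getproxies() from the standard library. This is only a hint: it shows what Python's own urllib would pick up, not what PySpider's fetcher actually does, and the function names below are hypothetical.

```python
import os
import urllib.request


def detected_proxies():
    """Proxies that Python's standard library would use by default."""
    return urllib.request.getproxies()


def proxy_env_vars():
    """Proxy-related environment variables visible to this process."""
    return {k: v for k, v in os.environ.items() if k.lower().endswith("_proxy")}


if __name__ == "__main__":
    print("detected:", detected_proxies())
    print("env vars:", proxy_env_vars())
```

If this reports a proxy while crawl_config has none, that would be consistent with requests being routed by the system rather than by PySpider's own setting.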
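So my impression is:
as long as the ss proxy is enabled on the local Mac, setting proxy in PySpider is unnecessary for now, and turning it on makes no difference anyway.
So, as far as letting PySpider reach sites behind the firewall goes,
with ss already enabled on the local Mac, this counts as solved for the time being.
Any further problems can be dealt with when they come up.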
