【Notes】Crawling manta.com with Scrapy

【Problem】

I have a Scrapy project on hand whose job is to crawl content from the site:

http://www.manta.com/

The existing source code is:

 

bs.py:

import requests
from bs4 import BeautifulSoup

seed_url = "http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama"

r = requests.get(seed_url)
soup = BeautifulSoup(r.text, "html.parser")

# find all the <a class="url"> links on the category page
urls = soup.find_all("a", "url")

for url in urls:
    href = url.get("href")

    r2 = requests.get(href)
    soup2 = BeautifulSoup(r2.text, "html.parser")

scrapy.cfg:

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# http://doc.scrapy.org/topics/scrapyd.html

[settings]
default = manta.settings

[deploy]
#url = http://localhost:6800/
project = manta

 

items.py:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class MantaItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html

class MantaPipeline(object):
    def process_item(self, item, spider):
        return item

 

settings.py:

# Scrapy settings for manta project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#

#BOT_NAME = 'manta'

SPIDER_MODULES = ['manta.spiders']
NEWSPIDER_MODULE = 'manta.spiders'

BOT_NAME = 'EchO!/2.0'

DOWNLOAD_TIMEOUT = 15
DOWNLOAD_DELAY = 2
COOKIES_ENABLED = True
COOKIES_DEBUG = True
RETRY_ENABLED = False


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'manta (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"

DEFAULT_REQUEST_HEADERS={
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'X-JAVASCRIPT-ENABLED': 'true',
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
}

Obviously, the core of this code is the configuration in settings.py.

There are a few other files as well; the overall file structure is:

(screenshot: the Scrapy manta project's file structure)

Correspondingly, when the returned response body is saved as an HTML file and opened, its content is:

Oops.

Before you can move on, please activate your browser cookies.

Incident Id: 51880fa5aa300

In other words, the crawler did not get the real HTML content of:

http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama

 

【Troubleshooting】

1. First, refer to:

【Notes】Working through the Scrapy Tutorial

to understand how to run a Scrapy project.

 

2. It looked like the settings.py configuration was wrong, so I first used IE9's F12 developer tools to inspect the site's own request logic:

(screenshot: the browser sends Accept: text/html for manta)

Then I tried changing settings.py accordingly, but nothing worked.

However, I noticed that with this configuration:

COOKIES_ENABLED = True

#COOKIES_ENABLED = False

COOKIES_DEBUG = True

running it produced:

2013-05-24 23:32:58+0800 [mantaspider] DEBUG: Received cookies from: <200 http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama>

        Set-Cookie: SPSI=e760b4733042a6a1291db3b406fe8bfb ; path=/; domain=.manta.com

2013-05-24 23:32:58+0800 [mantaspider] DEBUG: Crawled (200) <GET http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama> (referer: http://www.manta.com)

That is, the first visit to:

http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama

only returned a single cookie: SPSI

Debugging showed that IE9 actually gets the same first response; however, because the returned content contains:

<script type="text/javascript">

    oJSOffMSG = document.getElementById('JSOffMSG'); oJSOffMSG.style.display = 'none'; try{ y=unescape(sbbvscc.replace(/^<\!\-\-\s*|\s*\-\->$/g,'')); document.getElementById('sbbhscc').innerHTML=y; x=unescape(sbbgscc.replace(/^<\!\-\-\s*|\s*\-\->$/g,'')); } catch(e){ x='function genPid() {return "jser"; }'; } document.write('<'+'script type="text/javascri'+'pt">'+x+' redirect("reload");</'+'script>');

</script>

IE9 executes the corresponding reload, which refreshes the page and re-opens:

http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama

after which it obtains the normal page content shown above.
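This reload behaviour can be reproduced outside a browser: the first response only plants the SPSI cookie and serves the JS that calls redirect("reload"); fetching the same URL a second time with that cookie yields the real HTML. A minimal sketch using requests — the helper names and the marker string are my own, taken from the "Oops" page text above, and not verified against the live site:

```python
import requests

# Marker text taken from the "Oops" cookie-check page shown above.
BLOCK_MARKER = "activate your browser cookies"

def is_cookie_block_page(html):
    """Return True if the body is manta.com's cookie-check page."""
    return BLOCK_MARKER in html.lower()

def fetch_twice(url):
    """Fetch url; if the cookie-check page comes back, request it once
    more with the session's cookies -- mimicking the JS "reload"."""
    session = requests.Session()  # keeps the SPSI cookie between requests
    resp = session.get(url)
    if is_cookie_block_page(resp.text):
        resp = session.get(url)  # second request carries the SPSI cookie
    return resp.text
```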

The debugging session confirmed exactly this logic:

The first request, too, only obtained a single cookie:

(screenshot: first response carries only one cookie)

and its HTML contained the reload:

(screenshot: first response is a redirect/reload)

 

The second request, triggered by the refresh:

(screenshot: second request is the refresh)

obtained the real page HTML:

(screenshot: the refresh got the real HTML)

As for this access pattern, in which:

http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama

has to be requested twice before the real content comes back, it seemed hard to implement in Scrapy.

3. Following:

Scrapy: handling the refresh directive in meta

I tried adding RedirectMiddleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
}

The result: the error persisted. (In hindsight this makes sense: RedirectMiddleware only follows HTTP 3xx redirects, and the meta-refresh handling only recognizes `<meta http-equiv="refresh">` tags; neither sees a redirect performed by JavaScript.)

4. According to:

Capturing http status codes with scrapy spider

it seems the page jump could be implemented with a custom redirect (downloader) middleware, but I don't yet know how.

5. By this point the code had been hacked into a mess, as follows:

# Scrapy settings for manta project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#

#BOT_NAME = 'manta'

SPIDER_MODULES = ['manta.spiders']
NEWSPIDER_MODULE = 'manta.spiders'

BOT_NAME = 'EchO!/2.0'

DOWNLOAD_TIMEOUT = 15
DOWNLOAD_DELAY = 2
COOKIES_ENABLED = True
#COOKIES_ENABLED = False
COOKIES_DEBUG = True
#RETRY_ENABLED = False
RETRY_ENABLED = True

REDIRECT_ENABLED = True

METAREFRESH_ENABLED = True

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'manta (+http://www.yourdomain.com)'
#USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"
USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)"


DEFAULT_REQUEST_HEADERS={
    #'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept': 'text/html, application/xhtml+xml, */*',
    #'Accept-Language': 'en',
    'Accept-Language': 'en-US',
    #'X-JAVASCRIPT-ENABLED': 'true',
    "Cache-Control": "no-cache",
    "Connection": "Keep-Alive",
    "UA-CPU": "AMD64",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "http://www.manta.com",
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    #'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
}


#COOKIES_DEBUG=True

It still didn't work.

 

【Summary】

Scrapy is complex enough that handling a URL whose response contains JavaScript performing a redirect should still be possible with a custom middleware; I just don't know yet how to implement it.


