
[Notes] Working through the Scrapy Tutorial

Scrapy | crifan

After installing Scrapy, I worked through the official tutorial:

Scrapy Tutorial

to try it out.


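1. Running

scrapy startproject tutorial

created a new project.

2. Following the tutorial's code, I changed items.py to the contents it gives.

3. I created dmoz_spider.py and filled in the code from the tutorial.

Frustratingly, though, the tutorial never says where the dmoz in "dmoz/spiders" is supposed to live, or at what point that folder gets created.

With nothing else to go on, I had to experiment.

First I created a dmoz folder at the same level as scrapy.cfg and the tutorial folder, created a spiders folder inside it, and put dmoz_spider.py there.

Then I ran the crawl, which failed: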

E:\Dev_Root\python\Scrapy>cd tutorial

E:\Dev_Root\python\Scrapy\tutorial>scrapy crawl dmoz
2012-11-11 19:47:27+0800 [scrapy] INFO: Scrapy 0.16.2 started (bot: tutorial)
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled item pipelines:
Traceback (most recent call last):
  File "E:\dev_install_root\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "E:\dev_install_root\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 156, in <module>
    execute()
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 76, in _run_print_help
    func(*a, **kw)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\commands\crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\spidermanager.py", line 43, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'

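A frustrating tutorial: it clearly fails to explain the path.

Later I found:

scrapy newbie: tutorial. error when running scrapy crawl dmoz

Following it, I moved dmoz_spider.py into tutorial/tutorial/spiders and ran the crawl again, and this time it worked: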

E:\Dev_Root\python\Scrapy\tutorial>scrapy crawl dmoz
2012-11-11 19:51:40+0800 [scrapy] INFO: Scrapy 0.16.2 started (bot: tutorial)
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled item pipelines:
2012-11-11 19:51:40+0800 [dmoz] INFO: Spider opened
2012-11-11 19:51:40+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-11-11 19:51:41+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2012-11-11 19:51:41+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2012-11-11 19:51:41+0800 [dmoz] INFO: Closing spider (finished)
2012-11-11 19:51:41+0800 [dmoz] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 530,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 13061,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2012, 11, 11, 11, 51, 41, 506000),
         'log_count/DEBUG': 8,
         'log_count/INFO': 4,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2012, 11, 11, 11, 51, 40, 630000)}
2012-11-11 19:51:41+0800 [dmoz] INFO: Spider closed (finished)

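Scrapy's documentation, it seems, still leaves a lot to be desired.

Even this most basic tutorial explains the paths so poorly that it causes confusion. Really disappointing.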
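
For the record, the location matters because, as far as I understand it, Scrapy only looks for spiders inside the modules named in the SPIDER_MODULES setting, which startproject sets to tutorial.spiders, and then matches the name passed to scrapy crawl against each spider's name attribute. The sketch below is a simplified model of that lookup using only the Python 3 standard library; it is not Scrapy's actual code. The find_spiders and build_demo_project helpers, the stand-in classes, and the temporary directory are all made up for illustration.

```python
import importlib
import inspect
import pkgutil
import sys
import tempfile
from pathlib import Path


def find_spiders(package_name):
    """Return {name: class} for classes declaring a string `name` in the package's modules."""
    package = importlib.import_module(package_name)
    found = {}
    for info in pkgutil.iter_modules(package.__path__, package_name + "."):
        module = importlib.import_module(info.name)
        for _, cls in inspect.getmembers(module, inspect.isclass):
            # Only count classes defined in this module that declare a name.
            defined_here = cls.__module__ == module.__name__
            if defined_here and isinstance(getattr(cls, "name", None), str):
                found[cls.name] = cls
    return found


def build_demo_project(root):
    """Create a throwaway layout shaped like the startproject output (hypothetical helper)."""
    spiders_dir = Path(root) / "tutorial" / "tutorial" / "spiders"
    spiders_dir.mkdir(parents=True)
    (spiders_dir.parent / "__init__.py").write_text("")
    (spiders_dir / "__init__.py").write_text("")
    # Stand-in spider: a plain class, not a real Scrapy spider.
    (spiders_dir / "dmoz_spider.py").write_text(
        'class DmozSpider(object):\n    name = "dmoz"\n'
    )
    # A second file placed next to scrapy.cfg, like the failed first attempt.
    misplaced = Path(root) / "tutorial" / "dmoz" / "spiders"
    misplaced.mkdir(parents=True)
    (misplaced / "other_spider.py").write_text(
        'class OtherSpider(object):\n    name = "other"\n'
    )
    return Path(root) / "tutorial"


demo_root = build_demo_project(tempfile.mkdtemp())
sys.path.insert(0, str(demo_root))  # the directory that would hold scrapy.cfg
importlib.invalidate_caches()
spiders = find_spiders("tutorial.spiders")
print(sorted(spiders))  # -> ['dmoz']
```

The class under dmoz/spiders is never seen because that folder is not part of the tutorial.spiders package, which would be consistent with the KeyError shown earlier.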

4. After that, I kept following the code given in the tutorial and tested it. The final version of dmoz_spider.py was:

# Scrapy 0.16 API, as used by the tutorial at the time
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem


class DmozSpider(BaseSpider):
    # The name that "scrapy crawl dmoz" looks up
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Every <li> inside any <ul> on the page
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            # extract() returns a list of strings for each XPath
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

Running:

scrapy crawl dmoz -o items.json -t json

produced the output file items.json:

[{"desc": ["\n                "], "link": ["/"], "title": ["Top"]},
{"desc": [], "link": ["/Computers/"], "title": ["Computers"]},
{"desc": [], "link": ["/Computers/Programming/"], "title": ["Programming"]},
{"desc": [], "link": ["/Computers/Programming/Languages/"], "title": ["Languages"]},
{"desc": [], "link": ["/Computers/Programming/Languages/Python/"], "title": ["Python"]},
{"desc": ["\n                  \t", "\u00a0", "\n                "], "link": [], "title": []},
{"desc": ["\n                        ", " \n                        ", "\n                    "], "link": ["/Computers/Programming/Languages/Python/Resources/"], "title": ["Computers: Programming: Languages: Python: Resources"]},
...
]
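
The output is noisy: every field is a list, some entries are nothing but whitespace, and the links are relative. Presumably that is because '//ul/li' also matches navigation items such as the breadcrumb trail. A minimal sketch of one way to tidy such items afterwards, in Python 3 with only the standard library, is shown below. The clean_item helper and the BASE_URL value are assumptions for illustration, and the sample data is a shortened copy of the output above.

```python
import json
from urllib.parse import urljoin

# Assumption: the relative links are relative to the site root.
BASE_URL = "http://www.dmoz.org/"

# Shortened sample in the same shape as items.json.
raw = json.loads(r'''[
{"desc": ["\n                "], "link": ["/"], "title": ["Top"]},
{"desc": [], "link": ["/Computers/"], "title": ["Computers"]},
{"desc": ["\n \t", "\u00a0", "\n "], "link": [], "title": []}
]''')


def clean_item(item, base_url=BASE_URL):
    """Collapse the extracted lists into plain values; return None for empty entries."""
    title = " ".join(t.strip() for t in item.get("title", []) if t.strip())
    links = [urljoin(base_url, href) for href in item.get("link", [])]
    desc = " ".join(d.strip() for d in item.get("desc", []) if d.strip())
    if not title and not links:
        return None  # nothing but whitespace was extracted
    return {"title": title, "link": links[0] if links else None, "desc": desc}


cleaned = [c for c in map(clean_item, raw) if c is not None]
for entry in cleaned:
    print(entry)
```

With the sample above, the whitespace-only third entry is dropped and the two remaining links become absolute URLs. A narrower XPath in the spider itself would probably be the cleaner fix, but that depends on the page markup, which is not shown here.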

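[Summary]

From a quick look through some of the links it provides, Scrapy seems to be quite powerful.

What remains is to read, when I have time,

Scrapy 0.17 documentation

which covers almost everything and is worth digging into.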



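Latest reader comments (2)

  1. On step 3: you probably didn't finish reading the tutorial yourself.
    ty10086 (2012-11-15)
    • I did read all of it. The tutorial only says to save dmoz_spider.py under dmoz/spiders, but it never explains where the dmoz folder comes from or where it should be created. So let me ask: did you read the whole tutorial? Where does it explain where dmoz (and its spiders subfolder) should be created?
      crifan (2012-11-15)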