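Without accumulating small steps, one cannot travel a thousand miles; without gathering small streams, there can be no rivers and seas.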

Dean's blog

A Quick Tour of the Scrapy Command-Line Tool

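Scrapy ships with a rich command-line tool whose commands fall into two groups: global commands and project commands. When scrapy is run outside a project directory, only the global commands are listed.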

Global commands

F:\>Scrapy
Scrapy 1.6.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

bench

Runs a quick benchmark that measures the maximum crawl speed the machine can reach. For example:

F:\>scrapy bench
2019-12-29 23:35:22 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-12-29 23:35:22 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2019-12-29 23:35:23 [scrapy.crawler] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2019-12-29 23:35:23 [scrapy.extensions.telnet] INFO: Telnet Password: 62eabe350420384d
2019-12-29 23:35:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
2019-12-29 23:35:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-12-29 23:35:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-12-29 23:35:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-12-29 23:35:24 [scrapy.core.engine] INFO: Spider opened
2019-12-29 23:35:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-12-29 23:35:25 [scrapy.extensions.logstats] INFO: Crawled 61 pages (at 3660 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:26 [scrapy.extensions.logstats] INFO: Crawled 109 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:27 [scrapy.extensions.logstats] INFO: Crawled 165 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:28 [scrapy.extensions.logstats] INFO: Crawled 221 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:29 [scrapy.extensions.logstats] INFO: Crawled 261 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:30 [scrapy.extensions.logstats] INFO: Crawled 301 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:31 [scrapy.extensions.logstats] INFO: Crawled 341 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:32 [scrapy.extensions.logstats] INFO: Crawled 373 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:33 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:34 [scrapy.extensions.logstats] INFO: Crawled 445 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:34 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2019-12-29 23:35:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 188301,
 'downloader/request_count': 461,
 'downloader/request_method_count/GET': 461,
 'downloader/response_bytes': 1244709,
 'downloader/response_count': 461,
 'downloader/response_status_count/200': 461,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 12, 29, 15, 35, 35, 95122),
 'log_count/INFO': 19,
 'request_depth_max': 16,
 'response_received_count': 461,
 'scheduler/dequeued': 461,
 'scheduler/dequeued/memory': 461,
 'scheduler/enqueued': 9221,
 'scheduler/enqueued/memory': 9221,
 'start_time': datetime.datetime(2019, 12, 29, 15, 35, 24, 264562)}
2019-12-29 23:35:35 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

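The output shows that it sent 461 requests in total, at a crawl speed of roughly 1920 to 3660 pages per minute. Treat this speed only as a reference. A real spider also has to extract and store data, and on sites with anti-scraping measures it has to crawl slowly to stay stable, so it will not reach this benchmark figure.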
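To show where the 1920 to 3660 range comes from, here is a minimal sketch that pulls the per-interval rates out of the LogStats lines. The function name `summarize_bench` and the regular expression are my own, not part of Scrapy. Because the run overrides LOGSTATS_INTERVAL to 1, each reported rate is the number of pages fetched in the last second multiplied by 60 (for example 61 × 60 = 3660).

```python
import re

# Matches the periodic LogStats lines printed by `scrapy bench`, e.g.
# "... INFO: Crawled 61 pages (at 3660 pages/min), scraped 0 items (at 0 items/min)"
LOGSTATS_RE = re.compile(r"Crawled (\d+) pages \(at (\d+) pages/min\)")


def summarize_bench(log_text):
    """Return the last page count plus the slowest and fastest interval rates."""
    samples = [(int(pages), int(rate)) for pages, rate in LOGSTATS_RE.findall(log_text)]
    # The first sample is taken before any page is fetched (rate 0), so skip zeros.
    rates = [rate for _, rate in samples if rate > 0]
    if not rates:
        return None
    return {"pages": samples[-1][0], "min_rate": min(rates), "max_rate": max(rates)}


# Three lines copied from the run above.
sample_log = """\
2019-12-29 23:35:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:25 [scrapy.extensions.logstats] INFO: Crawled 61 pages (at 3660 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:34 [scrapy.extensions.logstats] INFO: Crawled 445 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
"""

summary = summarize_bench(sample_log)
print(summary)
```

Feeding it the full log instead of the three sample lines should give the same minimum and maximum, since those two lines are the fastest and slowest intervals of the run.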

fetch

Downloads a page with the Scrapy downloader and writes it to standard output. For example:

scrapy fetch http://www.mirthsoft.com

Pages are usually long, though, so the console cannot show the whole content.

genspider

Creates a spider. It was also used in the Scrapy Quick Preview post. The format is:

scrapy genspider [-t template] name domain

runspider

Runs a spider from outside a project directory. It is similar to the crawl command. The format is:

scrapy runspider [options] <spider_file>

To run the spider created in the Scrapy Quick Preview post, you can do this:

scrapy runspider F:/Codes/fastDemo/fastDemo/spiders/demo.py

This produces the same result as running the spider with the crawl command.

settings

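Reads the values of Scrapy settings. It only reads them and does not change them. For example, scrapy settings --get BOT_NAME prints the value of BOT_NAME. The format is: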

scrapy settings [options]

shell

Opens an interactive console for testing how to extract content from a page. It is a very practical command. The format is:

scrapy shell [url|file]

For example:

scrapy shell http://www.mirthsoft.com

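This lets you try out extraction code on the command line and then copy it straight into your spider. It will be covered in detail in a later post.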

startproject

Creates a new project. It was also used in the Scrapy Quick Preview post. The format is:

scrapy startproject <project_name> [project_dir]

version

Prints the Scrapy version. The format is:

scrapy version

view

Opens the fetched page in the default browser. The format is:

scrapy view [options] <url>

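It saves the fetched content to the system temporary directory. The page that opens looks the same as the original site because a <base> tag is added inside <head>, so relative links to images and stylesheets still resolve against the original site.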
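The <base> trick can be illustrated with a small stand-alone sketch. This is not Scrapy's own code. The helper name `add_base_tag` and the sample page are assumptions made for illustration only.

```python
import re


def add_base_tag(html, url):
    """Insert <base href="url"> right after the opening <head> tag."""
    if re.search(r"<base\b", html, flags=re.IGNORECASE):
        return html  # the page already declares its own base URL
    base = '<base href="{}">'.format(url)
    # Keep the original <head ...> tag and append the <base> element to it.
    return re.sub(
        r"(<head\b[^>]*>)",
        lambda match: match.group(1) + base,
        html,
        count=1,
        flags=re.IGNORECASE,
    )


# Hypothetical page that refers to an image with a relative URL.
page = '<html><head><title>t</title></head><body><img src="/logo.png"></body></html>'
result = add_base_tag(page, "http://www.mirthsoft.com")
print(result)
```

With the <base> element in place, a browser that opens the saved copy from the temporary directory resolves /logo.png against http://www.mirthsoft.com instead of the local file system.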

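Those are the global commands Scrapy provides. When scrapy is run inside a project directory, the command list it prints contains both the global commands and the project commands, as shown here: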

Project commands

F:\Codes\fastDemo>scrapy
Scrapy 1.6.0 - project: fastDemo

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

Of these:

check

Tests spider callbacks by means of contracts. The format is:

scrapy check [options] <spider>

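The built-in contracts currently are:

@url url - sets the sample URL used to check the callback
@cb_kwargs {"arg1": "value1", "arg2": "value2", ...} - sets the cb_kwargs attribute of the sample request
@returns item(s)|request(s) [min [max]] - sets lower and upper bounds on the number of items or requests returned
@scrapes field_1 field_2 ... - checks that every returned item contains the listed fields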
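Contracts are written in the docstring of a callback. The sketch below shows what such a docstring might look like and lists the contract lines it contains. The bounds and the field names title and link are hypothetical, the callback is a bare function rather than a method of a real spider, and `list_contracts` is a simplified illustration, not the parser Scrapy uses.

```python
import inspect


def parse(response):
    """Hypothetical callback; in a project this would be a method of the spider.

    @url http://www.mirthsoft.com
    @returns items 1 10
    @returns requests 0 0
    @scrapes title link
    """


def list_contracts(callback):
    """Collect (name, arguments) pairs from the @-lines of a callback docstring."""
    contracts = []
    for line in inspect.getdoc(callback).splitlines():
        line = line.strip()
        if line.startswith("@"):
            name, _, args = line[1:].partition(" ")
            contracts.append((name, args.split()))
    return contracts


contracts = list_contracts(parse)
for name, args in contracts:
    print(name, args)
```

Read this way, the docstring asks for the sample URL to be fetched, for the callback to return between 1 and 10 items and no further requests, and for every item to carry the title and link fields.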

If the built-in contracts are not enough, you can also define your own. This will be covered in detail in a later post.

crawl

Runs a spider. The format is:

scrapy crawl [options] <spider>

It is similar to the global command runspider. The difference is that runspider takes the spider's file name, while crawl takes the spider's name.

edit

Opens a spider file for editing, using the editor named in the EDITOR environment variable. The format is:

scrapy edit <spider>

Note that the value of EDITOR must not contain spaces, otherwise the command fails with an error.
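One plausible reason for the restriction on spaces is sketched below. It assumes the editor value is placed unquoted in front of the spider file path before the command line is handed to the shell. That is my assumption about how the command is built, and `shlex` only approximates the splitting a real shell performs. The helper name `editor_command` and both editor paths are hypothetical.

```python
import shlex


def editor_command(editor, spider_file):
    """Split 'EDITOR "file"' into arguments roughly the way a shell would."""
    return shlex.split('{} "{}"'.format(editor, spider_file))


# An editor path without spaces stays in one piece.
ok = editor_command("notepad", "demo.py")

# A path containing a space is cut in two, so the shell looks for a
# program called "C:/Program" and the launch fails.
broken = editor_command("C:/Program Files/Notepad++/notepad++.exe", "demo.py")

print(ok)
print(broken)
```

Pointing EDITOR at an editor that is on PATH, or at an install path without spaces, avoids the problem.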

list

Lists the spiders available in the project. The format is:

scrapy list

 
