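Scrapy ships with a rich command-line tool whose commands fall into two groups: global commands and project-only commands. When scrapy is run outside a project directory, it lists only the global commands.
Global commands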
F:\>Scrapy
Scrapy 1.6.0 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
bench
Runs a quick benchmark that measures the fastest crawl speed Scrapy can reach on this machine. For example:
F:\>scrapy bench
2019-12-29 23:35:22 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-12-29 23:35:22 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2019-12-29 23:35:23 [scrapy.crawler] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2019-12-29 23:35:23 [scrapy.extensions.telnet] INFO: Telnet Password: 62eabe350420384d
2019-12-29 23:35:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.logstats.LogStats']
2019-12-29 23:35:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-12-29 23:35:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-12-29 23:35:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-12-29 23:35:24 [scrapy.core.engine] INFO: Spider opened
2019-12-29 23:35:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-12-29 23:35:25 [scrapy.extensions.logstats] INFO: Crawled 61 pages (at 3660 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:26 [scrapy.extensions.logstats] INFO: Crawled 109 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:27 [scrapy.extensions.logstats] INFO: Crawled 165 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:28 [scrapy.extensions.logstats] INFO: Crawled 221 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:29 [scrapy.extensions.logstats] INFO: Crawled 261 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:30 [scrapy.extensions.logstats] INFO: Crawled 301 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:31 [scrapy.extensions.logstats] INFO: Crawled 341 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:32 [scrapy.extensions.logstats] INFO: Crawled 373 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:33 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:34 [scrapy.extensions.logstats] INFO: Crawled 445 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-12-29 23:35:34 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2019-12-29 23:35:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 188301,
'downloader/request_count': 461,
'downloader/request_method_count/GET': 461,
'downloader/response_bytes': 1244709,
'downloader/response_count': 461,
'downloader/response_status_count/200': 461,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2019, 12, 29, 15, 35, 35, 95122),
'log_count/INFO': 19,
'request_depth_max': 16,
'response_received_count': 461,
'scheduler/dequeued': 461,
'scheduler/dequeued/memory': 461,
'scheduler/enqueued': 9221,
'scheduler/enqueued/memory': 9221,
'start_time': datetime.datetime(2019, 12, 29, 15, 35, 24, 264562)}
2019-12-29 23:35:35 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)
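The output shows that the run sent 461 requests in total, at roughly 1920 to 3660 pages per minute. Treat this speed only as a reference. A real spider also has to extract and store data, and on sites with anti-scraping measures it must crawl slowly to stay stable, so it will not reach this benchmark figure.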
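The per-second rates in the log can be reproduced from the page counters: the overridden settings show LOGSTATS_INTERVAL set to 1, so each reported rate is the number of pages crawled since the previous line, scaled up to a minute. The following is a minimal Python sketch of that arithmetic, using the counters and timestamps copied from the log above; the helper name rates_per_minute is made up for illustration.

```python
from datetime import datetime

# Cumulative "Crawled N pages" counters copied from the bench log above,
# one sample per second (LOGSTATS_INTERVAL was 1).
crawled = [0, 61, 109, 165, 221, 261, 301, 341, 373, 413, 445]


def rates_per_minute(counters, interval_seconds=1.0):
    """Turn cumulative page counters into pages-per-minute rates."""
    scale = 60.0 / interval_seconds
    return [(b - a) * scale for a, b in zip(counters, counters[1:])]


rates = rates_per_minute(crawled)
print("slowest window:", min(rates), "fastest window:", max(rates))

# Average over the whole run, from start_time/finish_time in the stats dump.
start = datetime(2019, 12, 29, 15, 35, 24, 264562)
finish = datetime(2019, 12, 29, 15, 35, 35, 95122)
elapsed = (finish - start).total_seconds()
average = 461 / elapsed * 60  # 461 responses were received in total
print("average pages/min:", round(average))
```

The slowest and fastest windows match the 1920 and 3660 figures quoted above, and the whole-run average falls between them.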
fetch
Downloads a page with the Scrapy downloader and writes it to standard output. For example:
scrapy fetch http://www.mirthsoft.com
Pages are usually long, however, so the terminal cannot display the whole content.
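Because the page goes to standard output, one way around this is to capture it in a file instead. The sketch below does that from Python with the standard subprocess module. The --nolog option suppresses Scrapy's log lines; the helper names and the idea of saving to a file are assumptions added for illustration, not part of the original walkthrough.

```python
import subprocess
from pathlib import Path


def build_fetch_command(url):
    """Return the argument list for `scrapy fetch` with logging disabled."""
    return ["scrapy", "fetch", "--nolog", url]


def fetch_to_file(url, target):
    """Run `scrapy fetch` and save the page body it writes to stdout."""
    result = subprocess.run(
        build_fetch_command(url),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if scrapy exits with an error
    )
    Path(target).write_bytes(result.stdout)
    return len(result.stdout)
```

Calling fetch_to_file("http://www.mirthsoft.com", "page.html") would then leave the full page in page.html (a hypothetical file name), where it can be opened in an editor.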
genspider
Creates a new spider; it was also used in the Scrapy quick preview. The syntax is:
scrapy genspider [-t template] name domain
runspider
Runs a spider from outside a project directory, much like the crawl command. The syntax is:
scrapy runspider [options] <spider_file>
To run the spider created in the Scrapy quick preview, for example:
scrapy runspider F:/Codes/fastDemo/fastDemo/spiders/demo.py
The result is the same as running the spider with the crawl command.
settings
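Prints the values of Scrapy settings; it only reads them and does not change anything. Inside a project it shows the project's values, and outside one it shows Scrapy's defaults. A single value can be requested with the --get option, for example scrapy settings --get BOT_NAME. The syntax is: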
scrapy settings [options]
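As an illustration, a setting can also be read from a script by running the command and capturing what it prints. This is a minimal sketch using the standard subprocess module; the helper names settings_command and get_setting are hypothetical.

```python
import subprocess


def settings_command(name):
    """Return the argument list that asks Scrapy for one setting."""
    return ["scrapy", "settings", "--get", name]


def get_setting(name):
    """Run `scrapy settings --get NAME` and return the printed value."""
    result = subprocess.run(
        settings_command(name),
        stdout=subprocess.PIPE,
        text=True,   # decode stdout as text (Python 3.7+)
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Outside a project, get_setting("BOT_NAME") should return scrapybot, the bot name that also appears in the first line of the bench log above; inside a project it should return that project's own value.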
shell
Opens an interactive console for testing page-extraction code; it is a very handy command. The syntax is:
scrapy shell [url|file]
For example:
scrapy shell http://www.mirthsoft.com
This lets you try out extraction code at the prompt and then apply it directly in the spider. It is covered in detail later.
startproject
Creates a new project; it was also used in the Scrapy quick preview. The syntax is:
scrapy startproject <project_name> [project_dir]
version
Prints the Scrapy version. The syntax is:
scrapy version
view
Opens the fetched page in the default browser, as Scrapy sees it. The syntax is:
scrapy view [options] <url>
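It saves the downloaded content to the system temporary directory. The opened page looks the same as the original site because Scrapy adds a <base> tag inside <head>, so relative links to stylesheets and images still resolve against the original URL.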
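The effect of the <base> tag can be reproduced with a few lines of standard-library Python. This is a simplified illustration of the idea, not Scrapy's actual implementation; the function names are made up for the example.

```python
import re
import tempfile
import webbrowser
from pathlib import Path


def add_base_tag(html, url):
    """Insert <base href="url"> right after the opening <head> tag."""
    base = '<base href="{}">'.format(url)
    # Match <head> or <head ...attributes...>, but not <header>.
    pattern = re.compile(r"(<head(?:\s[^>]*)?>)", re.IGNORECASE)
    # count=1: only the first opening <head> tag is touched.
    return pattern.sub(lambda m: m.group(1) + base, html, count=1)


def open_copy_in_browser(html, url):
    """Write the patched page to a temporary file and open it."""
    patched = add_base_tag(html, url)
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(patched)
    webbrowser.open(Path(handle.name).as_uri())
    return handle.name
```

With the tag in place, a relative reference such as src="a.png" in the local copy is resolved by the browser against the original site rather than against the temporary directory.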
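Those are the global commands that Scrapy provides. When scrapy is run inside a project directory, the command list contains both the global commands and the project commands, as shown below:
Project commands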
F:\Codes\fastDemo>scrapy
Scrapy 1.6.0 - project: fastDemo
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
Of these, the project-only commands are:
check
Tests spider callbacks by means of contracts. The syntax is:
scrapy check [options] <spider>
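The built-in contracts currently are:
@url url  Sets the sample URL used when checking the callback
@cb_kwargs {"arg1": "value1", "arg2": "value2", ...}  Sets the cb_kwargs attribute of the sample request
@returns item(s)|request(s) [min [max]]  Sets lower and upper bounds on the number of items or requests returned
@scrapes field_1 field_2 ...  Checks that every returned item contains the listed fields
If the built-in contracts are not enough, you can define your own; this is covered in detail later.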
crawl
Runs a spider. The syntax is:
scrapy crawl [options] <spider>
It is similar to the global command runspider. The difference is that runspider takes the path of the spider file, whereas crawl takes the spider's name.
edit
Opens a spider file for editing in the editor named by the EDITOR environment variable. The syntax is:
scrapy edit <spider>
Note that the EDITOR value must not contain spaces; otherwise the command fails with an error.
list
Lists the spiders available in the project. The syntax is:
scrapy list