# 5.2  Advanced Scraping: Scrapy, an Efficient and Hassle-Free Crawling Library

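**Scrapy** is an integrated crawling framework with a very robust management system. It is also often used as the foundation for distributed crawling.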

![](https://morvanzhou.github.io/static/results/scraping/5-2-2.png)

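You must give the **spider** a name and list the initial pages to crawl in **start\_urls**. **Scrapy** filters duplicate requests automatically, so you do not need to track visited URLs yourself.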

```python
import scrapy

class MofanSpider(scrapy.Spider):
    name = "mofan"
    start_urls = [
        'https://morvanzhou.github.io/',
    ]
    # unseen = set()
    # seen = set()      # we no longer need these sets; Scrapy deduplicates automatically
    def parse(self, response):
        yield {     # return some results
            'title': response.css('h1::text').extract_first(default='Missing').strip().replace('"', ""),
            'url': response.url,
        }

        urls = response.css('a::attr(href)').re(r'^/.+?/$')     # find all sub urls
        for url in urls:
            yield response.follow(url, callback=self.parse)     # duplicate requests are filtered automatically
```
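To make clear what the automatic deduplication saves you, here is a minimal sketch of the manual bookkeeping that the commented-out `seen` and `unseen` variables stand for. It is not Scrapy's implementation: `FAKE_SITE` and `crawl` are hypothetical names, the "site" is an in-memory dictionary rather than real pages, and a `deque` is used for `unseen` so that the visiting order is predictable.

```python
from collections import deque

# Hypothetical in-memory "site": each path maps to the links found on that page.
FAKE_SITE = {
    "/": ["/a/", "/b/"],
    "/a/": ["/", "/b/"],
    "/b/": ["/a/", "/c/"],
    "/c/": [],
}

def crawl(start, links_of):
    """Breadth-first crawl that visits every reachable page exactly once."""
    seen = {start}            # pages that have already been scheduled
    unseen = deque([start])   # pages still waiting to be visited
    order = []
    while unseen:
        page = unseen.popleft()
        order.append(page)
        for link in links_of(page):
            if link not in seen:   # the duplicate filter
                seen.add(link)
                unseen.append(link)
    return order

visited = crawl("/", lambda page: FAKE_SITE.get(page, []))
print(visited)
```

In the spider above, none of this is written by hand: every request yielded from `parse` goes through Scrapy's scheduler, which drops the ones it has already seen.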

There is no need to call **urljoin()**: **follow()** accepts relative **url**s and resolves them against the URL of the current page.
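For comparison, the following sketch shows roughly what you would have to do yourself without `follow()`, using only the standard library. The regular expression is the one from the spider; `page_url` and the `hrefs` list are made-up values for illustration.

```python
import re
from urllib.parse import urljoin

# Assumed values: the page being parsed and the href attributes found on it.
page_url = "https://morvanzhou.github.io/tutorials/"
hrefs = [
    "/about/",
    "/tutorials/python-basic/",
    "https://example.com/",
    "#top",
    "/static/img.png",
]

# Same pattern as in the spider: keep only site-relative paths ending in "/".
pattern = re.compile(r'^/.+?/$')
relative = [h for h in hrefs if pattern.match(h)]

# Resolve each relative path against the current page URL.
absolute = [urljoin(page_url, h) for h in relative]

for url in absolute:
    print(url)
```

With `response.follow(url, ...)`, the resolution step is handled for you, so the spider can pass the relative paths straight through.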

```shell
$ scrapy runspider 5-2-scrapy.py -o res.json -s FEED_EXPORT_ENCODING=utf-8
```

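**-o res.json**: the **-o** flag is the output option. After the run, you will find a file named **res.json** in that folder, containing every **{title:, url:}** item that was found. The **-s FEED\_EXPORT\_ENCODING=utf-8** setting writes non-ASCII characters (such as Chinese titles) as-is instead of as `\uXXXX` escapes.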
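Once the file exists, it can be inspected with the standard `json` module. This is a small sketch that assumes the file holds a single JSON array of `{title, url}` objects, as produced by one run of the command above; `load_items` is a hypothetical helper name.

```python
import json
from pathlib import Path

def load_items(path):
    """Read the scraped items (a JSON array of dicts) from the given file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Only read the file if the spider has already been run in this folder.
if Path("res.json").exists():
    for item in load_items("res.json"):
        print(item["title"], item["url"])
```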

