Python Web Scraping Study Notes

2.3 Parsing Web Pages with BeautifulSoup: Regular Expressions

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# The page contains Chinese text, so decode the raw bytes as UTF-8
html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')

Images on this page are each wrapped in a tag like this:

<td>
    <img src="https://morvanzhou.github.io/static/img/course_cover/tf.jpg">
</td>

Pass a compiled regular expression (re.compile) to BeautifulSoup's find_all as the attribute value to match:

soup = BeautifulSoup(html, features='lxml')

# Raw string so that \. is passed to the regex engine as a literal dot
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])

"""
https://morvanzhou.github.io/static/img/course_cover/tf.jpg
https://morvanzhou.github.io/static/img/course_cover/rl.jpg
https://morvanzhou.github.io/static/img/course_cover/scraping.jpg
"""
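
To see why this pattern selects the .jpg images, it can help to try it outside BeautifulSoup. The Beautiful Soup documentation states that a compiled pattern is matched against the attribute value with its search() method, so the plain `re` sketch below should behave comparably. The `srcs` list is made up for illustration (only the first URL comes from the page above), and the stricter `\.jpg$` pattern and the `matching` helper are suggested variations, not part of the original tutorial.

```python
import re

# Pattern from the tutorial, written as a raw string.
loose = re.compile(r'.*?\.jpg')
# Hypothetical stricter variant: ".jpg" must end the value.
strict = re.compile(r'\.jpg$')

# Illustrative src values; only the first comes from the page above.
srcs = [
    "https://morvanzhou.github.io/static/img/course_cover/tf.jpg",
    "https://example.com/logo.png",
    "https://example.com/photo.jpg.bak",
]

def matching(pattern, values):
    """Return the values in which pattern.search() finds a match."""
    return [v for v in values if pattern.search(v)]

loose_hits = matching(loose, srcs)
strict_hits = matching(strict, srcs)
print(loose_hits)
print(strict_hits)
```

Because search() only needs ".jpg" to appear somewhere in the value, the loose pattern also accepts "photo.jpg.bak"; anchoring with `$` is one way to avoid that if it matters for your pages.
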

# '.*?' matches any string, so this selects every <a> tag that has an href attribute
course_links = soup.find_all('a', {'href': re.compile('.*?')})
for link in course_links:
    print(link['href'])

"""
https://morvanzhou.github.io/
https://morvanzhou.github.io/tutorials/scraping
https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
"""
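
Since '.*?' accepts any href, a narrower pattern is usually more useful in practice. The following is a minimal sketch under stated assumptions: the HTML snippet is made up (loosely modeled on the links printed above), the "/tutorials/" filter is a hypothetical choice, and the built-in "html.parser" is used instead of lxml so the example runs without a network connection or extra parser.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML snippet, loosely modeled on the links printed above.
html = """
<a href="https://morvanzhou.github.io/">Home</a>
<a href="https://morvanzhou.github.io/tutorials/scraping">Scraping</a>
<a href="https://example.com/other">Elsewhere</a>
<a>No href here</a>
"""

soup = BeautifulSoup(html, features="html.parser")

# Keep only links whose href contains "/tutorials/".
tutorial_links = soup.find_all("a", {"href": re.compile(r"/tutorials/")})
hrefs = [a["href"] for a in tutorial_links]
print(hrefs)
```

The <a> tag without an href is skipped, because there is no attribute value for the pattern to match against.
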

Last updated 6 years ago