2.3 BeautifulSoup Web Page Parsing: Regular Expressions

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# If the page contains Chinese text, decode the response bytes with decode()
html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')

The images on this page are all wrapped in a tag like this:

<td>
    <img src="https://morvanzhou.github.io/static/img/course_cover/tf.jpg">
</td>

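Pass a compiled regular expression (re.compile) to BeautifulSoup's search functions. BeautifulSoup then keeps only the tags whose attribute value matches the pattern.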

soup = BeautifulSoup(html, features='lxml')

# Raw string (r'...') so that \. reaches the regex engine as an escaped dot
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])

"""
https://morvanzhou.github.io/static/img/course_cover/tf.jpg
https://morvanzhou.github.io/static/img/course_cover/rl.jpg
https://morvanzhou.github.io/static/img/course_cover/scraping.jpg
"""
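
When BeautifulSoup is given a compiled pattern, it filters attribute values with the pattern's search() method, so the pattern can match anywhere in the string. The sketch below uses only the re module to show what that means for the .jpg pattern above. The URLs in candidates are made up for illustration, and the stricter pattern is an assumption added here, not part of the original tutorial.

```python
import re

# Same pattern as in the tutorial, written as a raw string.
loose = re.compile(r'.*?\.jpg')
# Stricter variant (an assumption, not from the tutorial): .jpg must end the string.
strict = re.compile(r'\.jpg$')

# Hypothetical src values, for illustration only.
candidates = [
    "https://example.com/img/tf.jpg",
    "https://example.com/img/tf.jpg.bak",
    "https://example.com/img/logo.png",
]

# search() succeeds if the pattern matches anywhere in the string.
loose_hits = [url for url in candidates if loose.search(url)]
strict_hits = [url for url in candidates if strict.search(url)]

print(loose_hits)   # the .jpg URL and the .jpg.bak URL
print(strict_hits)  # only the URL that really ends in .jpg
```

For the tutorial page both patterns give the same three image links. The anchored form only matters on pages where ".jpg" can appear in the middle of a src value.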

# '.*?' matches any href value, so this returns every <a> tag that has an href attribute
course_links = soup.find_all('a', {'href': re.compile('.*?')})
for link in course_links:
    print(link['href'])

"""
https://morvanzhou.github.io/
https://morvanzhou.github.io/tutorials/scraping
https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
"""
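
To pick out only some of the links, the pattern can be made more specific. Here is a minimal offline sketch, assuming a small hand-written HTML string (sample_html is hypothetical and is not the tutorial's table.html) and Python's built-in html.parser instead of lxml, so it runs without a network connection.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration; not the page used in the tutorial.
sample_html = """
<ul>
  <li><a href="https://example.com/">Home</a></li>
  <li><a href="https://example.com/tutorials/scraping/">Scraping</a></li>
  <li><a href="https://example.com/tutorials/tensorflow/">TensorFlow</a></li>
  <li><a name="top">Anchor without href</a></li>
</ul>
"""

# html.parser ships with Python, so this sketch does not need lxml installed.
soup = BeautifulSoup(sample_html, features="html.parser")

# Keep only <a> tags whose href contains "/tutorials/".
tutorial_pattern = re.compile(r"/tutorials/")
tutorial_links = [a["href"] for a in soup.find_all("a", {"href": tutorial_pattern})]

for link in tutorial_links:
    print(link)
```

The home page link is skipped because its href does not contain "/tutorials/", and the last anchor is skipped because it has no href at all.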


Last updated 6 years ago
