2.3 Parsing Web Pages with BeautifulSoup: Regular Expressions
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
# If the page contains Chinese characters, decode the raw bytes as UTF-8
html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')
Images on this page all sit inside a tag like this:
<td>
<img src="https://morvanzhou.github.io/static/img/course_cover/tf.jpg">
</td>
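Pass a compiled regular expression to BeautifulSoup's find_all to match attribute values:
soup = BeautifulSoup(html, features='lxml')
# Raw string so that \. reaches the regex engine as an escaped literal dot
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])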
"""
https://morvanzhou.github.io/static/img/course_cover/tf.jpg
https://morvanzhou.github.io/static/img/course_cover/rl.jpg
https://morvanzhou.github.io/static/img/course_cover/scraping.jpg
"""
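The same filter can be tried without network access. The following is a minimal sketch, assuming a small hand-written HTML string (`sample_html` is a stand-in, not the real page, and the `example.com` image is made up for contrast) and Python's built-in `html.parser` in place of `lxml`:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so nothing is fetched
sample_html = """
<table>
  <tr><td><img src="https://morvanzhou.github.io/static/img/course_cover/tf.jpg"></td></tr>
  <tr><td><img src="https://example.com/logo.png"></td></tr>
  <tr><td><img src="https://morvanzhou.github.io/static/img/course_cover/rl.jpg"></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, features="html.parser")

# Raw string; '$' requires the src value to end in ".jpg"
jpg_pattern = re.compile(r"\.jpg$")

# find_all keeps only <img> tags whose src attribute matches the pattern
jpg_links = [img["src"] for img in sample_soup.find_all("img", {"src": jpg_pattern})]

for src in jpg_links:
    print(src)
```

Anchoring the pattern with `$` is a design choice of this sketch: it rejects URLs that merely contain `.jpg` somewhere in the middle, which the unanchored pattern would accept.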
# '.*?' matches any href value, so every <a> tag with an href attribute is returned
course_links = soup.find_all('a', {'href': re.compile('.*?')})
for link in course_links:
    print(link['href'])
"""
https://morvanzhou.github.io/
https://morvanzhou.github.io/tutorials/scraping
https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
"""
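A narrower pattern selects only some of the links. The sketch below assumes a hand-written HTML string (`demo_html`, hypothetical, built from a few of the URLs printed above) and keeps only the links whose href contains `/tutorials/`:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the link text is made up for illustration
demo_html = """
<a href="https://morvanzhou.github.io/">Home</a>
<a href="https://morvanzhou.github.io/tutorials/scraping">Scraping</a>
<a href="https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/">TensorFlow</a>
"""

demo_soup = BeautifulSoup(demo_html, features="html.parser")

# Keep only <a> tags whose href contains the "/tutorials/" path segment
tutorial_pattern = re.compile(r"/tutorials/")
tutorial_links = [a["href"] for a in demo_soup.find_all("a", {"href": tutorial_pattern})]

for href in tutorial_links:
    print(href)
```

The pattern does not need a leading `.*?` because BeautifulSoup searches for the pattern anywhere in the attribute value.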