import re
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])
# Page title is:  Scraping tutorial 1 | 莫烦Python
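Next, find the paragraph <p> in the middle of the page. In the HTML this paragraph is interspersed with tabs and newlines, so pass flags=re.DOTALL. With this flag, "." also matches newline characters (it already matches tabs), so the pattern can span multiple lines.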
res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL if multi line
print("\nPage paragraph is: ", res[0])
# Page paragraph is:
#   这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
#   <a href="https://morvanzhou.github.io/tutorials/scraping">爬虫教程</a> 中的简单测试.
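To see what re.DOTALL changes, here is a minimal sketch that runs the same pattern with and without the flag. The `sample` string is a made-up snippet used only for illustration, not the tutorial page itself.

```python
import re

# Hypothetical HTML snippet (not the real page): a paragraph spanning several lines.
sample = "<p>\n\tfirst line\n\tsecond line\n</p>"

# Without re.DOTALL, "." stops at the newline, so the pattern never reaches </p>.
without_flag = re.findall(r"<p>(.*?)</p>", sample)

# With re.DOTALL, "." also matches "\n", so the whole paragraph body is captured.
with_flag = re.findall(r"<p>(.*?)</p>", sample, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['\n\tfirst line\n\tsecond line\n']
```

The captured text keeps its tabs and newlines. Calling `.strip()` on the result is one simple way to tidy it before printing.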
Find all the links:
res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)
# All links:
# ['https://morvanzhou.github.io/static/img/description/tab_icon.png',
#  'https://morvanzhou.github.io/',
#  'https://morvanzhou.github.io/tutorials/scraping']
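Note that the pattern above captures every href attribute, which is why the first result is an icon image, not a page link. If only anchor links are wanted, one possible refinement is to require the href to sit inside an <a> tag. The sketch below uses a hypothetical `sample` string and example.com URLs as stand-ins for the downloaded page. A regex like this is a rough approximation and will miss unusual markup, such as single-quoted attributes.

```python
import re

# Hypothetical snippet standing in for the downloaded page (URLs are placeholders).
sample = (
    '<link rel="icon" href="https://example.com/icon.png">\n'
    '<a href="https://example.com/">home</a>\n'
    '<a class="nav" href="https://example.com/docs">docs</a>'
)

# Every href attribute, regardless of which tag it belongs to.
all_hrefs = re.findall(r'href="(.*?)"', sample)

# Only href attributes that appear inside an <a ...> opening tag.
anchor_hrefs = re.findall(r'<a\s[^>]*?href="(.*?)"', sample)

print(all_hrefs)
# ['https://example.com/icon.png', 'https://example.com/', 'https://example.com/docs']
print(anchor_hrefs)
# ['https://example.com/', 'https://example.com/docs']
```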