【Python 網頁爬蟲入門實戰】 (Hands-On Introduction to Python Web Crawling) Ch. 2 Homework

2018-12-21 / 2019-07-15, Jumping

Goal 1:
・Count how many blog posts are on sample page 1
・Count how many image URLs on sample page 1 contain the string 'crawler'

import requests
from bs4 import BeautifulSoup
import re

resp = requests.get("http://blog.castman.net/web-crawler-tutorial/ch2/blog/blog.html")
soup = BeautifulSoup(resp.text, "html5lib")

# Treat every <h4> as a post title: collect it and print its text.
titles = []
for t in soup.find_all("h4"):
    titles.append(t)
    print(t.text.strip())

print("This blog has " + str(len(titles)) + " posts")

# Count matching images with a separate counter.
# Without it, the final output would be the number of characters in that line
# rather than the intended number of posts or images.
img_len = 0

for img in soup.find_all("img", {"src": re.compile(".*crawler")}):
    print(img["src"])
    img_len += 1

print("This blog has " + str(img_len) + " images whose URL contains 'crawler'")

Output 1

Mac使用者
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析
This blog has 6 posts
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
This blog has 5 images whose URL contains 'crawler'
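
The comment about character counts in the code above comes down to what `len()` is applied to: on a list it gives the number of items, on a string it gives the number of characters. The sketch below shows the difference using only the standard library, with no network access. The `srcs` list is made up for illustration and is not taken from the real page. It also shows that a pattern applied with `search()` matches anywhere in the string, which is how Beautiful Soup's documentation describes regex filters being applied, so the leading `.*` in `".*crawler"` should be optional.

```python
import re

# Hypothetical src values for illustration; the real page's attributes may differ.
srcs = [
    "static/python_crawler.png",
    "static/logo.png",
    "static/python_crawler.png",
]

# search() looks for the pattern anywhere in the string,
# so "crawler" behaves like ".*crawler" here.
pattern = re.compile("crawler")
matching = [src for src in srcs if pattern.search(src)]

message = "Images whose URL contains 'crawler': "

print(len(matching))  # number of items in the list (the count we want)
print(len(message))   # number of characters in the string (not a count of images)
print(message + str(len(matching)))
```

Because `find_all` returns a list-like result, calling `len()` on that result directly would be another way to get the count without a manual counter.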

Goal 2:
・Count how many courses are listed on sample page 2

import requests
from bs4 import BeautifulSoup

resp = requests.get("http://blog.castman.net/web-crawler-tutorial/ch2/table/table.html")
soup = BeautifulSoup(resp.text, "html5lib")

titles = []

# Skip the first <tr> (the header row); the first <td> of each remaining row is the course name.
for row in soup.find_all("tr")[1:]:
    tds = row.find_all("td")[0]
    titles.append(tds)
    print(tds.text)

print("This page has " + str(len(titles)) + " courses")

Output 2

初心者 – Python入門
Python 網頁爬蟲入門實戰
Python 機器學習入門實戰 (預計)
Python 資料科學入門實戰 (預計)
Python 資料視覺化入門實戰 (預計)
Python 網站架設入門實戰 (預計)
This page has 6 courses
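
The row-counting idea above (take the first `td` of every row after the header) can also be tried offline. The sketch below swaps Beautiful Soup for the standard library's `html.parser` so that it runs without extra packages. It is not the method used in this post, and `SAMPLE_HTML` and `FirstColumnParser` are hypothetical names and markup made up for illustration; the real page's structure may differ.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for the course table.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Course</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Course A</td><td>100</td></tr>
    <tr><td>Course B</td><td>200</td></tr>
    <tr><td>Course C</td><td>300</td></tr>
  </tbody>
</table>
"""


class FirstColumnParser(HTMLParser):
    """Collect the text of the first <td> in every <tr>."""

    def __init__(self):
        super().__init__()
        self.first_cells = []      # text of the first <td> of each data row
        self._td_index = -1        # position of the current <td> within its row
        self._in_first_td = False  # True while inside a row's first <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._td_index = -1    # new row: reset the cell counter
        elif tag == "td":
            self._td_index += 1
            if self._td_index == 0:
                self._in_first_td = True
                self.first_cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_first_td = False

    def handle_data(self, data):
        if self._in_first_td:
            self.first_cells[-1] += data


parser = FirstColumnParser()
parser.feed(SAMPLE_HTML)
course_names = [name.strip() for name in parser.first_cells]

print(course_names)
print("Number of courses:", len(course_names))
```

The header row in this sample uses `th` cells, so it contributes no `td` and needs no explicit skipping. On a page whose header row uses `td`, the first entry would have to be dropped, as the `[1:]` slice does in the code above.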
