douban-movie

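1. Preface

This project draws on the article 《爬取豆瓣电影中速度与激情8演员图片》 ("Scraping the cast photos of The Fate of the Furious from Douban Movie") by 布咯咯_rieuse. Original article: 爬取豆瓣电影中速度与激情8演员图片. I started by studying it, then imitated it, then improved on it, and I have written up that process here to share. Goal: scrape the photos and save each one to disk, using the Chinese name of the actor in the photo as the file name. Target page: 战狼2 (Wolf Warrior 2).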

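2. Analysis

This scraper uses the third-party requests library to fetch pages and the urllib.urlretrieve function to download files (in Python 3 the same function is urllib.request.urlretrieve). Parsing is done mainly with BeautifulSoup, with re regular expressions used in places.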
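The download step described above could be sketched roughly as follows in Python 3. This is a minimal sketch, not the repository's actual code: the function name `save_images`, its `fetch` parameter, and the `.jpg` suffix are assumptions made for illustration. It expects a list of (name, image URL) pairs like the one built later in this section.

```python
import os
from urllib.request import urlretrieve


def save_images(stars, directory, fetch=urlretrieve):
    """Download each (name, url) pair into `directory` as <name>.jpg.

    `fetch` defaults to urllib.request.urlretrieve. It is a parameter only
    so the function can be tried out without network access (hypothetical
    design choice, not from the original project).
    """
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    saved = []
    for name, img_url in stars:
        path = os.path.join(directory, name + '.jpg')
        fetch(img_url, path)  # urlretrieve(url, filename) writes the file to disk
        saved.append(path)
    return saved
```

With the names used in this README, a call would look like `save_images(stars, movie)`.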

Import the required libraries:

import re                      # used below to pull the image URL out of each <li>
import requests
from bs4 import BeautifulSoup

Fetch the page and parse it with BeautifulSoup, using the lxml parser:

html = requests.get(url).content      # url is the address of the target page
soup = BeautifulSoup(html, 'lxml')

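Take the film title from the <title> tag and use it as the directory the files are saved to:

movie = soup.title.string.strip().split(' ')[0]    # strip surrounding whitespace, keep the first word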
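Turning the title into an actual directory could look like the sketch below. The helper name `make_movie_dir`, the `base` parameter, and the `'unknown'` fallback are all hypothetical; the original only shows the title being split.

```python
import os


def make_movie_dir(title_text, base='.'):
    """Create a directory named after the first word of a <title> string."""
    # split() with no argument ignores leading and trailing whitespace and
    # collapses runs of spaces or newlines, so padded titles still work.
    parts = title_text.split()
    name = parts[0] if parts else 'unknown'  # fallback name is an assumption
    path = os.path.join(base, name)
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path
```

A call such as `make_movie_dir(soup.title.string)` would then return the path to save the images under.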

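Scrape the actors and their image URLs. From the figure below we can see that the middle list-wrapper block is the one that holds the information for all the actors, so we use BS to grab that middle block. The code is as follows: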

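tags = soup.find_all(class_='list-wrapper')    # find every element with the list-wrapper class
stars = []
for tag in tags[1].find_all('li'):             # tags[1] is the actors' block; walk the <li> tags under it
    title = tag.a['title'].split(' ')[0]       # actor's name, dropping the English name that follows it
    # The regex returns a list, so [0] takes the first match; dots are escaped and the match is non-greedy
    img_url = re.findall(r'https://img\d\.doubanio\.com/img/celebrity/medium/.*?\.jpg', str(tag))[0]
    stars.append([title, img_url])             # collect [name, image URL]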
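To see why the dots in the pattern are escaped and why the match should be non-greedy, here is a small standalone sketch. The HTML fragment is made up for illustration and may not match Douban's real markup; the helper name `first_image_url` is also hypothetical.

```python
import re

# Escaped dots match a literal '.', and the lazy \S+? stops at the first
# ".jpg" instead of running on to a later one in the same fragment.
IMG_RE = re.compile(
    r'https://img\d\.doubanio\.com/img/celebrity/medium/\S+?\.jpg')


def first_image_url(fragment):
    """Return the first celebrity image URL in an HTML fragment, or None."""
    match = IMG_RE.search(fragment)
    return match.group(0) if match else None


# Made-up fragment in the rough shape of one <li>; not copied from Douban.
sample_li = (
    '<li><a href="#" title="Name EnglishName">'
    '<div style="background-image: url(https://img3.doubanio.com/img/'
    'celebrity/medium/123.jpg)"></div></a>'
    '<img src="other.jpg"></li>'
)
url = first_image_url(sample_li)
```

Returning None instead of indexing with [0] also avoids an IndexError when an item has no matching image.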

3. See GitHub for the complete code
