# Beautiful Soup - A Python Library for Extracting Data from HTML or XML Files

**Repository Path**: komavideo/LearnBeautifulSoup

## Basic Information

- **Project Name**: Beautiful Soup - A Python Library for Extracting Data from HTML or XML Files
- **Description**: A Python library for extracting data from HTML or XML files
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 4
- **Forks**: 2
- **Created**: 2019-06-13
- **Last Updated**: 2021-03-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

[Python] Beauty Crawler - Download All Images from a Given URL
=============================

## Technologies Used

+ Beautiful Soup
+ sys, os, time, urllib

### Installing Beautiful Soup

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

```bash
$ pip install beautifulsoup4
```
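
Once the package is installed, a quick way to confirm it works is to parse a small inline HTML string. The sketch below is illustrative only: the `sample_html` markup and its image paths are made up rather than taken from this project. It uses the built-in `html.parser` backend, the same one `main.py` uses.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
sample_html = """
<html><body>
  <img src="/images/a.png" alt="first">
  <img alt="no source attribute">
  <img src="https://example.com/b.jpg">
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; Tag.get returns None
# when the attribute is missing instead of raising KeyError.
sources = [img.get("src") for img in soup.find_all("img")]
print(sources)
```

The second `<img>` has no `src`, so its entry is `None`. This is why `main.py` checks for a missing `src` before reading it.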

## Usage

```bash
$ py main.py "http://domain-name.com"
```

## Code (main.py)

```python
# py main.py "http://komavideo.com/"
# py main.py "https://news.goo.ne.jp/entertainment/talent/"
# py main.py "https://kakaku.com/game/"
# py main.py "https://techcrunch.com/"
# py main.py "https://oschina.net"

import sys, os, time
import urllib.request, urllib.error, urllib.parse
import pprint as pp
from bs4 import BeautifulSoup

userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"

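def download_img(url, dst_path):
    """Download url and save the response body to dst_path."""
    try:
        req = urllib.request.Request(url, headers={'User-Agent': userAgent})
        # Close the connection as soon as the body has been read
        with urllib.request.urlopen(req) as res:
            img_data = res.read()
        with open(dst_path, mode="wb") as f:
            f.write(img_data)
    except urllib.error.URLError as ex:
        # Report the failure and let the caller move on to the next image
        print(ex)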

if len(sys.argv) < 2:
    print("\n", "usage: py main.py [url]")
    sys.exit(1)

# Read the URL from the command line
url = sys.argv[1]

# Request the page
req = urllib.request.Request(url, headers={'User-Agent': userAgent})
html = urllib.request.urlopen(req)

# Parse the fetched HTML
soup = BeautifulSoup(html, "html.parser")
# print(soup.html)
# print(soup.text)

# Collect all <img> tags
img_list = soup.find_all('img')

# Normalize the image URLs into a new list of unique absolute URLs
url_list = []
for img in img_list:
    if img.get('src') is None:
        continue

    tmp = img['src'].strip()
    if not (tmp.startswith("https://") or tmp.startswith("http://")):
        # Resolve relative paths against the page URL
        tmp = urllib.parse.urljoin(url, tmp)

    if tmp not in url_list:
        url_list.append(tmp)
# pp.pprint(url_list, indent=4)

# Create the download directory if it does not exist yet
download_dir = './out'
os.makedirs(download_dir, exist_ok=True)

# Download the images
digits_width = len(str(len(url_list)))
count = 0
for url in url_list:
    count = count + 1

    # Build the file name: zero-padded counter (01, 02, ...) plus the URL's extension
    fmt_url = url
    if fmt_url.find("?") > -1:
        fmt_url = fmt_url[0:fmt_url.find("?")]
    file, ext = os.path.splitext(fmt_url)
    filename = str(count).zfill(digits_width) + ext
    if len(ext) == 0:
        filename = filename + ".png"

    dst_path = os.path.join(download_dir, filename)
    print(url)
    print("", "->", dst_path)

    time.sleep(0.1)  # Short delay between downloads to reduce server load
    download_img(url, dst_path)
```
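
The URL cleanup and file-naming steps in `main.py` can also be written as two small pure functions, which makes them easy to check without touching the network. This is a sketch under stated assumptions, not the project's own code: the helper names `resolve_image_urls` and `build_filename` are hypothetical, the example URLs are made up, and the manual search for `?` is swapped for `urllib.parse.urlsplit`, which separates the path from the query string.

```python
import os
from urllib.parse import urljoin, urlsplit


def resolve_image_urls(page_url, raw_srcs):
    """Resolve src values against page_url, skipping empties and duplicates (order kept)."""
    resolved = []
    for src in raw_srcs:
        if not src or not src.strip():
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        absolute = urljoin(page_url, src.strip())
        if absolute not in resolved:
            resolved.append(absolute)
    return resolved


def build_filename(index, total, image_url, default_ext=".png"):
    """Zero-padded sequence number plus the extension taken from the URL path."""
    width = len(str(total))
    # urlsplit(...).path excludes the query string, so "?size=large" is ignored.
    ext = os.path.splitext(urlsplit(image_url).path)[1]
    return str(index).zfill(width) + (ext or default_ext)


# Hypothetical inputs, for illustration only.
urls = resolve_image_urls(
    "https://example.com/gallery/index.html",
    ["a.png", "/img/b.jpg?size=large", "https://cdn.example.com/c", None, "a.png"],
)
names = [build_filename(i, len(urls), u) for i, u in enumerate(urls, start=1)]
print(urls)
print(names)
```

Keeping these steps free of network and file-system calls means the download loop only has to join each name with the output directory and call `download_img`.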

## Course Files

https://gitee.com/komavideo/LearnBeautifulSoup

## 小马视频 (Komavideo) Channel

http://komavideo.com