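Project Introduction

Captcha recognition: this project implements captcha recognition based on CNN5/DenseNet + BLSTM/LSTM + CTC. It covers training only; to deploy a trained model, see: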

https://github.com/kerlomz/captcha_platform (general-purpose web service, called via HTTP requests)

https://github.com/kerlomz/captcha_library_c (dynamic link library, called as a DLL, based on TensorFlow C++)

https://github.com/kerlomz/captcha_demo_csharp (called from C# source code, based on TensorFlowSharp)

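Many people ask me whether deploying recognition also requires a GPU. My answer is that it is completely unnecessary. The ideal setup is to train on a GPU and serve recognition on a CPU; if deployment also carried such a high cost, the project would have little practical value or real-world use. In my tests, one recognition takes about 30 ms on the lowest-tier Alibaba Cloud instance (1 core, 1 GB of memory) and about 10-15 ms on my i7-8700K.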

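Notes

How to train on a CPU:

This project installs the GPU build of TensorFlow by default, and training on a GPU is recommended. To train on a CPU instead, replace tensorflow-gpu==1.6.0 with tensorflow==1.6.0 in requirements.txt; nothing else needs to change.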
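
If you would rather script this swap than edit the file by hand, a minimal sketch could look like the following. The helper name `use_cpu_tensorflow` and the sample requirements text are made up for illustration; only the package name is rewritten and the version pin is left as it is.

```python
import re


def use_cpu_tensorflow(requirements_text):
    """Return the requirements text with tensorflow-gpu replaced by tensorflow."""
    # Match the package name only at the start of a line and only when it is
    # followed by a "==" version pin, so the pinned version is kept unchanged.
    return re.sub(r"(?m)^tensorflow-gpu(?===)", "tensorflow", requirements_text)


# Hypothetical requirements content, used only to demonstrate the helper.
sample = "numpy\ntensorflow-gpu==1.6.0\npillow\n"
converted = use_cpu_tensorflow(sample)
print(converted)
```

To apply it to the real file, read requirements.txt, pass its content through the helper, and write the result back.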
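About the LSTM network:

Make sure the width of the feature map that the CNN feeds into the LSTM is at least about 3 times the maximum number of characters, that is, time_step should be at least 3 times the maximum number of characters.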
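
As a rough illustration of this rule, the sketch below estimates time_step from the input width and checks it against the 3x guideline. The downsampling factor of 8 is an assumption chosen for the example; the real factor depends on the CNN you select and is not stated here.

```python
def time_steps(image_width, downsample_factor):
    """Width of the CNN feature map, i.e. the number of LSTM time steps."""
    return image_width // downsample_factor


def satisfies_rule(image_width, downsample_factor, max_chars, ratio=3):
    """Check that time_step is at least `ratio` times the maximum label length."""
    return time_steps(image_width, downsample_factor) >= ratio * max_chars


# Assumed example: input width 150 and a CNN that shrinks the width by 8.
print(time_steps(150, 8))         # 18
print(satisfies_rule(150, 8, 4))  # True, since 18 >= 12
print(satisfies_rule(150, 8, 8))  # False, since 18 < 24
```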
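Resolving the "No valid path found" error:

In model.yaml, change the Pretreatment->Resize parameter to a suitable value. Based on experience from training on around a hundred kinds of captchas, a fairly general value to try is Resize: [150, 50]. Alternatively, use tutorial.py (which generates the configuration files, packages the samples, and trains in one pass): fill in the training set path and run it.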
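
For background, this error typically means the sequence fed to CTC is too short for the label, which is why a wider Resize helps. The sketch below shows the general CTC constraint rather than anything specific to this project: every character needs one time step, and two identical neighbouring characters need a blank step between them. The function names are made up for illustration.

```python
def min_ctc_steps(label):
    """Minimum number of time steps CTC needs to emit `label`.

    Each character takes one step, and each pair of identical
    neighbouring characters needs one extra blank step in between.
    """
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats


def has_valid_path(label, time_steps):
    """True if a sequence of `time_steps` steps can be aligned to `label`."""
    return time_steps >= min_ctc_steps(label)


print(min_ctc_steps("abcd"))      # 4
print(min_ctc_steps("aabb"))      # 6
print(has_valid_path("aabb", 5))  # False
```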
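Changing parameters:

Remember that ModelName is the unique identifier bound to a model. If you change training parameters that affect the computation graph, such as ImageWidth, ImageHeight, Resize, CharSet, CNNNetwork, RecurrentNetwork, or HiddenNum, you must either delete the old files under the model path and retrain, or retrain under a new ModelName. Otherwise training resumes from the existing checkpoint by default.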
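
One way to guard against resuming with an incompatible graph is to compare the graph-affecting parameters before training. The sketch below is hypothetical: it assumes the parameters have already been loaded into flat dictionaries, which is a simplification of the nested YAML files, and `changed_graph_keys` is not part of the project.

```python
# Parameters named above as affecting the computation graph.
GRAPH_KEYS = ("ImageWidth", "ImageHeight", "Resize", "CharSet",
              "CNNNetwork", "RecurrentNetwork", "HiddenNum")


def changed_graph_keys(old_params, new_params):
    """Return the graph-affecting keys whose values differ between two configs."""
    return [k for k in GRAPH_KEYS if old_params.get(k) != new_params.get(k)]


old = {
    "ImageWidth": 150, "ImageHeight": 50, "Resize": [150, 50],
    "CharSet": "ALPHANUMERIC_LOWER", "CNNNetwork": "CNN5",
    "RecurrentNetwork": "BLSTM", "HiddenNum": 64,
}
new = dict(old, HiddenNum=128, CNNNetwork="DenseNet")

diff = changed_graph_keys(old, new)
if diff:
    print("Use a new ModelName or delete the old model files; changed:", diff)
```

An empty result means the saved model can be resumed safely as far as these parameters are concerned.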

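Preparation

If you plan to train on a GPU, install CUDA and cuDNN first. See the officially tested build configurations for version compatibility: https://www.tensorflow.org/install/install_sources#tested_source_configurations

Third-party prebuilt TensorFlow WHL packages can be downloaded from GitHub: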

https://github.com/fo40225/tensorflow-windows-wheel

CUDA download: https://developer.nvidia.com/cuda-downloads

cuDNN download: https://developer.nvidia.com/rdp/form/cudnn-download-survey (account registration required)

The author uses: CUDA 10 + cuDNN 7.3.1 + TensorFlow 1.12

Environment Setup

Install Python 3.6 (including pip)
Install virtualenv: pip3 install virtualenv

Create a dedicated virtual environment for this project:

virtualenv -p /usr/bin/python3 venv # venv is the name of the virtual environment.
cd venv/ # venv is the name of the virtual environment.
source bin/activate # to activate the current virtual environment.
cd captcha_trainer # captcha_trainer is the project path.

Install the project's dependencies: pip install -r requirements.txt

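Getting Started

1. Architecture and Workflow

This project relies on a training configuration, config.yaml, and a model configuration, model.yaml. When initializing the project, copy config_demo.yaml to config.yaml in the current directory, and do the same for model_demo.yaml. Alternatively, use tutorial.py to set up the model configuration automatically.

Training workflow: once both configuration files are ready, run trains.py. It reads the configuration, builds the neural network computation graph according to model.yaml, and trains with the parameters in config.yaml.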
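
The initialization step above (copying the two demo files) can be sketched as follows. `init_configs` is a hypothetical helper, not part of the project; it skips any target that already exists so an edited configuration is never overwritten.

```python
import os
import shutil


def init_configs(project_dir):
    """Copy each *_demo.yaml to its working name unless the target exists.

    Returns the list of files that were created.
    """
    created = []
    for demo, target in (("config_demo.yaml", "config.yaml"),
                         ("model_demo.yaml", "model.yaml")):
        src = os.path.join(project_dir, demo)
        dst = os.path.join(project_dir, target)
        if os.path.exists(src) and not os.path.exists(dst):
            shutil.copyfile(src, dst)
            created.append(target)
    return created
```

Calling `init_configs(".")` from the project root would prepare both files, after which `python trains.py` can be run as described.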

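A few suggestions on the training parameters in config.yaml:

BatchSize (training batch size) and TestBatchSize (test batch size) deserve attention. Adjust them to your GPU: with little GPU memory, keep both small. The defaults I provide assume 8 GB of GPU memory at 50% usage.
LearningRate also deserves attention; deep learning is largely a matter of parameter tuning. Most models can keep the default without adjustment. For models that need higher accuracy, you can first use 0.01 to converge quickly and, once accuracy reaches roughly 95%, switch to 0.001 or 0.0001 to refine it.
TestSetNum (size of the test set) is provided for lazy people (myself included): it splits the given number of samples off the training set as a test set. There is one precondition: the test set must be random. Some people open the folder in Windows Explorer and drag-select a few hundred files, which are sorted by name by default. If the names are the labels, the selection is not random, and the test set may well contain only images labeled between 0 and 3, which can prevent training from ever converging.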
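
To make the randomness requirement concrete, here is a minimal sketch of splitting a test set off by shuffling first, instead of taking the first files in name order. It illustrates the idea only and is not the project's own splitting code; the file names are generated for the example.

```python
import random


def split_train_test(file_names, test_set_num, seed=None):
    """Shuffle the file names, then split off `test_set_num` of them for testing."""
    rng = random.Random(seed)
    shuffled = list(file_names)
    rng.shuffle(shuffled)
    return shuffled[test_set_num:], shuffled[:test_set_num]


# Made-up sample names: ten labels, ten files each, sorted by label.
names = ["{}_{:03d}.jpg".format(label, i)
         for label in "0123456789" for i in range(10)]

train, test = split_train_test(names, 30, seed=42)
print(len(train), len(test))  # 70 30

# Taking names[:30] instead would yield only labels 0 to 2.
print(sorted({name[0] for name in names[:30]}))  # ['0', '1', '2']
```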

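TrainRegex and TestRegex are matched as regular expressions. When collecting samples, please keep the file naming as consistent as possible with my examples; for regex questions, search online. If your files are named like 1111.jpg, here is a script for batch renaming: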

import re
import os
import hashlib

# Path to the training set
root = r"D:\TrainSet\***"
all_files = os.listdir(root)

for file in all_files:
    old_path = os.path.join(root, file)

    # Skip files that have already been renamed
    if len(file.split(".")[0]) > 32:
        continue

    # Rename to <label>_<MD5 of the file contents>.<image extension>
    with open(old_path, "rb") as f:
        _id = hashlib.md5(f.read()).hexdigest()
    name, ext = os.path.splitext(file)

    # Duplicate labels produce names such as "abcd (1).jpg"; strip the " (1)" part
    name = re.sub(r" \(\d+\)", "", name)
    new_path = os.path.join(root, "{}_{}{}".format(name, _id, ext))
    print(new_path)
    os.rename(old_path, new_path)

2. Configuration

config.yaml - System Config

# - requirements.txt  -  GPU: tensorflow-gpu, CPU: tensorflow
# - If you use the GPU version, you need to install some additional applications.
# TrainRegex and TestRegex: Default matching apple_20181010121212.jpg file.
# - The Default is .*?(?=_.*\.)
# TrainsPath and TestPath: The local absolute path of your training and testing set.
# TestSetNum: This is an optional parameter that is used when you want to extract some of the test set
# - from the training set when you are not preparing the test set separately.
System:
  DeviceUsage: 0.5
  TrainRegex: '.*?(?=_)'
  TestRegex: '.*?(?=_)'
  TestSetNum: 300

# CNNNetwork: [CNN5, DenseNet]
# RecurrentNetwork: [BLSTM, LSTM]
# - The recommended configuration is CNN5+BLSTM / DenseNet+BLSTM
# HiddenNum: [64, 128, 256]
# - This parameter indicates the number of nodes used to remember and store past states.
NeuralNet:
  CNNNetwork: CNN5
  RecurrentNetwork: BLSTM
  HiddenNum: 64
  KeepProb: 0.99

# SavedSteps: A Session.run() execution is called a step.
# - Used to save training progress. Default value is 100.
# ValidationSteps: Used to calculate accuracy.
# TestNum: The number of samples for each test batch.
# - A test for every saved steps.
# CompileAcc: When the accuracy reaches the set threshold,
# - the model will be compiled together each time it is archived.
# - Available for specific usage scenarios.
# EndAcc: Finish the training when the accuracy reaches [EndAcc*100]%.
# EndEpochs: Finish the training when the epoch is greater than the defined epoch.
# PreprocessCollapseRepeated: If True, then a preprocessing step runs
# - before loss calculation, wherein repeated labels passed to the loss
# - are merged into single labels.  This is useful if the training labels come
# - from, e.g., forced alignments and therefore have unnecessary repetitions.
# CTCMergeRepeated: If False, then deep within the CTC calculation,
# - repeated non-blank labels will not be merged and are interpreted
# - as individual labels. This is a simplified (non-standard) version of CTC.
Trains:
  SavedSteps: 100
  ValidationSteps: 500
  EndAcc: 0.98
  EndEpochs: 1
  BatchSize: 64
  TestBatchSize: 300
  LearningRate: 0.01
  DecayRate: 0.98
  DecaySteps: 100000
  PreprocessCollapseRepeated: False
  CTCMergeRepeated: True
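
To see how LearningRate, DecayRate, and DecaySteps could interact, the sketch below assumes they feed a standard exponential decay schedule of the form used by TensorFlow's tf.train.exponential_decay. This README does not confirm which schedule the trainer applies, so treat the formula as an assumption.

```python
def decayed_learning_rate(step, learning_rate=0.01, decay_rate=0.98,
                          decay_steps=100000, staircase=False):
    """Exponential decay: learning_rate * decay_rate ** (step / decay_steps)."""
    if staircase:
        # Decay in discrete jumps, once per full `decay_steps` interval.
        exponent = step // decay_steps
    else:
        exponent = step / decay_steps
    return learning_rate * decay_rate ** exponent


for step in (0, 50000, 100000, 200000):
    print(step, decayed_learning_rate(step))
```

Under this assumption and the values shown above, the rate would drop by only 2% per 100000 steps, so the manual switch from 0.01 to 0.001 suggested earlier has a far larger effect than the decay itself.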

Some common examples of TrainRegex:

i. apple_20181010121212.jpg

.*?(?=_.*\.)

ii. apple.png

.*?(?=\.)
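
To check a pattern against your own file names before training, a small sketch with Python's re module is shown below. It assumes the label is taken as the first match of the pattern in the file name; `extract_label` is a hypothetical helper, not the project's function.

```python
import re


def extract_label(file_name, pattern):
    """Return the first match of `pattern` in the file name, or None."""
    match = re.search(pattern, file_name)
    return match.group() if match else None


print(extract_label("apple_20181010121212.jpg", r".*?(?=_.*\.)"))  # apple
print(extract_label("apple.png", r".*?(?=\.)"))                    # apple
print(extract_label("apple_20181010121212.jpg", r".*?(?=_)"))      # apple
print(extract_label("apple", r".*?(?=\.)"))                        # None
```

Each pattern is a lazy match followed by a lookahead, so the label ends just before the first underscore or dot that satisfies the lookahead.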

model.yaml - Model Config

# Sites: A bindable parameter used to select a model. 
# - If this parameter is defined, 
# - it can be identified by using the model_site parameter 
# - to identify a model that is inconsistent with the actual size of the current model.
# ModelName: Corresponding to the model file in the model directory,
# - such as YourModelName.pb, fill in YourModelName here.
# ModelType: This parameter is also used to locate the model. 
# - The difference from the sites is that if there is no corresponding site, 
# - the size will be used to assign the model. 
# - If a model of the corresponding size and corresponding to the ModelType is not found, 
# - the model belonging to the category is preferentially selected.
# CharSet: Provides a default optional built-in solution:
# - [ALPHANUMERIC, ALPHANUMERIC_LOWER, ALPHANUMERIC_UPPER,
# -- NUMERIC, ALPHABET_LOWER, ALPHABET_UPPER, ALPHABET]
# - Or you can use your own customized character set like: ['a', '1', '2'].
# CharExclude: CharExclude should be a list, like: ['a', '1', '2']
# - which is convenient for users to freely combine character sets.
# - If you don't want to define the character set manually,
# - you can choose a built-in character set
# - and set the characters to be excluded by CharExclude parameter.
Model:
  Sites: []
  ModelName: YourModelName
  ModelType: 150x50
  CharSet: ALPHANUMERIC_LOWER
  CharExclude: []
  CharReplace: {}
  ImageWidth: 150
  ImageHeight: 50

# Binaryzation: [-1: Off, >0 and < 255: On].
# Smoothing: [-1: Off, >0: On].
# Blur: [-1: Off, >0: On].
Pretreatment:
  Binaryzation: -1
  Smoothing: -1
  Blur: -1
  Resize: [150, 50]
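
The way CharSet and CharExclude combine can be pictured with the sketch below. The contents of the built-in sets are an assumption inferred from their names (for example, ALPHANUMERIC_LOWER as digits plus lowercase letters); the project may define or order them differently, and `resolve_charset` is a hypothetical helper.

```python
import string

# Assumed contents of the built-in character sets, inferred from their names.
BUILT_IN = {
    "NUMERIC": string.digits,
    "ALPHABET_LOWER": string.ascii_lowercase,
    "ALPHABET_UPPER": string.ascii_uppercase,
    "ALPHABET": string.ascii_letters,
    "ALPHANUMERIC_LOWER": string.digits + string.ascii_lowercase,
    "ALPHANUMERIC_UPPER": string.digits + string.ascii_uppercase,
    "ALPHANUMERIC": string.digits + string.ascii_letters,
}


def resolve_charset(char_set, char_exclude=()):
    """Expand a built-in name or a custom list, then drop excluded characters."""
    chars = BUILT_IN[char_set] if isinstance(char_set, str) else char_set
    excluded = set(char_exclude)
    return [c for c in chars if c not in excluded]


# Built-in set minus characters that are easy to confuse.
chars = resolve_charset("ALPHANUMERIC_LOWER", ["0", "o", "1", "l"])
print(len(chars))  # 32

# A custom character set works the same way.
print(resolve_charset(["a", "1", "2"], ["1"]))  # ['a', '2']
```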

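Tools

Preprocessing preview tool (only supports viewing packaged training sets): python -m tools.preview
Beginner's guide (only supports character set recommendations; I find it of little use, so feel free to ignore it): python -m tools.navigator
PyInstaller one-step packaging (poorly supported for training, but works well for packaging a deployment):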
```
pip install pyinstaller
python -m tools.package
```

Running

From a command line or terminal: python trains.py
In PyCharm: right-click and choose Run

License

This project follows the SATA principle: if you use it, please don't be stingy with a star.

Detailed Guide

An article I wrote specifically for this project; comments are welcome.

https://www.jianshu.com/p/80ef04b16efc

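Donations

The image saved from my phone really is this large, which is a bit awkward. Thank you for your support; it gives me more motivation to contribute to the community.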

江北青衣 / captcha_trainer
