Building Machine Learning Pipelines

Code repository for the O'Reilly publication "Building Machine Learning Pipelines" by Hannes Hapke & Catherine Nelson

Set up the demo project

Download the initial dataset. From the root of this repository, execute

python3 utils/download_dataset.py

After this script runs, you should have a data folder containing the file consumer_complaints_with_narrative.csv.
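
A quick way to confirm the download worked is to check for the file from Python. This is a minimal sketch, not part of the repository: the helper name `dataset_is_present` is hypothetical, and it assumes the data folder is created directly under the repository root.

```python
from pathlib import Path

# Location described above, relative to the repository root (assumed).
EXPECTED = Path("data") / "consumer_complaints_with_narrative.csv"


def dataset_is_present(repo_root="."):
    """Return True if the downloaded CSV exists and is not empty."""
    path = Path(repo_root) / EXPECTED
    return path.is_file() and path.stat().st_size > 0
```

Calling `dataset_is_present()` from the repository root should return `True` once the download script has finished.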

The dataset

The data used in this example project can be downloaded with the script above. It comes from a public dataset of consumer complaints collected by the US Consumer Financial Protection Bureau. To reproduce our edited dataset, carry out the following steps:

  • Download the dataset from https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data
  • Rename the columns to [ "product", "sub_product", "issue", "sub_issue", "consumer_complaint_narrative", "company", "state", "zip_code", "company_response", "timely_response", "consumer_disputed"]
  • Filter the dataset to remove rows with missing data in the consumer_complaint_narrative column
  • In the consumer_disputed column, map Yes to 1 and No to 0
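
The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script the authors used: the raw header names in `RENAME` are assumed and should be checked against the header row of the file you download, the function names are hypothetical, and the sketch keeps only the renamed columns.

```python
import csv

# Assumed mapping from raw export headers to the project's column names.
# Verify the keys against the header row of the downloaded file.
RENAME = {
    "Product": "product",
    "Sub-product": "sub_product",
    "Issue": "issue",
    "Sub-issue": "sub_issue",
    "Consumer complaint narrative": "consumer_complaint_narrative",
    "Company": "company",
    "State": "state",
    "ZIP code": "zip_code",
    "Company response to consumer": "company_response",
    "Timely response?": "timely_response",
    "Consumer disputed?": "consumer_disputed",
}

DISPUTED_TO_INT = {"Yes": 1, "No": 0}


def clean_rows(rows, rename=RENAME):
    """Yield renamed rows that have a narrative, with the label mapped to 0/1."""
    for raw in rows:
        # Rename the columns; headers absent from the input become "".
        row = {new: raw.get(old, "") for old, new in rename.items()}
        # Drop rows with a missing narrative.
        if not (row["consumer_complaint_narrative"] or "").strip():
            continue
        # Map Yes -> 1 and No -> 0; any other value is left unchanged.
        label = row["consumer_disputed"]
        row["consumer_disputed"] = DISPUTED_TO_INT.get(label, label)
        yield row


def clean_csv(src, dst):
    """Read a raw export from file object `src` and write the cleaned CSV to `dst`."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(RENAME.values()))
    writer.writeheader()
    writer.writerows(clean_rows(reader))
```

To run it on a real export, open the input and output files with `newline=""` and pass both file objects to `clean_csv`.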

Pre-pipeline experiment

Before building our TFX pipeline, we experimented with different feature engineering approaches and model architectures. The notebooks in this folder preserve those experiments. We then refactored the code into the interactive pipeline described below.

Interactive pipeline

The interactive-pipeline folder contains a full interactive TFX pipeline for the consumer complaint data.

Full pipelines with Apache Beam, Apache Airflow, Kubeflow Pipelines, GCP

The pipelines folder contains complete pipelines for the various orchestrators. See Chapters 11 and 12 for full details.

Chapters

The following subfolders contain stand-alone code for individual chapters.

Model analysis

Chapter 7. Stand-alone code for TFMA, Fairness Indicators, and the What-If Tool. Note that these notebooks will not work in JupyterLab.

Advanced TFX

Chapter 10. Notebook outlining the implementation of custom TFX components, both from scratch and by inheriting existing functionality. Presented at the Apache Beam Summit 2020.

Data privacy

Chapter 14. Code for training a differentially private version of the demo project. Note that the TF-Privacy module only supports TF 1.x as of June 2020.

Version notes

  • As of 9/14/20, TFX does not support Python 3.8

MIT License

Copyright (c) 2020 Hannes Hapke

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
