Code repository for the O'Reilly publication "Building Machine Learning Pipelines" by Hannes Hapke & Catherine Nelson
Download the initial dataset. From the root of this repository, execute
python3 utils/download_dataset.py
After this script runs, you should have a data folder containing the file consumer_complaints_with_narrative.csv.
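As a quick sanity check after the download, something like the following sketch could confirm that the file landed where expected. The path is assembled from the folder and file names mentioned above; the helper name `dataset_is_present` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Assumed location, built from the folder and file names given in this README.
DATA_FILE = Path("data") / "consumer_complaints_with_narrative.csv"


def dataset_is_present(path: Path = DATA_FILE) -> bool:
    """Return True if the CSV exists as a regular file and is not empty."""
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    if dataset_is_present():
        print(f"Found dataset at {DATA_FILE}")
    else:
        print("Dataset missing; rerun utils/download_dataset.py from the repository root")
```

Run it from the root of the repository so that the relative path resolves correctly.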
The data that we use in this example project can be downloaded using the script above. It comes from a public dataset of customer complaints collected by the US Consumer Financial Protection Bureau. If you would like to reproduce our edited dataset, carry out the following steps:
Rename the columns to [ "product", "sub_product", "issue", "sub_issue", "consumer_complaint_narrative", "company", "state", "zip_code", "company", "company_response", "timely_response", "consumer_disputed"].
Filter the dataset to remove rows with missing data in the consumer_complaint_narrative column.
In the consumer_disputed column, map Yes to 1 and No to 0.
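The row-level part of these steps can be sketched with the standard library alone. This is a minimal illustration, not the script used to build the published dataset: it assumes the columns already carry the names listed above, that the step for consumer_complaint_narrative means dropping rows whose narrative is empty, and that consumer_disputed values other than Yes or No are left unchanged. The function name `prepare_rows` is hypothetical.

```python
NARRATIVE = "consumer_complaint_narrative"
DISPUTED = "consumer_disputed"
LABEL_MAP = {"Yes": 1, "No": 0}


def prepare_rows(rows):
    """Yield cleaned copies of `rows` (dicts keyed by column name).

    Rows with a missing or blank narrative are skipped (assumption), and
    consumer_disputed is mapped from Yes/No to 1/0. Any other value in
    that column is passed through untouched (assumption).
    """
    for row in rows:
        narrative = (row.get(NARRATIVE) or "").strip()
        if not narrative:
            continue
        cleaned = dict(row)
        disputed = row.get(DISPUTED)
        cleaned[DISPUTED] = LABEL_MAP.get(disputed, disputed)
        yield cleaned


# Small in-memory example; the values are made up for illustration.
sample = [
    {NARRATIVE: "Charged twice for one payment", DISPUTED: "Yes"},
    {NARRATIVE: "", DISPUTED: "No"},
    {NARRATIVE: "Late fee applied in error", DISPUTED: "No"},
]
cleaned_rows = list(prepare_rows(sample))
```

The same function accepts the rows produced by `csv.DictReader`, so it can be applied to the raw export once the header has been renamed.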
Before building our TFX pipeline, we experimented with different feature engineering approaches and model architectures. The notebooks in this folder preserve those experiments; we then refactored the code into the interactive pipeline below.
The interactive-pipeline folder contains a full interactive TFX pipeline for the consumer complaint data.
The pipelines folder contains complete pipelines for the various orchestrators. See Chapters 11 and 12 for full details.
The following subfolders contain stand-alone code for individual chapters.
Chapter 7. Stand-alone code for TFMA, Fairness Indicators, What-If Tool. Note that these notebooks will not work in JupyterLab.
Chapter 10. Notebook outlining the implementation of custom TFX components from scratch and by inheriting existing functionality. Presented at the Apache Beam Summit 2020.
Chapter 14. Code for training a differentially private version of the demo project. Note that the TF-Privacy module only supports TF 1.x as of June 2020.