# analytics-zoo
**Repository Path**: fcy_gitee/analytics-zoo
## Basic Information
- **Project Name**: analytics-zoo
- **Description**: Distributed Tensorflow, Keras and BigDL on Apache Spark
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-07-08
- **Last Updated**: 2020-12-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README

_A unified analytics + AI platform for **distributed TensorFlow, Keras and BigDL on Apache Spark**_
---
## What is Analytics Zoo?
__Analytics Zoo__ provides a unified analytics + AI platform that seamlessly unites *__Spark, TensorFlow, Keras and BigDL__* programs into an integrated pipeline; the entire pipeline can then transparently scale out to a large Hadoop/Spark cluster for distributed training or inference.
- _Data wrangling and analysis using PySpark_
- _Deep learning model development using TensorFlow or Keras_
- _Distributed training/inference on Spark and BigDL_
- _All within a single unified pipeline and in a user-transparent fashion!_
In addition, Analytics Zoo provides a rich set of analytics and AI support for the end-to-end pipeline, including:
- *Easy-to-use abstractions and APIs* (e.g., transfer learning support, autograd operations, Spark DataFrame and ML pipeline support, online model serving API, etc.)
- *Common feature engineering operations* (for image, text, 3D image, etc.)
- *Built-in deep learning models* (e.g., object detection, image classification, text classification, recommendation, anomaly detection, text matching, sequence to sequence etc.)
- *Reference use cases* (e.g., anomaly detection, sentiment analysis, fraud detection, image similarity, etc.)
## How to use Analytics Zoo?
- To get started, please refer to the [Python install guide](https://analytics-zoo.github.io/master/#PythonUserGuide/install/) or [Scala install guide](https://analytics-zoo.github.io/master/#ScalaUserGuide/install/).
- For running distributed TensorFlow/Keras on Spark and BigDL, please refer to the quick start [here](#distributed-tensorflow-and-keras-on-sparkbigdl) and the details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/tensorflow/).
- For more information, you may refer to the [Analytics Zoo document website](https://analytics-zoo.github.io/master/).
- For additional questions and discussions, you can join the [Google User Group](https://groups.google.com/forum/#!forum/bigdl-user-group) (or subscribe to the [Mail List](mailto:bigdl-user-group+subscribe@googlegroups.com)).
---
## Overview
- [Distributed TensorFlow and Keras on Spark/BigDL](#distributed-tensorflow-and-keras-on-sparkbigdl)
- Data wrangling and analysis using PySpark
- Deep learning model development using TensorFlow or Keras
- Distributed training/inference on Spark and BigDL
- All within a single unified pipeline and in a user-transparent fashion!
- [High level abstractions and APIs](#high-level-abstractions-and-apis)
- [Transfer learning](#transfer-learning): customize pretrained model for *feature extraction or fine-tuning*
- [`autograd`](#autograd): build custom layer/loss using *auto differentiation operations*
- [`nnframes`](#nnframes): native deep learning support in *Spark DataFrames and ML Pipelines*
- [Model serving](#model-serving): productionize *model serving and inference* using [POJO](https://en.wikipedia.org/wiki/Plain_old_Java_object) APIs
- [Built-in deep learning models](#built-in-deep-learning-models)
- [Object detection API](#object-detection-api): high-level API and pretrained models (e.g., SSD and Faster-RCNN) for *object detection*
- [Image classification API](#image-classification-api): high-level API and pretrained models (e.g., VGG, Inception, ResNet, MobileNet, etc.) for *image classification*
- [Text classification API](#text-classification-api): high-level API and pre-defined models (using CNN, LSTM, etc.) for *text classification*
- [Recommendation API](#recommendation-api): high-level API and pre-defined models (e.g., Neural Collaborative Filtering, Wide and Deep Learning, etc.) for *recommendation*
- [Anomaly detection API](#anomaly-detection-api): high-level API and pre-defined models based on LSTM for *anomaly detection*
- [Text matching API](#text-matching-api): high-level API and pre-defined KNRM model for *text matching*
- [Sequence to sequence API](#sequence-to-sequence-api): high-level API and pre-defined models for *sequence to sequence*
- [Reference use cases](#reference-use-cases): a collection of end-to-end *reference use cases* (e.g., anomaly detection, sentiment analysis, fraud detection, image augmentation, object detection, variational autoencoder, etc.)
- [Docker images and builders](#docker-images-and-builders)
- [Analytics-Zoo in Docker](#analytics-zoo-in-docker)
- [How to build it](#how-to-build-it)
- [How to use the image](#how-to-use-the-image)
- [Notice](#notice)
## _Distributed TensorFlow and Keras on Spark/BigDL_
To make it easy to build and productionize deep learning applications for Big Data, Analytics Zoo provides a unified analytics + AI platform that seamlessly unites Spark, TensorFlow, Keras and BigDL programs into an integrated pipeline (as illustrated below), which can then transparently run on a large-scale Hadoop/Spark cluster for distributed training and inference. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/tensorflow/))
1. Data wrangling and analysis using PySpark
```python
from zoo import init_nncontext
from zoo.pipeline.api.net import TFDataset

sc = init_nncontext()

# Each record in the train_rdd consists of a list of NumPy ndarrays
train_rdd = sc.parallelize(file_list) \
    .map(lambda x: read_image_and_label(x)) \
    .map(lambda image_label: decode_to_ndarrays(image_label))

# TFDataset represents a distributed set of elements,
# in which each element contains one or more TensorFlow Tensor objects.
dataset = TFDataset.from_rdd(train_rdd,
                             names=["features", "labels"],
                             shapes=[[28, 28, 1], [1]],
                             types=[tf.float32, tf.int32],
                             batch_size=BATCH_SIZE)
```
2. Deep learning model development using TensorFlow
```python
import tensorflow as tf

slim = tf.contrib.slim

images, labels = dataset.tensors
labels = tf.squeeze(labels)
with slim.arg_scope(lenet.lenet_arg_scope()):
    logits, end_points = lenet.lenet(images, num_classes=10, is_training=True)

loss = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels))
```
3. Distributed training on Spark and BigDL
```python
from zoo.pipeline.api.net import TFOptimizer
from bigdl.optim.optimizer import MaxIteration, Adam, MaxEpoch, TrainSummary
optimizer = TFOptimizer.from_loss(loss, Adam(1e-3))
optimizer.set_train_summary(TrainSummary("/tmp/az_lenet", "lenet"))
optimizer.optimize(end_trigger=MaxEpoch(5))
```
4. Alternatively, using Keras APIs for model development and distributed training
```python
from zoo.pipeline.api.keras.models import *
from zoo.pipeline.api.keras.layers import *
model = Sequential()
model.add(Reshape((1, 28, 28), input_shape=(28, 28, 1)))
model.add(Convolution2D(6, 5, 5, activation="tanh", name="conv1_5x5"))
model.add(MaxPooling2D())
model.add(Convolution2D(12, 5, 5, activation="tanh", name="conv2_5x5"))
model.add(MaxPooling2D())
model.add(Flatten())
model.add(Dense(100, activation="tanh", name="fc1"))
model.add(Dense(class_num, activation="softmax", name="fc2"))
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam')
model.fit(train_rdd, batch_size=BATCH_SIZE, nb_epoch=5)
```
## _High level abstractions and APIs_
Analytics Zoo provides a set of easy-to-use, high-level abstractions and APIs that natively support transfer learning, autograd and custom layer/loss, Spark DataFrames and ML Pipelines, online model serving, etc.
### _Transfer learning_
Using the high level transfer learning APIs, you can easily customize pretrained models for *feature extraction or fine-tuning*. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/transferlearning/))
1. Load an existing model (pretrained in Caffe)
```python
from zoo.pipeline.api.net import *
full_model = Net.load_caffe(def_path, model_path)
```
2. Remove the last few layers
```python
# create a new model by removing layers after pool5/drop_7x7_s1
model = full_model.new_graph(["pool5/drop_7x7_s1"])
```
3. Freeze the first few layers
```python
# freeze layers from input to pool4/3x3_s2 inclusive
model.freeze_up_to(["pool4/3x3_s2"])
```
4. Add a few new layers
```python
from zoo.pipeline.api.keras.layers import *
from zoo.pipeline.api.keras.models import *
inputs = Input(name="input", shape=(3, 224, 224))
inception = model.to_keras()(inputs)
flatten = Flatten()(inception)
logits = Dense(2)(flatten)
newModel = Model(inputs, logits)
```
### _`autograd`_
`autograd` provides automatic differentiation for math operations, so that you can easily build your own *custom loss and layer* (in both Python and Scala), as illustrated below. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/autograd/))
1. Define model using Keras-style API and `autograd`
```python
import zoo.pipeline.api.autograd as A
from zoo.pipeline.api.keras.layers import *
from zoo.pipeline.api.keras.models import *
input = Input(shape=[2, 20])
features = TimeDistributed(layer=Dense(30))(input)
f1 = features.index_select(1, 0)
f2 = features.index_select(1, 1)
diff = A.abs(f1 - f2)
model = Model(input, diff)
```
2. Optionally define custom loss function using `autograd`
```python
def mean_absolute_error(y_true, y_pred):
    return A.mean(A.abs(y_true - y_pred), axis=1)
```
3. Train model with *custom loss function*
```python
model.compile(optimizer=SGD(), loss=mean_absolute_error)
model.fit(x=..., y=...)
```
### _`nnframes`_
`nnframes` provides *native deep learning support in Spark DataFrames and ML Pipelines*, so that you can easily build complex deep learning pipelines in just a few lines, as illustrated below. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/nnframes/))
1. Initialize *NNContext* and load images into *DataFrames* using `NNImageReader`
```python
from zoo.common.nncontext import *
from zoo.pipeline.nnframes import *
from zoo.feature.image import *
sc = init_nncontext()
imageDF = NNImageReader.readImages(image_path, sc)
```
2. Process loaded data using *DataFrames transformations*
```python
from pyspark.sql.functions import col, udf

getName = udf(lambda row: ...)
getLabel = udf(lambda name: ...)
df = imageDF.withColumn("name", getName(col("image"))).withColumn("label", getLabel(col("name")))
```
3. Process images using built-in *feature engineering operations*
```python
from zoo.feature.common import ChainedPreprocessing

transformer = ChainedPreprocessing(
    [RowToImageFeature(), ImageResize(64, 64), ImageChannelNormalize(123.0, 117.0, 104.0),
     ImageMatToTensor(), ImageFeatureToTensor()])
```
4. Define model using *Keras-style APIs*
```python
from zoo.pipeline.api.keras.layers import *
from zoo.pipeline.api.keras.models import *
model = Sequential().add(Convolution2D(32, 3, 3, activation='relu', input_shape=(1, 28, 28))) \
    .add(MaxPooling2D(pool_size=(2, 2))).add(Flatten()).add(Dense(10, activation='softmax'))
```
5. Train model using *Spark ML Pipelines*
```python
from bigdl.nn.criterion import CrossEntropyCriterion

classifier = NNClassifier(model, CrossEntropyCriterion(), transformer).setLearningRate(0.003) \
    .setBatchSize(40).setMaxEpoch(1).setFeaturesCol("image").setCachingSample(False)
nnModel = classifier.fit(df)
```
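Once fitted, `nnModel` is a standard Spark ML model, so batch prediction is just a `transform` call on a DataFrame. A minimal follow-up sketch (the `prediction` output column name is the assumed default):
```python
# Apply the trained model to a DataFrame of images; the fitted NNClassifier
# model appends a prediction column to the output DataFrame.
predictionDF = nnModel.transform(df).cache()
predictionDF.select("name", "label", "prediction").show(10)
```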
### _Model Serving_
Using the [POJO](https://en.wikipedia.org/wiki/Plain_old_Java_object) model serving API, you can productionize model serving and inference in any Java-based framework (e.g., [Spring Framework](https://spring.io), Apache [Storm](http://storm.apache.org), [Kafka](http://kafka.apache.org) or [Flink](http://flink.apache.org), etc.), as illustrated below:
```java
import com.intel.analytics.zoo.pipeline.inference.AbstractInferenceModel;
import com.intel.analytics.zoo.pipeline.inference.JTensor;

public class TextClassificationModel extends AbstractInferenceModel {
    public TextClassificationModel() {
        super();
    }
}

TextClassificationModel model = new TextClassificationModel();
model.load(modelPath, weightPath);
List<JTensor> inputs = preprocess(...);
List<List<JTensor>> result = model.predict(inputs);
...
```
## _Built-in deep learning models_
Analytics Zoo provides several built-in deep learning models that you can use for a variety of problem types, such as *object detection*, *image classification*, *text classification*, *recommendation*, *anomaly detection*, *text matching*, *sequence to sequence* etc.
### _Object detection API_
Using *Analytics Zoo Object Detection API* (including a set of pretrained detection models such as SSD and Faster-RCNN), you can easily build your object detection applications (e.g., localizing and identifying multiple objects in images and videos), as illustrated below. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/object-detection/))
1. Download object detection models in Analytics Zoo
You can download a collection of detection models (pretrained on the PASCAL VOC dataset and COCO dataset) from the [detection model zoo](https://analytics-zoo.github.io/master/#ProgrammingGuide/object-detection/#download-link).
2. Use *Object Detection API* for off-the-shelf inference
```python
from zoo.models.image.objectdetection import *
model = ObjectDetector.load_model(model_path)
image_set = ImageSet.read(img_path, sc)
output = model.predict_image_set(image_set)
```
### _Image classification API_
Using *Analytics Zoo Image Classification API* (including a set of pretrained classification models such as VGG, Inception, ResNet, MobileNet, etc.), you can easily build your image classification applications, as illustrated below. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/image-classification/))
1. Download image classification models in Analytics Zoo
You can download a collection of image classification models (pretrained on the ImageNet dataset) from [image classification model zoo](https://analytics-zoo.github.io/master/#ProgrammingGuide/image-classification/#download-link).
2. Use *Image classification API* for off-the-shelf inference
```python
from zoo.models.image.imageclassification import *
model = ImageClassifier.load_model(model_path)
image_set = ImageSet.read(img_path, sc)
output = model.predict_image_set(image_set)
```
### _Text classification API_
*Analytics Zoo Text Classification API* provides a set of pre-defined models (using CNN, LSTM, etc.) for text classifications. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/text-classification/))
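As a quick illustration, a hedged sketch of using the pre-defined `TextClassifier` model is shown below; the constructor arguments (class count, pretrained GloVe embedding file, sequence length, encoder type) are indicative only and vary across Analytics Zoo versions, so please check the linked guide for the exact signature:
```python
from zoo.common.nncontext import init_nncontext
from zoo.models.textclassification import TextClassifier

sc = init_nncontext()
# Illustrative arguments: 20 target classes, a pretrained GloVe embedding file,
# padded sequences of 500 tokens, and a CNN encoder.
text_classifier = TextClassifier(class_num=20,
                                 embedding_file="/path/to/glove.6B.200d.txt",
                                 sequence_length=500,
                                 encoder="cnn")
text_classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
text_classifier.fit(train_rdd, batch_size=128, nb_epoch=20)  # train_rdd: RDD of BigDL Samples
```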
### _Recommendation API_
*Analytics Zoo Recommendation API* provides a set of pre-defined models (such as Neural Collaborative Filtering, Wide and Deep Learning, etc.) for recommendations. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/recommendation/))
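For example, a hedged sketch of the pre-defined Neural Collaborative Filtering model (assuming the `zoo.models.recommendation.NeuralCF` entry point; argument names and defaults may differ by version):
```python
from zoo.models.recommendation import NeuralCF

# Illustrative sizes: 200 users, 100 items, ratings from 1 to 5 treated as 5 classes.
ncf = NeuralCF(user_count=200, item_count=100, class_num=5)
ncf.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
ncf.fit(train_rdd, batch_size=256, nb_epoch=10)  # train_rdd: RDD of BigDL Samples (user, item, label)
```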
### _Anomaly detection API_
*Analytics Zoo Anomaly Detection API* provides a set of pre-defined models based on LSTM to detect anomalies for time series data. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/anomaly-detection/))
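A hedged usage sketch (assuming the `zoo.models.anomalydetection.AnomalyDetector` entry point; the window and feature sizes below are placeholders):
```python
from zoo.models.anomalydetection import AnomalyDetector

# Each training sample is assumed to be an unrolled window of 50 time steps with 3 features.
model = AnomalyDetector(feature_shape=(50, 3))
model.compile(optimizer="adam", loss="mse")
model.fit(train_rdd, batch_size=1024, nb_epoch=20)  # train_rdd: RDD of BigDL Samples
y_predict = model.predict(test_rdd)                 # predictions are compared against actuals to score anomalies
```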
### _Text matching API_
*Analytics Zoo Text Matching API* provides a pre-defined KNRM model for ranking or classification. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/text-matching/))
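A hedged sketch of loading the pre-defined KNRM model (assuming the `zoo.models.textmatching.KNRM` entry point; the text lengths and embedding file are placeholders, and the exact signature may differ):
```python
from zoo.models.textmatching import KNRM

# Illustrative arguments: query length 10, document length 40, pretrained GloVe embeddings.
knrm = KNRM(text1_length=10, text2_length=40,
            embedding_file="/path/to/glove.840B.300d.txt")
```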
### _Sequence to sequence API_
*Analytics Zoo Sequence to Sequence API* provides a set of pre-defined models based on recurrent neural networks for sequence-to-sequence problems. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/seq2seq/))
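A heavily hedged sketch (the `Seq2seq`, `RNNEncoder` and `RNNDecoder` names and their `initialize` arguments are assumptions based on the seq2seq guide and may not match your version exactly):
```python
from zoo.models.seq2seq import Seq2seq, RNNEncoder, RNNDecoder

# Illustrative configuration: a single-layer LSTM encoder/decoder pair on toy-sized sequences.
encoder = RNNEncoder.initialize("lstm", nlayers=1, hidden_size=4)
decoder = RNNDecoder.initialize("lstm", nlayers=1, hidden_size=4)
seq2seq = Seq2seq(encoder, decoder, input_shape=[2, 4], output_shape=[2, 4])
```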
## _Reference use cases_
Analytics Zoo provides a collection of end-to-end reference use cases, including *time series anomaly detection*, *sentiment analysis*, *fraud detection*, *image similarity*, etc. (See more details [here](https://analytics-zoo.github.io/master/#ProgrammingGuide/usercases-overview/))
## _Docker images and builders_
### _Analytics-Zoo in Docker_
**By default, the Analytics-Zoo image has the following packages installed:**
- git
- maven
- Oracle jdk 1.8.0_152 (in /opt/jdk1.8.0_152)
- python 2.7.6 or 3.6.7
- pip
- numpy
- scipy
- pandas
- scikit-learn
- matplotlib
- seaborn
- jupyter
- wordcloud
- moviepy
- requests
- tensorflow
- spark-${SPARK_VERSION} (in /opt/work/spark-${SPARK_VERSION})
- Analytics-Zoo distribution (in /opt/work/analytics-zoo-${ANALYTICS_ZOO_VERSION})
- Analytics-Zoo source code (in /opt/work/analytics-zoo)
**The work dir for Analytics-Zoo is /opt/work.**
- download-analytics-zoo.sh is used for downloading Analytics-Zoo distributions.
- start-notebook.sh is used for starting the Jupyter notebook. You can specify the environment settings and Spark settings to start a specific Jupyter notebook.
- analytics-zoo-${ANALYTICS_ZOO_VERSION} is the home directory of the Analytics-Zoo distribution.
- analytics-zoo-SPARK_x.x-x.x.x-dist.zip is the zip file of the Analytics-Zoo distribution.
- spark-${SPARK_VERSION} is the Spark home.
- analytics-zoo is cloned from https://github.com/intel-analytics/analytics-zoo and contains apps and examples using Analytics-Zoo.
### _How to build it_
**By default, you can build an Analytics-Zoo `default` image with the latest nightly-build Analytics-Zoo distribution:**
```bash
sudo docker build --rm -t intelanalytics/analytics-zoo:default .
```
**If you need http and https proxy to build the image:**
```bash
sudo docker build \
--build-arg http_proxy=http://your-proxy-host:your-proxy-port \
--build-arg https_proxy=https://your-proxy-host:your-proxy-port \
--rm -t intelanalytics/analytics-zoo:default .
```
**If you need python 3 to build the image:**
```bash
sudo docker build \
--build-arg PY_VERSION_3=YES \
--rm -t intelanalytics/analytics-zoo:default-py3 .
```
**You can also specify the ANALYTICS_ZOO_VERSION and SPARK_VERSION to build a specific Analytics-Zoo image:**
```bash
sudo docker build \
--build-arg http_proxy=http://your-proxy-host:your-proxy-port \
--build-arg https_proxy=https://your-proxy-host:your-proxy-port \
--build-arg ANALYTICS_ZOO_VERSION=0.3.0 \
--build-arg BIGDL_VERSION=0.6.0 \
--build-arg SPARK_VERSION=2.3.1 \
--rm -t intelanalytics/analytics-zoo:0.3.0-bigdl_0.6.0-spark_2.3.1 .
```
### _How to use the image_
**To start a notebook directly with a specified port (e.g., 12345), run one of the commands below. You can then view the notebook at http://[host-ip]:12345**
```bash
sudo docker run -it --rm -p 12345:12345 \
-e NotebookPort=12345 \
-e NotebookToken="your-token" \
intelanalytics/analytics-zoo:default
sudo docker run -it --rm --net=host \
-e NotebookPort=12345 \
-e NotebookToken="your-token" \
intelanalytics/analytics-zoo:default
sudo docker run -it --rm -p 12345:12345 \
-e NotebookPort=12345 \
-e NotebookToken="your-token" \
intelanalytics/analytics-zoo:0.3.0-bigdl_0.6.0-spark_2.3.1
sudo docker run -it --rm --net=host \
-e NotebookPort=12345 \
-e NotebookToken="your-token" \
intelanalytics/analytics-zoo:0.3.0-bigdl_0.6.0-spark_2.3.1
```
**If you need http and https proxy in your environment:**
```bash
sudo docker run -it --rm -p 12345:12345 \
-e NotebookPort=12345 \
-e NotebookToken="your-token" \
-e http_proxy=http://your-proxy-host:your-proxy-port \
-e https_proxy=https://your-proxy-host:your-proxy-port \
intelanalytics/analytics-zoo:default
sudo docker run -it --rm --net=host \
-e NotebookPort=12345 \
-e NotebookToken="your-token" \
-e http_proxy=http://your-proxy-host:your-proxy-port \
-e https_proxy=https://your-proxy-host:your-proxy-port \
intelanalytics/analytics-zoo:default
sudo docker run -it --rm -p 12345:12345 \
-e NotebookPort=12345 \
-e NotebookToken="your-token" \
-e http_proxy=http://your-proxy-host:your-proxy-port \
-e https_proxy=https://your-proxy-host:your-proxy-port \
intelanalytics/analytics-zoo:0.3.0-bigdl_0.6.0-spark_2.3.1
sudo docker run -it --rm --net=host \
-e NotebookPort=12345 \
-e NotebookToken="your-token" \
-e http_proxy=http://your-proxy-host:your-proxy-port \
-e https_proxy=https://your-proxy-host:your-proxy-port \
intelanalytics/analytics-zoo:0.3.0-bigdl_0.6.0-spark_2.3.1
```
**You can also start the container first:**
```bash
sudo docker run -it --rm --net=host \
-e NotebookPort=12345 \
-e NotebookToken="your-token" \
intelanalytics/analytics-zoo:default bash
```
**In the container, after setting proxy and ports, you can start the Notebook by:**
```bash
/opt/work/start-notebook.sh
```
### _Notice_
**If you need the nightly-build version of Analytics-Zoo, please pull the image from Docker Hub with:**
```bash
sudo docker pull intelanalytics/analytics-zoo:latest
```
**Please follow the README in each app folder to test the Jupyter notebooks.**
**With version 0.3+ of the Analytics-Zoo Docker image, you can specify the runtime configuration of Spark:**
```bash
sudo docker run -itd --net=host \
-e NotebookPort=12345 \
-e NotebookToken="1234qwer" \
-e http_proxy=http://your-proxy-host:your-proxy-port \
-e https_proxy=https://your-proxy-host:your-proxy-port \
-e RUNTIME_DRIVER_CORES=4 \
-e RUNTIME_DRIVER_MEMORY=20g \
-e RUNTIME_EXECUTOR_CORES=4 \
-e RUNTIME_EXECUTOR_MEMORY=20g \
-e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
intelanalytics/analytics-zoo:latest
```