# ml-workspace
All-in-one web-based development environment for machine learning
Getting Started • Features & Screenshots • Support • Report a Bug • FAQ • Known Issues • Contribution
The ML workspace is an all-in-one web-based IDE specialized for machine learning and data science. It is simple to deploy and gets you started within minutes to productively build ML solutions on your own machines. This workspace is the ultimate tool for developers, preloaded with a variety of popular data science libraries (e.g., TensorFlow, PyTorch, Keras, Sklearn) and dev tools (e.g., Jupyter, VS Code, Tensorboard) that are perfectly configured, optimized, and integrated.

## Highlights

- 💫 Jupyter, JupyterLab, and Visual Studio Code web-based IDEs.
- 🗃 Pre-installed with many popular data science libraries & tools.
- 🖥 Full Linux desktop GUI accessible via web browser.
- 🔀 Seamless Git integration optimized for notebooks.
- 📈 Integrated hardware & training monitoring via Tensorboard & Netdata.
- 🚪 Access from anywhere via Web, SSH, or VNC under a single port.
- 🎛 Usable as remote kernel (Jupyter) or remote machine (VS Code) via SSH.
- 🐳 Easy to deploy on Mac, Linux, and Windows via Docker.

## Configuration

The workspace can be configured at container startup via the following environment variables (pass them to Docker with `--env`):

| Variable | Description | Default |
|---|---|---|
| WORKSPACE_BASE_URL | The base URL under which Jupyter and all other tools are reachable. | / |
| WORKSPACE_SSL_ENABLED | Enable or disable SSL. When set to true, either a certificate (cert.crt) must be mounted to /resources/ssl or, if none is provided, the container generates a self-signed certificate. | false |
| WORKSPACE_AUTH_USER | Basic auth user name. To enable basic auth, both the user and password need to be set. We recommend using AUTHENTICATE_VIA_JUPYTER to secure the workspace instead. | |
| WORKSPACE_AUTH_PASSWORD | Basic auth user password. To enable basic auth, both the user and password need to be set. We recommend using AUTHENTICATE_VIA_JUPYTER to secure the workspace instead. | |
| WORKSPACE_PORT | Configures the main container-internal port of the workspace proxy. For most scenarios, this configuration should not be changed; use the port mapping via Docker instead if the workspace should be accessible from a different port. | 8080 |
| CONFIG_BACKUP_ENABLED | Automatically back up and restore user configuration (such as .ssh, .jupyter, or .gitconfig from the user's home directory) to the persisted /workspace folder. | true |
| SHARED_LINKS_ENABLED | Enable or disable the capability to share resources via external links. This is used to enable file sharing, access to workspace-internal ports, and easy command-based SSH setup. All shared links are protected via a token. However, there are certain risks since the token cannot be easily invalidated after sharing and does not expire. | true |
| INCLUDE_TUTORIALS | If true, a selection of tutorial and introduction notebooks are added to the /workspace folder at container startup, but only if the folder is empty. | true |
| MAX_NUM_THREADS | The number of threads used for computations by various common libraries (MKL, OPENBLAS, OMP, NUMBA, ...). You can also use auto to let the workspace dynamically determine the number of threads based on available CPU resources. This configuration can be overwritten by the user from within the workspace. Generally, it is good to set it at or below the number of CPUs available to the workspace. | auto |
| **Jupyter Configuration:** | | |
| SHUTDOWN_INACTIVE_KERNELS | Automatically shut down inactive kernels after a given timeout (to clean up memory or GPU resources). The value can be either a timeout in seconds or true for a default timeout of 48h. | false |
| AUTHENTICATE_VIA_JUPYTER | If true, all HTTP requests will be authenticated against the Jupyter server, meaning that the authentication method configured with Jupyter will be used for all other tools as well. This can be deactivated with false. Any other value will activate this authentication and will be applied as the token via Jupyter's NotebookApp.token configuration. | false |
| NOTEBOOK_ARGS | Add and overwrite Jupyter configuration options via command-line args. Refer to the Jupyter configuration overview for all options. | |
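As a minimal sketch, starting the workspace with a few of these options set might look like the following (the image name `mltooling/ml-workspace:latest` and the token value are placeholders, not taken from this document):

```bash
# Sketch only: start the workspace with a persisted /workspace folder and a
# Jupyter token for authentication. Image name and token are assumptions.
docker run -d \
    -p 8080:8080 \
    --name "ml-workspace" \
    -v "${PWD}:/workspace" \
    --env AUTHENTICATE_VIA_JUPYTER="my-secret-token" \
    --env MAX_NUM_THREADS="auto" \
    --restart always \
    mltooling/ml-workspace:latest
```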
The following variables control GPU visibility and behavior inside the workspace:

| Variable | Description | Default |
|---|---|---|
| NVIDIA_VISIBLE_DEVICES | Controls which GPUs will be accessible inside the workspace. By default, all GPUs from the host are accessible within the workspace. You can either use all, none, or specify a comma-separated list of device IDs (e.g., 0,1). You can find the list of available device IDs by running nvidia-smi on the host machine. | all |
| CUDA_VISIBLE_DEVICES | Controls which GPUs CUDA applications running inside the workspace will see. By default, all GPUs that the workspace has access to will be visible. To restrict applications, provide a comma-separated list of internal device IDs (e.g., 0,2) based on the available devices within the workspace (run nvidia-smi). In comparison to NVIDIA_VISIBLE_DEVICES, the workspace user will still be able to access other GPUs by overwriting this configuration from within the workspace. | |
| TF_FORCE_GPU_ALLOW_GROWTH | By default, the majority of GPU memory will be allocated by the first execution of a TensorFlow graph. While this behavior can be desirable for production pipelines, it is less desirable for interactive use. Use true to enable dynamic GPU memory allocation or false to instruct TensorFlow to allocate all memory at execution. | true |
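A sketch of restricting the workspace to specific GPUs could look like this (it assumes the NVIDIA container runtime is installed on the host and that a GPU-enabled image variant exists; the `--runtime` flag and image name are assumptions, not taken from this document):

```bash
# Sketch: make only GPUs 0 and 1 visible inside the workspace and let
# TensorFlow grow GPU memory on demand. Runtime and image name are assumptions.
docker run -d \
    -p 8080:8080 \
    --runtime nvidia \
    --env NVIDIA_VISIBLE_DEVICES="0,1" \
    --env TF_FORCE_GPU_ALLOW_GROWTH="true" \
    mltooling/ml-workspace-gpu:latest
```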
## Features

Jupyter • Desktop GUI • VS Code • JupyterLab • Git Integration • File Sharing • Access Ports • Tensorboard • Extensibility • Hardware Monitoring • SSH Access • Remote Development • Job Execution
The workspace is equipped with a selection of best-in-class open-source development tools to help with the machine learning workflow. Many of these tools can be started from the `Open Tool` menu from Jupyter (the main application of the workspace):
> _Within your workspace you have **full root & sudo privileges** to install any library or tool you need via terminal (e.g., `pip`, `apt-get`, `conda`, or `npm`). You can find more ways to extend the workspace within the [Extensibility](#extensibility) section_
### Jupyter
[Jupyter Notebook](https://jupyter.org/) is a web-based interactive environment for writing and running code. The main building blocks of Jupyter are the file-browser, the notebook editor, and kernels. The file-browser provides an interactive file manager for all notebooks, files, and folders in the `/workspace` directory.
A new notebook can be created by clicking on the `New` drop-down button at the top of the list and selecting the desired language kernel.
> _You can spawn interactive **terminal** instances as well by selecting `New -> Terminal` in the file-browser._
The notebook editor enables users to author documents that include live code, markdown text, shell commands, LaTeX equations, interactive widgets, plots, and images. These notebook documents provide a complete and self-contained record of a computation that can be converted to various formats and shared with others.
> _This workspace has a variety of **third-party Jupyter extensions** activated. You can configure these extensions in the nbextensions configurator: `nbextensions` tab on the file browser_
The Notebook allows code to be run in a range of different programming languages. For each notebook document that a user opens, the web application starts a **kernel** that runs the code for that notebook and returns output. This workspace has a Python 3 kernel pre-installed. Additional kernels can be installed to get access to other languages (e.g., R, Scala, Go) or additional computing resources (e.g., GPUs, CPUs, memory).
> _**Python 2** is deprecated and we do not recommend using it. However, you can still install a Python 2.7 kernel via this command: `/bin/bash /resources/tools/python-27.sh`_
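For instance, installing an additional kernel from a workspace terminal could look like the sketch below (the bash kernel is just an illustrative choice and not part of this document):

```bash
# Sketch: install the Jupyter bash kernel from within the workspace terminal.
pip install bash_kernel
python -m bash_kernel.install
# List all registered kernels to verify the installation:
jupyter kernelspec list
```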
### Desktop GUI
This workspace provides HTTP-based VNC access via [noVNC](https://github.com/novnc/noVNC), so you can access and work within the workspace through a fully featured desktop GUI. To access this desktop GUI, go to `Open Tool`, select `VNC`, and click the `Connect` button. If you are asked for a password, use `vncpassword`.
Once you are connected, you will see a desktop GUI that allows you to install and use full-fledged web browsers or any other tool that is available for Ubuntu. Within the `Tools` folder on the desktop, you will find a collection of install scripts that make it straightforward to install some of the most commonly used development tools, such as Atom, PyCharm, R-Runtime, R-Studio, or Postman (just double-click on the script).
**Clipboard:** If you want to share the clipboard between your machine and the workspace, you can use the copy-paste functionality as described below:
> 💡 _**Long-running tasks:** Use the desktop GUI for long-running Jupyter executions. By running notebooks from the browser of your workspace desktop GUI, all output will be synchronized to the notebook even if you have disconnected your browser from the notebook._
### Visual Studio Code
[Visual Studio Code](https://github.com/microsoft/vscode) (`Open Tool -> VS Code`) is an open-source, lightweight but powerful code editor with built-in support for a variety of languages and a rich ecosystem of extensions. It combines the simplicity of a source code editor with powerful developer tooling, like IntelliSense code completion and debugging. The workspace integrates VS Code as a web-based application accessible through the browser, based on the awesome [code-server](https://github.com/cdr/code-server) project. It allows you to customize every feature to your liking and install any number of third-party extensions.


### Git Integration
Version control is a crucial aspect of productive collaboration. To make this process as smooth as possible, we have integrated a custom-made Jupyter extension specialized in pushing single notebooks, a full-fledged web-based Git client ([ungit](https://github.com/FredrikNoren/ungit)), a tool to open and edit plain text documents (e.g., `.py`, `.md`) as notebooks ([jupytext](https://github.com/mwouts/jupytext)), as well as a notebook merging tool ([nbdime](https://github.com/jupyter/nbdime)). Additionally, JupyterLab and VS Code also provide GUI-based Git clients.
#### Clone Repository
For cloning repositories via `https`, we recommend navigating to the desired root folder and clicking on the `git` button as shown below:
This might ask for some required settings and, subsequently, opens [ungit](https://github.com/FredrikNoren/ungit), a web-based Git client with a clean and intuitive UI that makes it convenient to sync your code artifacts. Within ungit, you can clone any repository. If authentication is required, you will be asked for your credentials.
#### Push, Pull, Merge, and Other Git Actions
To commit and push a single notebook to a remote Git repository, we recommend using the Git plugin integrated into Jupyter, as shown below:
For more advanced Git operations, we recommend using [ungit](https://github.com/FredrikNoren/ungit). With ungit, you can perform most common Git actions such as push, pull, merge, branch, tag, checkout, and many more.
#### Diffing and Merging Notebooks
Jupyter notebooks are great, but they are often huge files with a very specific JSON format. To enable seamless diffing and merging via Git, this workspace comes pre-installed with [nbdime](https://github.com/jupyter/nbdime). Nbdime understands the structure of notebook documents and, therefore, automatically makes intelligent decisions when diffing and merging notebooks. If you have merge conflicts, nbdime will make sure that the notebook is still readable by Jupyter, as shown below:
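Besides the Jupyter integration, nbdime also ships command-line tools; a quick sketch (file names are placeholders):

```bash
# Show a structured diff of two notebook versions in the terminal:
nbdiff notebook_before.ipynb notebook_after.ipynb
# Or render a rich, side-by-side diff in the browser:
nbdiff-web notebook_before.ipynb notebook_after.ipynb
```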
Furthermore, the workspace comes pre-installed with [jupytext](https://github.com/mwouts/jupytext), a Jupyter plugin that reads and writes notebooks as plain text files. This allows you to open, edit, and run scripts or markdown files (e.g., `.py`, `.md`) as notebooks within Jupyter. In the following screenshot, we have opened a markdown file via Jupyter:
In combination with Git, jupytext enables a clear diff history and easy merging of version conflicts. With both of those tools, collaborating on Jupyter notebooks with Git becomes straightforward.
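jupytext can also be driven from the command line; a small sketch with placeholder file names:

```bash
# Pair a notebook with a plain-text .py representation:
jupytext --set-formats ipynb,py notebook.ipynb
# Re-synchronize the paired files after editing either one:
jupytext --sync notebook.ipynb
# Convert an existing script into a notebook:
jupytext --to notebook analysis_script.py
```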
### File Sharing
The workspace has a feature to share any file or folder with anyone via a token-protected link. To share data via a link, select any file or folder from the Jupyter directory tree and click on the share button as shown in the following screenshot:
This will generate a unique link protected via a token that gives anyone with the link access to view and download the selected data via the [Filebrowser](https://github.com/filebrowser/filebrowser) UI:
To deactivate or manage (e.g., provide edit permissions) shared links, open the Filebrowser via `Open Tool -> Filebrowser` and select `Settings->User Management`.
### Access Ports
It is possible to securely access any workspace internal port by selecting `Open Tool -> Access Port`. With this feature, you are able to access a REST API or web application running inside the workspace directly with your browser. The feature enables developers to build, run, test, and debug REST APIs or web applications directly from the workspace.
If you want to use an HTTP client or share access to a given port, you can select the `Get shareable link` option. This generates a token-secured link that anyone with access to the link can use to access the specified port.
> _The application needs to be reachable via a relative URL path, or it must be configured with the base path (`/tools/PORT/`)._
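As a quick sketch, you could start a simple server inside the workspace and reach it through the relative base path mentioned above (the port number is just an example):

```bash
# Serve the /workspace folder on a workspace-internal port:
python -m http.server 8090 --directory /workspace
# Then select `Open Tool -> Access Port`, enter 8090, or browse to
# <workspace-url>/tools/8090/ from an authenticated browser session.
```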
### SSH Access
> _The setup script only runs on Mac and Linux. Windows is currently not supported._
Just run the setup command or script on the machine from which you want to set up a connection to the workspace and enter a name for the connection (e.g., `my-workspace`). You might also be asked for some additional input during the process, e.g., to install a remote kernel if `remote_ikernel` is installed. Once the passwordless SSH connection is successfully set up and tested, you can securely connect to the workspace by simply executing `ssh my-workspace`.
Besides the ability to execute commands on a remote machine, SSH also provides a variety of other features that can improve your development workflow as described in the following sections.
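Once `ssh my-workspace` works, standard SSH tooling can be used on top of that connection; a few sketches (connection name, paths, and ports are examples):

```bash
# Copy data from your local machine into the workspace:
scp ./local-data.csv my-workspace:/workspace/data/
# Forward a workspace-internal port (e.g., 8090) to your local machine:
ssh -N -L 8090:localhost:8090 my-workspace
# Run a one-off command on the workspace:
ssh my-workspace "nvidia-smi"
```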
> 📖 _You can find additional features and information about the Remote SSH extension in [this guide](https://code.visualstudio.com/docs/remote/ssh)._
### Tensorboard
If you have opened a Tensorboard instance in a valid log directory, you will see the visualizations of your logged data:
> _Tensorboard can be used in combination with many other ML frameworks besides TensorFlow. Using the [tensorboardX](https://github.com/lanpa/tensorboardX) library, you can log from basically any Python-based library. PyTorch also has a direct Tensorboard integration as described [here](https://pytorch.org/docs/stable/tensorboard.html)._
If you prefer to see the Tensorboard directly within your notebook, you can make use of the following **Jupyter magic**:
```
%load_ext tensorboard.notebook
%tensorboard --logdir /workspace/path/to/logs
```
### Hardware Monitoring
The workspace provides two pre-installed web-based tools that help developers get insights into everything happening on the system during model training and other experimentation tasks, and figure out performance bottlenecks.
[Netdata](https://github.com/netdata/netdata) (`Open Tool -> Netdata`) is a real-time hardware and performance monitoring dashboard that visualizes the processes and services on your Linux system. It monitors metrics about CPU, GPU, memory, disks, networks, processes, and more.
[Glances](https://github.com/nicolargo/glances) (`Open Tool -> Glances`) is another web-based hardware monitoring dashboard and can be used as an alternative to Netdata.
> _Netdata and Glances will show you the hardware statistics for the entire machine on which the workspace container is running._
### Run as a job
> _A job is defined as any computational task that runs for a certain time to completion, such as a model training or a data pipeline._
The workspace image can also be used to execute arbitrary Python code without starting any of the pre-installed tools. This provides a seamless way to productize your ML projects since the code that has been developed interactively within the workspace will have the same environment and configuration when run as a job via the same workspace image.
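One possible way to run a script as a job with the same image could look like the following sketch (the image name, the entrypoint override, and the script path are assumptions; check the image documentation for the officially supported mechanism):

```bash
# Sketch: mount the project folder and execute a training script instead of
# starting the interactive tools. Image name and script path are placeholders.
docker run --rm \
    -v "${PWD}:/workspace" \
    --entrypoint python \
    mltooling/ml-workspace:latest \
    /workspace/train.py
```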
The VNC server used for the desktop GUI can be configured via the following environment variables:

| Variable | Description | Default |
|---|---|---|
| VNC_PW | Password of VNC connection. This password only needs to be secure if the VNC server is directly exposed. If it is used via noVNC, it is already protected based on the configured authentication mechanism. | vncpassword |
| VNC_RESOLUTION | Default desktop resolution of VNC connection. When using noVNC, the resolution will be dynamically adapted to the window size. | 1600x900 |
| VNC_COL_DEPTH | Default color depth of VNC connection. | 24 |
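For example, overriding the VNC defaults at container start could look like this (the image name is a placeholder, as before):

```bash
# Sketch: start the workspace with a custom VNC password and resolution.
docker run -d \
    -p 8080:8080 \
    --env VNC_PW="my-vnc-password" \
    --env VNC_RESOLUTION="1920x1080" \
    mltooling/ml-workspace:latest
```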