# simple-pipeline
**Repository Path**: mirrors_databricks/simple-pipeline
## Basic Information
- **Project Name**: simple-pipeline
- **Description**: Example pipeline for bit.io
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-10-19
- **Last Updated**: 2025-10-12
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# bit.io simple pipeline
A simple bit.io pipeline example using scripts and the UNIX cron scheduler.
## Scope
This repo is intended to provide a simple pipeline example for getting started with programmatic data ingestion and updates in bit.io. To keep the repo simple, many best practices, such as logging, configuration files, and a more robust orchestration/scheduling framework, are omitted.
## Setup
- Add a `.env` file at the root with your own bit.io Postgres connection string as `PG_CONN_STRING`
- Create environment
  - `python3 -m venv venv`
  - `source venv/bin/activate`
  - `python3 -m pip install --upgrade pip -r requirements.txt`
- Create a repo on bit.io (we named ours `simple_pipeline` for this demo)
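
As a rough illustration of the first setup step, the sketch below shows one way a script could pick up `PG_CONN_STRING` from a `.env` file using only the standard library. The function names `read_env_file` and `get_conn_string` are hypothetical and not taken from this repo, which may well use a helper library instead.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    values = {}
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, since connection strings may contain "=".
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_conn_string(path=".env"):
    """Return PG_CONN_STRING from the environment, falling back to the .env file."""
    conn = os.environ.get("PG_CONN_STRING") or read_env_file(path).get("PG_CONN_STRING")
    if not conn:
        raise RuntimeError("PG_CONN_STRING is not set in the environment or .env")
    return conn
```

Checking the process environment first lets a scheduler or shell override the value in `.env` without editing the file.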
## Contents
- simple_pipeline
  - main.py # command line script for ETL jobs
  - extract.py # Handles extraction of data into a pandas DataFrame
  - transform.py # Transforms data using pandas
  - load.py # Loads data from pandas to bit.io
  - sql_executor.py # Runs arbitrary SQL scripts on bit.io
  - ca_covid_data.sql # Example SQL script for bit.io
  - acs_5yr_population_data.csv # Population data, this changes annually
- README.md
- requirements.txt
- scheduled_run.sh # This shows how to batch calls to the python scripts together for a simple pipeline
- LICENSE
## Usage
As a demo piece, this simple pipeline contains two main data processing scripts:
1. `simple_pipeline/main.py` extracts, optionally transforms, and loads a CSV from a URL or local file into bit.io
2. `simple_pipeline/sql_executor.py` executes SQL scripts on bit.io, such as for creating joined, de-normalized tables
In addition, a shell script `scheduled_run.sh` is included to show how the two scripts can be composed to form a simple pipeline. Utility programs like `cron` can then be used to run the shell script on a schedule for automated updates in bit.io. Here is an example `crontab` job that I created on my local system for this pipeline:
`45 09 * * * cd ~/Documents/simple_pipeline && ./scheduled_run.sh`
The `45 09 * * *` defines a schedule of once daily at 09:45 in the system's local time. You can learn more about cron syntax at [crontab.guru](https://crontab.guru/).
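
To make the five cron fields (minute, hour, day of month, month, day of week) concrete, here is a minimal sketch that checks whether a given time falls on a schedule. It is a simplification for illustration only: it handles `*` and plain integers, not ranges, lists, or step values, and the function names are hypothetical.

```python
from datetime import datetime


def field_matches(field, value):
    """Match one cron field; only `*` and plain integers are supported here."""
    return field == "*" or int(field) == value


def cron_matches(expression, moment):
    """Return True if `moment` falls on the schedule described by `expression`."""
    minute, hour, day, month, weekday = expression.split()
    # cron numbers weekdays from Sunday = 0; Python's weekday() uses Monday = 0.
    cron_weekday = (moment.weekday() + 1) % 7
    # Simplification: all fields are combined with AND. Real cron treats
    # day-of-month and day-of-week as OR when both are restricted.
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day)
        and field_matches(month, moment.month)
        and field_matches(weekday, cron_weekday)
    )


if __name__ == "__main__":
    schedule = "45 09 * * *"
    print(cron_matches(schedule, datetime(2024, 1, 1, 9, 45)))
    print(cron_matches(schedule, datetime(2024, 1, 1, 9, 46)))
```

With the schedule from the crontab entry, the first call matches (minute 45, hour 9) and the second does not, since only the minute field differs.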
## Using simple_pipeline/main.py
This is a simple extract, transform, load (ETL) script. The main script `main.py` takes a source and a destination as positional arguments and can be run from the command line as follows:
`python simple_pipeline/main.py <source> <destination>`
The script also takes a `-local_source` option that indicates the source is a local file path (default is a URL) and a `-name` option with an argument for a transformation function to run. Here is an example command for a URL source with a transformation function called "nyt_cases_counties":
`python main.py -name nyt_cases_counties https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv bitdotio/simple_pipeline.cases_counties`
The transformation step is optional: omit `-name` and its argument to skip it. Here is an example command that uses a local file as the source and applies a transformation function called "acs_population_counties":
`python main.py -local_source -name acs_population_counties acs_5yr_population_data.csv bitdotio/simple_pipeline.population_counties`
The transformation functions are defined in `transform.py`. If you want to run these examples, make sure to update the destination with your own username in place of `bitdotio` and your own repo name if it is different from `simple_pipeline`.
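
The command-line interface described above could be sketched with `argparse` roughly as follows. This is an assumption about how such a parser might look, not the repo's actual `main.py`; `build_parser` and `split_destination` are hypothetical names, and the destination format `username/repo.table` is inferred from the example commands.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options described in this README."""
    parser = argparse.ArgumentParser(
        description="Extract, optionally transform, and load a CSV."
    )
    parser.add_argument(
        "-local_source",
        action="store_true",
        help="treat the source as a local file path instead of a URL",
    )
    parser.add_argument(
        "-name",
        default=None,
        help="name of a transformation function to apply (optional)",
    )
    parser.add_argument("source", help="URL or local path of the CSV")
    parser.add_argument("destination", help="target as username/repo.table")
    return parser


def split_destination(destination):
    """Split 'username/repo.table' into ('username/repo', 'table')."""
    schema, sep, table = destination.rpartition(".")
    if not sep or not schema or not table:
        raise ValueError(f"expected 'username/repo.table', got {destination!r}")
    return schema, table


if __name__ == "__main__":
    args = build_parser().parse_args()
    schema, table = split_destination(args.destination)
    print(args.local_source, args.name, args.source, schema, table)
```

When `-name` is omitted, `args.name` is `None`, which a script could use as the signal to skip the transformation step.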
## Using simple_pipeline/sql_executor.py
Once data has been extracted, transformed, and loaded, we sometimes want to create derived tables within the database. This script takes a path to a SQL script to run on bit.io; the example below also passes a username and a repo name. For example, to create the derived California COVID data table, the script is called as follows:
`python sql_executor.py ca_covid_data.sql bitdotio simple_pipeline`
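
The core idea of running a SQL script file against a database can be sketched as below. This is not the repo's `sql_executor.py`: `run_sql_script` is a hypothetical helper that accepts any Python DB-API 2.0 connection, and it does not model how the username and repo arguments are used, since that is not described here.

```python
from pathlib import Path


def run_sql_script(connection, script_path):
    """Execute the contents of a SQL file on an open DB-API connection."""
    sql = Path(script_path).read_text()
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        # Commit only if the script ran without raising.
        connection.commit()
    except Exception:
        # Undo any partial work before re-raising the error.
        connection.rollback()
        raise
    finally:
        cursor.close()
```

With a Postgres driver such as `psycopg2`, the connection could be created with `psycopg2.connect(conn_string)` using the `PG_CONN_STRING` value from setup. Note that drivers differ on whether a single `execute` call may contain several statements, so a multi-statement script may need driver-specific handling.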