# hadoop-data-pipeline
**Repository Path**: vidi_0312/hadoop-data-pipeline
## Basic Information
- **Project Name**: hadoop-data-pipeline
- **Description**: copy from github
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 1
- **Created**: 2018-03-22
- **Last Updated**: 2020-12-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Flow
## Objective of the demo
The goal of this data pipeline flow is to demonstrate a typical (but simplified) ETL flow in Hadoop using Falcon and Atlas.
## Introduction
As part of this flow, we ingest data files that are copied to the landing zone on a gateway server and then process them automatically at regular intervals using Falcon. When the workflow runs, the files are ingested, stored, and transformed, and the transformed data is exported with Sqoop out of the cluster into a MySQL database. 
Once the data is processed, the Hive processing lineage is available in Apache Atlas (optional). 
The flow involves the following components and steps:
* Gateway Server
	* runs a Flume agent with a spooling-directory source that transfers the incoming data files to HDFS (a configuration sketch follows this list)
	
* Master Server
	* runs the Falcon job that performs the following steps (a Sqoop export sketch also follows this list)
		* Clear the temp tables
		* Make a copy of the incoming data for backup purposes
		* Insert the raw data into a temp table
		* Convert the XML data into JSON and insert it into another table
		* Apply the aggregation / transformation process to the JSON table
		* Insert the processed data from the temp table into the history table
		* Export the data out of Hive into a MySQL table
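
The repository ships its own Flume agent configuration; purely as an illustration of the spooling-directory pattern described above, a minimal agent that watches a landing directory and writes to HDFS could look like the sketch below. The agent name, directory paths, and channel sizing are placeholder assumptions, not the demo's actual values.

```
# Sketch of a spooling-directory -> HDFS Flume agent (names and paths are placeholders)
agent1.sources  = spool-src
agent1.channels = mem-ch
agent1.sinks    = hdfs-sink

# Watch the landing zone; Flume ingests each file dropped here and marks it .COMPLETED
agent1.sources.spool-src.type     = spooldir
agent1.sources.spool-src.spoolDir = /landing/incoming
agent1.sources.spool-src.channels = mem-ch

# Simple in-memory channel between source and sink
agent1.channels.mem-ch.type     = memory
agent1.channels.mem-ch.capacity = 10000

# Write the events into the HDFS input directory consumed by the Falcon workflow
agent1.sinks.hdfs-sink.type           = hdfs
agent1.sinks.hdfs-sink.channel        = mem-ch
agent1.sinks.hdfs-sink.hdfs.path      = /data/pipeline/input
agent1.sinks.hdfs-sink.hdfs.fileType  = DataStream
agent1.sinks.hdfs-sink.hdfs.rollCount = 0
agent1.sinks.hdfs-sink.hdfs.rollSize  = 0
```

Such an agent would be started on the gateway with `flume-ng agent --conf conf --conf-file <config-file> --name agent1`, where `agent1` matches the agent name used in the configuration.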
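
The last Falcon step exports the Hive output into MySQL via Sqoop. As a rough sketch of that step (the connection string, credentials, table name, and export directory below are hypothetical; the demo's own scripts define the real ones):

```
# Push the HDFS warehouse directory of the result table into a MySQL table
sqoop export \
  --connect jdbc:mysql://localhost/pipelinedb \
  --username demo --password demo \
  --table processed_output \
  --export-dir /apps/hive/warehouse/processed_output \
  --input-fields-terminated-by '\001'
```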
		
## Architecture
		
## Setting up the project	
Prerequisite: download and install the [Hortonworks Sandbox](http://www.hortonworks.com/sandbox).
```
IMPORTANT NOTE:
This demo assumes that the automated process runs as the user "admin".
If you plan to use another user, make sure file and directory permissions are adjusted accordingly.
```
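
For example, HDFS ownership and permissions on the workflow's directories can be adjusted along these lines (the path below is a placeholder; point it at the directories this demo actually uses):

```
# Grant your user ownership of the pipeline's HDFS directories (placeholder path)
hdfs dfs -chown -R <your-user>:hdfs /data/pipeline
hdfs dfs -chmod -R 775 /data/pipeline
```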
[Next step: 1_PREP](https://github.com/sainib/hadoop-data-pipeline/blob/master/READMEs/1_PREP.md)