189 Star 773 Fork 244

GVP科学大数据开源社区 / PiFlow

Gitee — Enterprise-level DevOps R&D management platform
Join Gitee
Clone or download
Cancel
Notice: Creating folder will generate an empty file .keep, because not support in Git
Loading...
README.md


GitHub releases GitHub stars GitHub forks GitHub downloads GitHub issues

πFlow is an easy to use, powerful big data pipeline system. Try PiFlow v0.9 with: http://piflow.cstcloud.cn/

Table of Contents

Features

  • Easy to use
    • provide a WYSIWYG web interface to configure data flow
    • monitor data flow status
    • check the logs of data flow
    • provide checkpoints
  • Strong scalability:
    • Support customized development of data processing components
  • Superior performance
    • based on distributed computing engine Spark
  • Powerful
    • 100+ data processing components available
    • include spark、mllib、hadoop、hive、hbase、solr、redis、memcache、elasticSearch、jdbc、mongodb、http、ftp、xml、csv、json,etc.

Architecture

Requirements

  • JDK 1.8
  • Scala-2.11.8
  • Apache Maven 3.1.0 or newer
  • Spark-2.1.0、 Spark-2.2.0、 Spark-2.3.0
  • Hadoop-2.6.0
  • Apache Livy-0.7.1

Getting Started

To Build:

  • install external package

        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/spark-xml_2.11-0.4.2.jar -DgroupId=com.databricks -DartifactId=spark-xml_2.11 -Dversion=0.4.2 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/java_memcached-release_2.6.6.jar -DgroupId=com.memcached -DartifactId=java_memcached-release -Dversion=2.6.6 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/ojdbc6-11.2.0.3.jar -DgroupId=oracle -DartifactId=ojdbc6 -Dversion=11.2.0.3 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/edtftpj.jar -DgroupId=ftpClient -DartifactId=edtftp -Dversion=1.0.0 -Dpackaging=jar
  • mvn clean package -Dmaven.test.skip=true

        [INFO] Replacing original artifact with shaded artifact.
        [INFO] Reactor Summary:
        [INFO]
        [INFO] piflow-project ..................................... SUCCESS [  4.369 s]
        [INFO] piflow-core ........................................ SUCCESS [01:23 min]
        [INFO] piflow-configure ................................... SUCCESS [ 12.418 s]
        [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
        [INFO] piflow-server ...................................... SUCCESS [02:05 min]
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 06:01 min
        [INFO] Finished at: 2020-05-21T15:22:58+08:00
        [INFO] Final Memory: 118M/691M
        [INFO] ------------------------------------------------------------------------

Run Piflow Server:

  • run piflow server on Intellij:

    • download piflow: git clone https://github.com/cas-bigdatalab/piflow.git

    • import piflow into Intellij

    • edit config.properties file

    • build piflow to generate piflow jar:

      • Edit Configurations --> Add New Configuration --> Maven
      • Name: package
      • Command line: clean package -Dmaven.test.skip=true -X
      • run 'package' (piflow jar file will be built in ../piflow/piflow-server/target/piflow-server-0.9.jar)
    • run HttpService:

      • Edit Configurations --> Add New Configuration --> Application
      • Name: HttpService
      • Main class : cn.piflow.api.Main
      • Environment Variable: SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.6(change the path to your spark home)
      • run 'HttpService'
    • test HttpService:

      • run /../piflow/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala
      • change the piflow server ip and port to your configure
  • run piflow server by release version:

    • download piflow.tar.gz:
      https://github.com/cas-bigdatalab/piflow/releases/download/v1.1/piflow-server-v1.1.tar.gz

    • unzip piflow.tar.gz:
      tar -zxvf piflow.tar.gz

    • edit config.properties

    • run start.sh、stop.sh、 restart.sh、 status.sh

    • test piflow server

      • set PIFLOW_HOME
        • vim /etc/profile
          export PIFLOW_HOME=/yourPiflowPath/bin
          export PATH=$PATH:$PIFLOW_HOME/bin

        • command
          piflow flow start example/mockDataFlow.json
          piflow flow stop appID
          piflow flow info appID
          piflow flow log appID

          piflow flowGroup start example/mockDataGroup.json
          piflow flowGroup stop groupId
          piflow flowGroup info groupId

  • how to configure config.properties

    #spark and yarn config
    spark.master=yarn
    spark.deploy.mode=cluster
    
    #hdfs default file system
    fs.defaultFS=hdfs://10.0.86.191:9000
    
    #yarn resourcemanager.hostname
    yarn.resourcemanager.hostname=10.0.86.191
    
    #if you want to use hive, set hive metastore uris
    #hive.metastore.uris=thrift://10.0.88.71:9083
    
    #show data in log, set 0 if you do not want to show data in logs
    data.show=10
    
    #server port
    server.port=8002
    
    #h2db port
    h2.port=50002

Run Piflow Web:

Restful API:

  • flow json

    flow example
    
          {
    "flow": {
      "name": "MockData",
      "executorMemory": "1g",
      "executorNumber": "1",
      "uuid": "8a80d63f720cdd2301723b7461d92600",
      "paths": [
        {
          "inport": "",
          "from": "MockData",
          "to": "ShowData",
          "outport": ""
        }
      ],
      "executorCores": "1",
      "driverMemory": "1g",
      "stops": [
        {
          "name": "MockData",
          "bundle": "cn.piflow.bundle.common.MockData",
          "uuid": "8a80d63f720cdd2301723b7461d92604",
          "properties": {
            "schema": "title:String, author:String, age:Int",
            "count": "10"
          },
          "customizedProperties": {
      }
    },
    {
      "name": "ShowData",
      "bundle": "cn.piflow.bundle.external.ShowData",
      "uuid": "8a80d63f720cdd2301723b7461d92602",
      "properties": {
        "showNumber": "5"
      },
      "customizedProperties": {
    
      }
    }
    
    ]
    }
    }
  • CURL POST:

  • Command line:

    • set PIFLOW_HOME
      vim /etc/profile
      export PIFLOW_HOME=/yourPiflowPath/piflow-bin
      export PATH=$PATH:$PIFLOW_HOME/bin

    • command example
      piflow flow start yourFlow.json
      piflow flow stop appID
      piflow flow info appID
      piflow flow log appID

      piflow flowGroup start yourFlowGroup.json
      piflow flowGroup stop groupId
      piflow flowGroup info groupId

docker-started

  • pull piflow images
    docker pull registry.cn-hangzhou.aliyuncs.com/cnic_piflow/piflow:v1.1

  • show docker images
    docker images

  • run a container with piflow imageID , all services run automatically. Please Set HOST_IP.
    docker run -h master -itd --env HOST_IP=*.*.*.* --name piflow-v1.1 -p 6001:6001 -p 6002:6002 [imageID]

  • please visit "HOST_IP:6001", it may take a while

  • if somethings goes wrong, all the application are in /opt folder

use-interface

  • Login:

  • Dashboard:

  • Flow list:

  • Create flow:

  • Configure flow:

  • Load flow:

  • Monitor flow:

  • Flow logs:

  • Group list:

  • Configure group:

  • Monitor group:

  • Process List:

  • Template List:

  • DataSource List:

  • Schedule List:

  • StopHub List:

Contact Us

  • Name:吴老师
  • Mobile Phone:18910263390
  • WeChat:18910263390
  • Email: wzs@cnic.cn
  • QQ Group:1003489545

Repository Comments ( 15 )

Sign in for post a comment

About

混合型科学大数据流水线系统,包含丰富的处理器组件,提供Shell、DSL、Web配置界面、任务调度、任务监控等功能 expand collapse
Scala and 5 more languages
Apache-2.0
Cancel

Releases

No release

PiFlow

Contributors

All

Activities

load more
can not load any more
Scala
1
https://gitee.com/opensci/piflow.git
git@gitee.com:opensci/piflow.git
opensci
piflow
PiFlow
master

Search

181749 a2d7925e 1850385 181749 9f8568a7 1850385