kyle_lu / PiFlow

PiFlow is an easy-to-use, powerful big data pipeline system.

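Features

  • Easy to use:

    • Visual pipeline configuration
    • Pipeline monitoring
    • Pipeline log viewing
    • Checkpointing
  • Highly extensible:

    • Supports custom-developed data processing components
  • High performance:

    • Built on the distributed computing engine Spark
  • Powerful:

    • Provides 100+ data processing components
    • Covers Hadoop, Spark, MLlib, Hive, Solr, Redis, MemCache, ElasticSearch, JDBC, MongoDB, HTTP, FTP, XML, CSV, JSON, and more
    • Integrates algorithms from the microbiology domain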

Architecture

Requirements

  • JDK 1.8 or later
  • Apache Maven 3.1.0 or later
  • Git client
  • Spark 2.1.0 or later
  • Hadoop 2.6.0 or later
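
The minimum versions above can be checked mechanically before building. The following is a minimal sketch, not part of PiFlow: the names `MINIMUM_VERSIONS`, `parse_version`, and `meets_minimum` are hypothetical, and the installed version strings are assumed to be supplied by the caller (for example, copied from the output of each tool's own version command).

```python
import re

# Minimum versions copied from the requirements list.
MINIMUM_VERSIONS = {
    "jdk": "1.8",
    "maven": "3.1.0",
    "spark": "2.1.0",
    "hadoop": "2.6.0",
}


def parse_version(text):
    """Extract each run of digits: '1.8.0_292' becomes (1, 8, 0, 292)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def meets_minimum(tool, installed):
    """Return True if `installed` is at least the documented minimum for `tool`."""
    required = parse_version(MINIMUM_VERSIONS[tool])
    found = parse_version(installed)
    # Pad the shorter tuple with zeros so '2.6' compares equal to '2.6.0'.
    width = max(len(required), len(found))
    required += (0,) * (width - len(required))
    found += (0,) * (width - len(found))
    return found >= required


print(meets_minimum("spark", "2.0.2"))
print(meets_minimum("hadoop", "2.6"))
```

Tuple comparison is used instead of string comparison so that a version such as 2.10.0 is correctly treated as newer than 2.6.0.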

Getting Started

How to build: mvn clean package -Dmaven.test.skip=true

      [INFO] Replacing original artifact with shaded artifact.
      [INFO] Replacing /opt/project/piflow/piflow-server/target/piflow-server-0.9.jar with /opt/project/piflow/piflow-server/target/piflow-server-0.9-shaded.jar
      [INFO] ------------------------------------------------------------------------
      [INFO] Reactor Summary:
      [INFO] 
      [INFO] piflow-project ..................................... SUCCESS [  4.602 s]
      [INFO] piflow-core ........................................ SUCCESS [ 56.533 s]
      [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
      [INFO] piflow-server ...................................... SUCCESS [03:01 min]
      [INFO] ------------------------------------------------------------------------
      [INFO] BUILD SUCCESS
      [INFO] ------------------------------------------------------------------------
      [INFO] Total time: 06:18 min
      [INFO] Finished at: 2018-12-24T16:54:16+08:00
      [INFO] Final Memory: 41M/812M
      [INFO] ------------------------------------------------------------------------

How to run PiFlow Server:

  • Using IntelliJ IDEA:

    • Edit the config.properties file
    • Build the piflow project to generate piflow-server.jar
    • Run cn.piflow.api.Main
    • Remember to set SPARK_HOME
  • Running the release version directly:

    • Download the release from https://github.com/cas-bigdatalab/piflow/releases
    • Copy the piflow-server.jar you built into the piflow_release folder (Git cannot upload files larger than 1 GB, so you must build piflow-server.jar yourself)
    • Edit the config.properties file
    • Run start.sh, or run it in the background: nohup ./start.sh > piflow.log 2>&1 &
  • How to configure config.properties

    #server ip and port
    server.ip=10.0.86.191
    server.port=8002
    h2.port=50002
    
    #spark and yarn config
    spark.master=yarn
    spark.deploy.mode=cluster
    yarn.resourcemanager.hostname=10.0.86.191
    yarn.resourcemanager.address=10.0.86.191:8032
    yarn.access.namenode=hdfs://10.0.86.191:9000
    yarn.stagingDir=hdfs://10.0.86.191:9000/tmp/
    yarn.jars=hdfs://10.0.86.191:9000/user/spark/share/lib/*.jar
    yarn.url=http://10.0.86.191:8088/ws/v1/cluster/apps/
    
    #hive config
    hive.metastore.uris=thrift://10.0.86.191:9083
    
    #piflow-server.jar path
    piflow.bundle=/opt/piflowServer/piflow-server-0.9.jar
    
    #checkpoint hdfs path
    checkpoint.path=hdfs://10.0.86.89:9000/piflow/checkpoints/
    
    #debug path
    debug.path=hdfs://10.0.88.191:9000/piflow/debug/
    
    #yarn url
    yarn.url=http://10.0.86.191:8088/ws/v1/cluster/apps/
    
    #the count of data shown in log
    data.show=10
    
    #h2 db port
    h2.port=50002
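
The sample above defines `yarn.url` and `h2.port` twice. A quick way to catch repeated keys before starting the server is sketched below. This is an illustration only: `load_properties` is a hypothetical helper, it handles just the `key=value` and `#` comment lines seen in the sample, and it assumes the reader keeps the last value of a repeated key (as `java.util.Properties` does). Whether PiFlow loads the file that way is not stated in this README.

```python
def load_properties(text):
    """Parse 'key=value' lines and return (settings, duplicates)."""
    settings = {}
    duplicates = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        key = key.strip()
        if key in settings:
            duplicates.append(key)
        settings[key] = value.strip()  # the last occurrence wins
    return settings, duplicates


# A shortened copy of the sample config.properties shown above.
SAMPLE = """
#server ip and port
server.ip=10.0.86.191
server.port=8002
h2.port=50002
yarn.url=http://10.0.86.191:8088/ws/v1/cluster/apps/
#yarn url
yarn.url=http://10.0.86.191:8088/ws/v1/cluster/apps/
#h2 db port
h2.port=50002
"""

settings, duplicates = load_properties(SAMPLE)
print(duplicates)
```

In the sample both repeated keys carry identical values, so the duplication is harmless, but a later edit to only one copy would silently change which value takes effect.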

How to run PiFlow Web:

How to use:

  • Command-line mode
    • Sample pipeline configuration

      {
        "flow":{
        "name":"test",
        "uuid":"1234",
        "checkpoint":"Merge",
        "stops":[
        {
          "uuid":"1111",
          "name":"XmlParser",
          "bundle":"cn.piflow.bundle.xml.XmlParser",
          "properties":{
              "xmlpath":"hdfs://10.0.86.89:9000/xjzhu/dblp.mini.xml",
              "rowTag":"phdthesis"
          }
        },
        {
          "uuid":"2222",
          "name":"SelectField",
          "bundle":"cn.piflow.bundle.common.SelectField",
          "properties":{
              "schema":"title,author,pages"
          }
      
        },
        {
          "uuid":"3333",
          "name":"PutHiveStreaming",
          "bundle":"cn.piflow.bundle.hive.PutHiveStreaming",
          "properties":{
              "database":"sparktest",
              "table":"dblp_phdthesis"
          }
        },
        {
          "uuid":"4444",
          "name":"CsvParser",
          "bundle":"cn.piflow.bundle.csv.CsvParser",
          "properties":{
              "csvPath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.csv",
              "header":"false",
              "delimiter":",",
              "schema":"title,author,pages"
          }
        },
        {
          "uuid":"555",
          "name":"Merge",
          "bundle":"cn.piflow.bundle.common.Merge",
          "properties":{
            "inports":"data1,data2"
          }
        },
        {
          "uuid":"666",
          "name":"Fork",
          "bundle":"cn.piflow.bundle.common.Fork",
          "properties":{
            "outports":"out1,out2,out3"
          }
        },
        {
          "uuid":"777",
          "name":"JsonSave",
          "bundle":"cn.piflow.bundle.json.JsonSave",
          "properties":{
            "jsonSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.json"
          }
        },
        {
          "uuid":"888",
          "name":"CsvSave",
          "bundle":"cn.piflow.bundle.csv.CsvSave",
          "properties":{
            "csvSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis_result.csv",
            "header":"true",
            "delimiter":","
          }
        }
      ],
      "paths":[
        {
          "from":"XmlParser",
          "outport":"",
          "inport":"",
          "to":"SelectField"
        },
        {
          "from":"SelectField",
          "outport":"",
          "inport":"data1",
          "to":"Merge"
        },
        {
          "from":"CsvParser",
          "outport":"",
          "inport":"data2",
          "to":"Merge"
        },
        {
          "from":"Merge",
          "outport":"",
          "inport":"",
          "to":"Fork"
        },
        {
          "from":"Fork",
          "outport":"out1",
          "inport":"",
          "to":"PutHiveStreaming"
        },
        {
          "from":"Fork",
          "outport":"out2",
          "inport":"",
          "to":"JsonSave"
        },
        {
          "from":"Fork",
          "outport":"out3",
          "inport":"",
          "to":"CsvSave"
        }
      ]

      } }
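
In the configuration above, each entry in "paths" refers to stops by their "name", and "checkpoint" names the stop whose output is checkpointed. A mistyped name is easy to miss in a file of this size, so a small consistency check can help before submitting it. The sketch below is not part of PiFlow: `check_flow` is a hypothetical helper, and treating "checkpoint" as a comma-separated list of stop names is an assumption.

```python
import json


def check_flow(flow_json):
    """Return a list of human-readable problems found in a flow definition."""
    flow = json.loads(flow_json)["flow"]
    stop_names = [stop["name"] for stop in flow["stops"]]
    known = set(stop_names)
    problems = []
    if len(known) != len(stop_names):
        problems.append("duplicate stop names")
    # Every path must start and end at a declared stop.
    for path in flow["paths"]:
        for end in ("from", "to"):
            if path[end] not in known:
                problems.append(
                    f"path {end!r} refers to unknown stop {path[end]!r}"
                )
    # Assumption: "checkpoint" holds one or more comma-separated stop names.
    for name in filter(None, flow.get("checkpoint", "").split(",")):
        if name.strip() not in known:
            problems.append(f"checkpoint refers to unknown stop {name.strip()!r}")
    return problems


# A cut-down flow with a deliberate mistake: "XmlParser" is never declared.
SAMPLE = """
{
  "flow": {
    "name": "test",
    "uuid": "1234",
    "checkpoint": "Merge",
    "stops": [
      {"uuid": "4444", "name": "CsvParser",
       "bundle": "cn.piflow.bundle.csv.CsvParser", "properties": {}},
      {"uuid": "555", "name": "Merge",
       "bundle": "cn.piflow.bundle.common.Merge",
       "properties": {"inports": "data1,data2"}}
    ],
    "paths": [
      {"from": "CsvParser", "outport": "", "inport": "data2", "to": "Merge"},
      {"from": "XmlParser", "outport": "", "inport": "data1", "to": "Merge"}
    ]
  }
}
"""

for problem in check_flow(SAMPLE):
    print(problem)
```

An empty result only means the names are consistent with each other; it does not confirm that the bundles exist or that the properties are valid for each component.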

    • Run command
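
The command itself is not reproduced in this README. As a rough sketch of what submitting the sample configuration over HTTP might look like, the code below builds (but does not send) a POST request to the address given by `server.ip` and `server.port` in config.properties. The endpoint path `/flow/start` and the helper `build_start_request` are assumptions made for illustration and are not confirmed by this document; check the PiFlow server API before relying on them.

```python
import json
import urllib.request


def build_start_request(server_ip, server_port, flow, path="/flow/start"):
    """Build a POST request carrying the flow JSON.

    `path` is a hypothetical endpoint, not taken from this README.
    """
    body = json.dumps(flow).encode("utf-8")
    url = f"http://{server_ip}:{server_port}{path}"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# server.ip and server.port as given in the sample config.properties.
request = build_start_request("10.0.86.191", 8002, {"flow": {"name": "test"}})
print(request.get_method(), request.full_url)

# To actually submit, pass the request to urllib.request.urlopen()
# against a reachable PiFlow server.
```

Building the request separately from sending it makes it possible to inspect the URL and body without a running server.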

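  • Access PiFlow Web: a trial instance is available at "http://piflow.ml/piflow-web", user/password: admin/admin
    • Log in
    • Pipeline list
    • Pipeline configuration
    • Pipeline resource configuration
    • Run a pipeline
    • Delete a pipeline
    • Save a pipeline as a template
    • Create a pipeline: click the Create button and enter the pipeline name and description; the resources the pipeline needs can be set at the same time.
    • Configure a pipeline: pipelines are configured by drag and drop, similar to Visio.
    • Search for pipeline components: the left sidebar of the canvas lists component groups and components and supports keyword search; after choosing a component, drag it to the center of the canvas.
    • Pipeline basic information: the right side of the canvas shows the pipeline's basic information, including its name and description.
    • Pipeline configuration: select any data processing component in the canvas, and the right side shows its basic information, including name, description, and author. Select the AttributeInfo tab to display the component's properties, which can be configured as needed.
    • Run a pipeline: once the pipeline is configured, click the Run button to run it.
    • Pipeline monitoring: open the pipeline monitoring page, which shows the execution status of the whole pipeline, including run state, progress, and execution time. Click a specific data processing component to see its run state and execution time.
    • View pipeline logs
    • Running pipeline list: pipelines that have been run appear in the Process List with their start time, end time, progress, and status. From there they can be viewed, rerun, stopped, and deleted.
    • Run a pipeline from a checkpoint
    • Create and save a template
    • Template list
    • Download a template: the template is saved locally as an XML file
    • Upload a template
    • Load a template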

BSD 2-Clause License

Copyright (c) 2018, Zhihong SHEN
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

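About

A hybrid scientific big data pipeline system with a rich set of processor components. It provides a shell, a DSL, a web configuration interface, task scheduling, and task monitoring.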