Fregata is a lightweight, super fast, large-scale machine learning library built on Apache Spark, and it provides high-level APIs in Scala.
More accurate: For various problems, Fregata can achieve higher accuracy than MLlib.
Higher speed: For Generalized Linear Models, Fregata often converges in a single epoch over the data. For a 1 billion × 1 billion dataset, Fregata can train a Generalized Linear Model in 1 minute with memory caching, or in 10 minutes without it. Fregata is usually 10-100 times faster than MLlib.
Parameter free: Fregata uses the GSA SGD optimizer, which doesn't require learning-rate tuning, because it computes an appropriate learning rate during training. When confronted with very high-dimensional problems, Fregata dynamically measures the remaining memory to determine the sparseness of the output, balancing accuracy and efficiency automatically. Both features allow Fregata to be treated as a standard module in data processing pipelines for different problems.
Lighter weight: Fregata uses only Spark's standard APIs, which allows it to be integrated into most Spark-based data processing flows quickly and seamlessly.
This documentation is about Fregata version 0.1.
Fregata supports Spark 1.x and 2.x with Scala 2.10 and Scala 2.11.
There are two ways to get Fregata, via Maven or SBT:
<dependency>
    <groupId>com.talkingdata.fregata</groupId>
    <artifactId>core</artifactId>
    <version>0.0.3</version>
</dependency>
<dependency>
    <groupId>com.talkingdata.fregata</groupId>
    <artifactId>spark</artifactId>
    <version>0.0.3</version>
</dependency>
// if you deployed Fregata to your local Maven repository, please add:
// resolvers += Resolver.mavenLocal
libraryDependencies += "com.talkingdata.fregata" % "core" % "0.0.3"
libraryDependencies += "com.talkingdata.fregata" % "spark" % "0.0.3"
If you want to manually deploy Fregata to your local Maven repository, do the following:
git clone https://github.com/TalkingData/Fregata.git
cd Fregata
mvn clean package install
Assuming you're familiar with Spark, the example below shows how to use Fregata's Logistic Regression. The experimental data can be obtained from the LIBSVM Data page.
import fregata.spark.data.LibSvmReader
import fregata.spark.metrics.classification.{AreaUnderRoc, Accuracy}
import fregata.spark.model.classification.LogisticRegression
import org.apache.spark.{SparkConf, SparkContext}

// sc is an existing SparkContext; trainPath / testPath point to LIBSVM files on HDFS,
// and numFeatures is the number of features per instance
val (_, trainData) = LibSvmReader.read(sc, trainPath, numFeatures.toInt)
val (_, testData)  = LibSvmReader.read(sc, testPath, numFeatures.toInt)

// train the model and predict on the test set
val model = LogisticRegression.run(trainData)
val pd = model.classPredict(testData)

// pd pairs each (instance, label) with its (probability, predicted class);
// AUC is computed from the predicted probability and the true label
val auc = AreaUnderRoc.of( pd.map {
  case ((x, l), (p, c)) =>
    p -> l
})
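The snippet above assumes an existing SparkContext and externally supplied parameters. Below is a minimal sketch of that setup (the application name, paths, and feature count are placeholders, not part of Fregata), plus an accuracy computation that presumably mirrors the AUC call, since Accuracy is imported alongside AreaUnderRoc:

// hypothetical setup assumed by the example; paths and feature count are placeholders
val conf = new SparkConf().setAppName("FregataLRExample")
val sc = new SparkContext(conf)
val trainPath   = "hdfs:///data/a9a"    // LIBSVM-format training file (placeholder)
val testPath    = "hdfs:///data/a9a.t"  // LIBSVM-format test file (placeholder)
val numFeatures = "123"                 // feature count, typically passed in as a String argument

// assumed to follow the same pattern as AreaUnderRoc.of,
// pairing the predicted class with the true label
val acc = Accuracy.of( pd.map {
  case ((x, l), (p, c)) =>
    c -> l
})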
Fregata's training API expects an RDD[(fregata.Vector, fregata.Num)]; the prediction API accepts either the same type or an unlabeled RDD[fregata.Vector]. These types are aliases defined in Fregata's package object:
import breeze.linalg.{Vector => BVector, SparseVector => BSparseVector, DenseVector => BDenseVector}
import fregata.vector.{SparseVector => VSparseVector}

package object fregata {
  type Num = Double
  type Vector = BVector[Num]
  type SparseVector = BSparseVector[Num]
  type SparseVector2 = VSparseVector[Num]
  type DenseVector = BDenseVector[Num]
  def zeros(n:Int) = BDenseVector.zeros[Num](n)
  def norm(x:Vector) = breeze.linalg.norm(x,2.0)
  def asNum(v:Double) : Num = v
}
// sc is the SparkContext
// path is the location of the input data on HDFS
// numFeatures is the number of features per instance
// minPartition is the minimum number of partitions for the returned RDD over the input data
read(sc:SparkContext, path:String, numFeatures:Int = -1, minPartition:Int = -1):(Int, RDD[(fregata.Vector, fregata.Num)])
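For data already in LIBSVM format, a call might look like the sketch below (the HDFS path is a placeholder); the returned Int, discarded with _ in the earlier example, is presumably the detected number of features:

// the path is a placeholder; numFeatures and minPartition fall back to their -1 defaults
val (detectedFeatures, trainData) = LibSvmReader.read(sc, "hdfs:///data/train.libsvm")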
If your data is not in LIBSVM format, the RDD needs to be constructed manually, as shown below.
// indices is a 0-based Array holding the positions of the non-zero features
// values is an Array storing the value at each position in indices
// length is the total number of features per instance
// label is the instance's label
// input data with labels
sc.textFile(input).map{ line =>
  val indices = ...
  val values = ...
  val label = ...
  ...
  (new SparseVector(indices, values, length).asInstanceOf[Vector], asNum(label))
}
// input data without labels (for the prediction API only)
sc.textFile(input).map{ line =>
  val indices = ...
  val values = ...
  ...
  new SparseVector(indices, values, length).asInstanceOf[Vector]
}
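For concreteness, here is a minimal sketch of the labeled sparse case, assuming a hypothetical LIBSVM-like "label index:value index:value ..." text format and a known total feature count length; the parsing details are illustrative, not part of Fregata's API:

// hedged sketch: parse a hypothetical "label index:value ..." line
// (indices are assumed 0-based and sorted ascending);
// `length` is the known total feature count, an assumption of this sketch
sc.textFile(input).map { line =>
  val parts = line.trim.split("\\s+")
  val label = parts.head.toDouble
  val (indices, values) = parts.tail.map { kv =>
    val Array(i, v) = kv.split(":")
    (i.toInt, v.toDouble)
  }.unzip
  (new SparseVector(indices, values, length).asInstanceOf[Vector], asNum(label))
}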
// values is an Array holding the value of each feature
// label is the instance's label
// input data with labels
sc.textFile(input).map{ line =>
  val values = ...
  val label = ...
  ...
  (new DenseVector(values).asInstanceOf[Vector], asNum(label))
}
// input data without labels (for the prediction API only)
sc.textFile(input).map{ line =>
  val values = ...
  ...
  new DenseVector(values).asInstanceOf[Vector]
}
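Likewise, a minimal sketch for the labeled dense case, assuming a hypothetical comma-separated format with the label in the first column:

// hedged sketch: parse a hypothetical "label,v1,v2,...,vN" line into a labeled dense instance
sc.textFile(input).map { line =>
  val fields = line.trim.split(",").map(_.toDouble)
  val label  = fields.head
  val values = fields.tail
  (new DenseVector(values).asInstanceOf[Vector], asNum(label))
}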
Contributed by TalkingData.