Apache Spark is a general-purpose parallel computation engine for analytics at scale. At its core, it has a batch design and can work with disparate data sources. While this provides rich, unified access to data, it can also be inefficient and expensive: analytic processing requires massive data sets to be repeatedly copied and reformatted to suit Spark, and in many cases Spark ultimately fails to deliver on the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, the entire table must be streamed into Spark to do the aggregation. Caching within Spark is immutable, so cached data quickly becomes stale.
At SnappyData, we take a very different approach. SnappyData fuses a low-latency, highly available in-memory transactional database (GemFireXD) into Spark, with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation. The net effect is an order of magnitude performance improvement over native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.
Essentially, we turn Spark into an in-memory operational database capable of transactions, point reads and writes, working with Spark Streams, and running analytic SQL queries. Put another way, SnappyData is an in-memory, scale-out hybrid database that can execute Spark code, SQL, or even objects.
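As a minimal sketch of what this looks like in practice, a `SnappySession` (SnappyData's extension of Spark's `SparkSession`) can create a column table in the embedded store and query it with plain SQL. The table name and schema below are purely illustrative, and the snippet assumes a Spark build with the SnappyData dependency on the classpath:

```scala
import org.apache.spark.sql.{SnappySession, SparkSession}

object QuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SnappySketch")
      .master("local[*]")
      .getOrCreate()

    // SnappySession wraps the SparkContext and adds the in-memory store
    val snappy = new SnappySession(spark.sparkContext)

    // Create a column table in the embedded store (names are illustrative)
    snappy.sql("CREATE TABLE ad_impressions (id LONG, clicks INT) USING column")
    snappy.sql("INSERT INTO ad_impressions VALUES (1, 10), (2, 20)")

    // Run an analytic SQL query directly against the in-memory table
    snappy.sql("SELECT sum(clicks) FROM ad_impressions").show()
  }
}
```

The same session also supports the regular DataFrame and Dataset APIs, so SQL and Spark code can be mixed freely against the same tables.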
If you are already using Spark, you can experience up to a 20x speedup in query performance. Try out this test.
We provide multiple options for getting started with SnappyData. If you are already using Spark 2.0+, the easiest option is to add SnappyData as a package dependency. You can find more information on options for running SnappyData here.
Refer to the 5-minute guide, which is intended for both first-time and experienced SnappyData users. It provides references and common examples to help you get started quickly!
To understand SnappyData and its features, refer to the documentation.
We monitor the channels listed below for comments and questions:

- Stackoverflow
- Slack
- Gitter
- Mailing List
- Reddit
- JIRA
Using Maven Dependency

SnappyData artifacts are hosted in Maven Central. You can add a Maven dependency with the following coordinates:
```
groupId: io.snappydata
artifactId: snappydata-core_2.11
version: 0.7
```

```
groupId: io.snappydata
artifactId: snappydata-cluster_2.11
version: 0.7
```
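For reference, the core coordinates above translate into a POM dependency entry like the following (the cluster artifact follows the same pattern with its own `artifactId`):

```xml
<dependency>
  <groupId>io.snappydata</groupId>
  <artifactId>snappydata-core_2.11</artifactId>
  <version>0.7</version>
</dependency>
```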
Using sbt

If you are using sbt, add this line to your build.sbt for core SnappyData artifacts:
```
libraryDependencies += "io.snappydata" % "snappydata-core_2.11" % "0.7"
```
For additions related to SnappyData cluster, use:
```
libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "0.7"
```
You can find more specific SnappyData artifacts here.
Here is a streams + transactions + analytics use case that illustrates both the SQL and the Spark programming approaches in SnappyData: the Ad Analytics code example. Here is a screencast that showcases many useful features of SnappyData. The example also includes a benchmark comparing SnappyData to a hybrid in-memory database and Cassandra.
If you are interested in contributing, please visit the community page for ways in which you can help.