
# Documentation

## Saving datasets to Cassandra

It is possible to save any RDD to Cassandra, not just a `CassandraRDD`. The only requirement is that the RDD's item class is a tuple or has property names corresponding to Cassandra column names.

To save an RDD, import `com.datastax.spark.connector._` and call the `saveToCassandra` method with the keyspace name, the table name, and a list of columns. Make sure to include at least all primary key columns.
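The examples below write to a `test.words` table. A minimal sketch of a compatible schema follows; the exact definition comes from the connector's quick-start setup, so treat the details here as assumptions:

```sql
-- Hypothetical schema compatible with the examples below.
-- word is the sole primary key column, so the column list passed to
-- saveToCassandra must always include it.
CREATE KEYSPACE IF NOT EXISTS test
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS test.words (
  word text PRIMARY KEY,
  count int
);
```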

### Saving a collection of tuples

```scala
import com.datastax.spark.connector._

val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
```
```
cqlsh:test> select * from words;

 word | count
------+-------
  bar |    20
  foo |    10
  cat |    30
  fox |    40

(4 rows)
```

### Saving a collection of objects

When saving a collection of objects of a user-defined class, the items to be saved must provide appropriately named public property accessors for getting every column to be saved. More information on property-column naming conventions is described here.

```scala
case class WordCount(word: String, count: Long)
val collection = sc.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 60)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
```
```
cqlsh:test> select * from words;

 word | count
------+-------
  bar |    20
  foo |    10
  cat |    30
  fox |    40
  dog |    50
  cow |    60

(6 rows)
```

The driver will execute a CQL INSERT statement for every object in the RDD, grouped in unlogged batches. The consistency level for writes is ONE.
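Conceptually, each unlogged batch corresponds to CQL like the following. This is a hand-written illustration of the idea, not output produced by the connector:

```sql
-- Illustration only: the connector builds and executes equivalent
-- statements internally through the driver; it does not emit this text.
BEGIN UNLOGGED BATCH
  INSERT INTO test.words (word, count) VALUES ('dog', 50);
  INSERT INTO test.words (word, count) VALUES ('cow', 60);
APPLY BATCH;
```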

### Tuning

The following properties set in `SparkConf` can be used to fine-tune the saving process; a configuration sketch follows the list:

- `spark.cassandra.output.batch.size.rows`: number of rows per single batch; the default is `auto`, which means the connector adjusts the number of rows based on the amount of data in each row
- `spark.cassandra.output.batch.size.bytes`: maximum total size of the batch in bytes; defaults to 64 kB
- `spark.cassandra.output.concurrent.writes`: maximum number of batches executed in parallel by a single Spark task; defaults to 5
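A minimal sketch of setting these properties when building the Spark context. The host address and property values here are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; tune batch sizes and concurrency for your workload.
val conf = new SparkConf()
  .setAppName("save-to-cassandra-example")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumed local node
  .set("spark.cassandra.output.batch.size.rows", "auto")
  .set("spark.cassandra.output.batch.size.bytes", "65536") // 64 kB
  .set("spark.cassandra.output.concurrent.writes", "5")

val sc = new SparkContext(conf)
```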

Next - Customizing the object mapping
