# MinIO Spark Select
MinIO Spark Select enables retrieving only the required data from an object using the S3 Select API.
## Requirements
This library requires:
- Spark 2.3+
- Scala 2.11+
## Features
- S3 Select is supported with CSV, JSON and Parquet files using `minioSelectCSV`, `minioSelectJSON` and `minioSelectParquet` values to specify the data format.
- S3 Select supports select on multiple objects; see the sketch after this list.
- S3 Select supports querying SSE-C encrypted objects.
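A minimal sketch of the multiple-object case, assuming a hypothetical bucket where several CSV objects share the prefix `s3://mybucket/logs/`; pointing `load` at the prefix is expected to cover every object beneath it:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Hypothetical bucket, prefix, and schema.
val spark = SparkSession.builder().appName("multi-object-select").getOrCreate()
val schema = StructType(List(
  StructField("name", StringType, true),
  StructField("age", IntegerType, false)
))
val df = spark.read
  .format("minioSelectCSV")
  .schema(schema)
  .load("s3://mybucket/logs/") // prefix covering multiple objects
df.show()
```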
### Limitations
- Spark CSV and JSON options such as `nanValue`, `positiveInf`, `negativeInf`, and options related to corrupt records (for example, `failfast` and `dropmalformed` mode) are not supported.
- Using commas (,) within decimals is not supported. For example, 10,000 is not supported and 10000 is.
- The following filters are not pushed down to MinIO (a sketch contrasting a pushed-down filter with a local one follows this list):
  - Aggregate functions such as `COUNT()` and `SUM()`.
  - Filters that `CAST()` an attribute. For example, `CAST(stringColumn as INT) = 1`.
  - Filters with an attribute that is an object or is complex. For example, `intArray[1] = 1`, `objectColumn.objectNumber = 1`.
  - Filters for which the value is not a literal value. For example, `intColumn1 = intColumn2`.
- Only the Select [supported data types](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-data-types.html) are supported, with the documented limitations.
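A minimal sketch of the pushdown boundary above, assuming a DataFrame `df` loaded via `minioSelectCSV` with the hypothetical `name`/`age` schema used elsewhere in this README:
```scala
// Pushed down to MinIO: a plain attribute-vs-literal comparison.
df.filter("age > 19").show()

// Not pushed down: the CAST() means Spark fetches the column and filters locally.
df.filter("CAST(name AS INT) = 1").show()
```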
### HowTo
Include this package in your Spark Applications using:
#### *spark-shell*, *pyspark*, or *spark-submit*
```
> $SPARK_HOME/bin/spark-shell --packages io.minio:spark-select_2.11:2.1
```
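The shell still needs to know where MinIO lives. A minimal sketch, assuming a local MinIO at `http://localhost:9000` with the default `minioadmin` credentials (hypothetical values for illustration), setting the standard Hadoop S3A configuration keys from inside `spark-shell`:
```scala
// Hypothetical endpoint and credentials; adjust to your deployment.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.endpoint", "http://localhost:9000")
hc.set("fs.s3a.access.key", "minioadmin")
hc.set("fs.s3a.secret.key", "minioadmin")
hc.set("fs.s3a.path.style.access", "true") // MinIO is typically addressed path-style
```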
#### *sbt*
If you use the [sbt-spark-package plugin](http://github.com/databricks/sbt-spark-package), in your sbt build file, add:
```
spDependencies += "minio/spark-select:2.1"
```
Otherwise,
```
libraryDependencies += "io.minio" % "spark-select_2.11" % "2.1"
```
#### *Maven*
In your pom.xml, add:
```xml
<dependency>
    <groupId>io.minio</groupId>
    <artifactId>spark-select_2.11</artifactId>
    <version>2.1</version>
</dependency>
```
#### *Source*
Set up all required environment variables.
> NOTE: It is assumed that you have already installed hadoop-2.8.5 and spark-2.3.1 locally.
```
export HADOOP_HOME=${HOME}/spark/hadoop-2.8.5/
export PATH=${PATH}:${HADOOP_HOME}/bin
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export SPARK_HOME=${HOME}/spark/spark-2.3.1-bin-without-hadoop/
export PATH=${PATH}:${SPARK_HOME}/bin
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
git clone https://github.com/minio/spark-select
cd spark-select
sbt assembly
spark-shell --jars target/scala-2.11/spark-select-assembly-2.1.jar
```
Once the `spark-shell` has been successfully invoked, load one of the bundled examples:
```
scala> :load examples/csv.scala
Loading examples/csv.scala...
import org.apache.spark.sql._
import org.apache.spark.sql.types._
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,false))
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
+-------+---+
| name|age|
+-------+---+
|Michael| 31|
| Andy| 30|
| Justin| 19|
+-------+---+
scala>
```
### API
#### *PySpark*
```py
(spark
    .read
    .format("minioSelectCSV")  # "minioSelectJSON" for JSON or "minioSelectParquet" for Parquet
    .schema(...)               # mandatory
    .options(...)              # optional
    .load("s3://path/to/my/datafiles"))
```
#### *R*
```
read.df("s3://path/to/my/datafiles", "minioSelectCSV", schema)
```
#### *Scala*
```
spark
  .read
  .format("minioSelectCSV") // "minioSelectJSON" for JSON or "minioSelectParquet" for Parquet
  .schema(...) // mandatory
  .options(...) // optional. Examples:
  // .options(Map("quote" -> "\'", "header" -> "true")) or
  // .option("quote", "\'").option("header", "true")
  .load("s3://path/to/my/datafiles")
```
#### *SQL*
```
CREATE TEMPORARY VIEW MyView (number INT, name STRING) USING minioSelectCSV OPTIONS (path "s3://path/to/my/datafiles")
```
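Once registered, the view can be queried like any other table. A short sketch from Scala (the predicate is a hypothetical example):
```scala
// Query the temporary view defined above.
spark.sql("SELECT name FROM MyView WHERE number > 10").show()
```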
### Options
The following options are available when using `minioSelectCSV` and `minioSelectJSON`. If not specified, default values are used.
#### *Options with minioSelectCSV*
| Option | Default | Usage |
|---|---|---|
| `compression` | "none" | Indicates whether compression is used. "gzip" and "bzip2" are supported in addition to "none". |
| `delimiter` | "," | Specifies the field delimiter. |
| `quote` | '"' | Specifies the quote character. Specifying an empty string is not supported and results in a malformed XML error. |
| `escape` | '"' | Specifies the quote escape character. |
| `header` | "true" | "false" specifies that there is no header; "true" specifies that a header is in the first line. Only headers in the first line are supported, and empty lines before a header are not supported. |
| `comment` | "#" | Specifies the comment character. |
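A short sketch combining several of these CSV options, assuming a hypothetical gzip-compressed, pipe-delimited object and a `schema` defined as in the full examples below:
```scala
val df = spark.read
  .format("minioSelectCSV")
  .schema(schema)                      // mandatory, as elsewhere in this README
  .option("compression", "gzip")       // object is gzip-compressed
  .option("delimiter", "|")            // pipe-delimited fields
  .option("header", "true")            // first line is a header
  .load("s3://mybucket/people.csv.gz") // hypothetical object
df.show()
```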
#### *Options with minioSelectJSON*
| Option | Default | Usage |
|---|---|---|
| `compression` | "none" | Indicates whether compression is used. "gzip" and "bzip2" are supported in addition to "none". |
| `multiline` | "false" | "false" specifies that the JSON is in Select LINES format, meaning that each line in the input data contains a single JSON object. "true" specifies that the JSON is in Select DOCUMENT format, meaning that a JSON object can span multiple lines in the input data. |
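For JSON in Select DOCUMENT format, where a single object can span multiple lines, `multiline` must be enabled. A minimal sketch with a hypothetical object and a `schema` as in the full examples below:
```scala
val df = spark.read
  .format("minioSelectJSON")
  .schema(schema)                    // mandatory
  .option("multiline", "true")       // Select DOCUMENT format
  .load("s3://mybucket/people.json") // hypothetical object
df.show()
```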
#### *Options with minioSelectParquet*
There are no **options** needed with Parquet files.
### Full Examples
#### *Scala*
Schema with two columns for `CSV`.
```scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._
object app {
  def main(args: Array[String]) {
    // In spark-shell a `spark` session already exists; a standalone app builds its own.
    val spark = SparkSession.builder()
      .appName("spark-select CSV example")
      .getOrCreate()

    val schema = StructType(
      List(
        StructField("name", StringType, true),
        StructField("age", IntegerType, false)
      )
    )

    val df = spark
      .read
      .format("minioSelectCSV")
      .schema(schema)
      .load("s3://sjm-airlines/people.csv")

    // show() prints the rows itself and returns Unit, so no println wrapper is needed.
    df.show()
    df.select("*").filter("age > 19").show()
  }
}
```
With custom schema for `JSON`.
```scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._
object app {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("spark-select JSON example")
      .getOrCreate()

    val schema = StructType(
      List(
        StructField("name", StringType, true),
        StructField("age", IntegerType, false)
      )
    )

    val df = spark
      .read
      .format("minioSelectJSON")
      .schema(schema)
      .load("s3://sjm-airlines/people.json")

    df.show()
    df.select("*").filter("age > 19").show()
  }
}
```
With custom schema for `Parquet`.
```scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._
object app {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("spark-select Parquet example")
      .getOrCreate()

    val schema = StructType(
      List(
        StructField("name", StringType, true),
        StructField("age", IntegerType, false)
      )
    )

    val df = spark
      .read
      .format("minioSelectParquet")
      .schema(schema)
      .load("s3://sjm-airlines/people.parquet")

    df.show()
    df.select("*").filter("age > 19").show()
  }
}
```
#### *Python*
Schema with two columns for `CSV`.
```py
from pyspark.sql import *
from pyspark.sql.types import *
if __name__ == "__main__":
    # create SparkSession
    spark = SparkSession.builder \
        .master("local") \
        .appName("spark-select in python") \
        .getOrCreate()

    # filtered schema
    st = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), False),
    ])

    df = spark \
        .read \
        .format('minioSelectCSV') \
        .schema(st) \
        .load("s3://testbucket/people.csv")

    # show all rows.
    df.show()

    # show only filtered rows.
    df.select("*").filter("age > 19").show()
```
Then submit the application with `spark-submit`, passing the package coordinates (replace `app.py` with the path to your script):
```
> $SPARK_HOME/bin/spark-submit --packages io.minio:spark-select_2.11:2.1 app.py
```