# pyspark-tutorial
**Repository Path**: jianchengss/pyspark-tutorial
## Basic Information
- **Project Name**: pyspark-tutorial
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-03-05
- **Last Updated**: 2024-05-30
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# PySpark Tutorial
* PySpark is the Python API for Spark.
* The purpose of this PySpark tutorial is to present
basic distributed algorithms using PySpark.
* PySpark supports two types of Data Abstractions:
  * RDDs
  * DataFrames
* **PySpark Interactive Mode**: the interactive shell
(`$SPARK_HOME/bin/pyspark`) is intended for basic testing
and debugging; it is not meant to be used
in production environments.
* **PySpark Batch Mode**: use the `$SPARK_HOME/bin/spark-submit`
command to run PySpark programs (suitable for both
testing and production environments).
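As a rough illustration of batch mode, the sketch below is a small word-count program that touches both abstractions: it counts words with an RDD and then views the result as a DataFrame. It is not taken from this repository; the file name `wordcount_sketch.py`, the `tokenize` helper, and the application name are assumptions made for the example.

```python
import re
import sys


def tokenize(line):
    """Lowercase a line and return its alphanumeric words."""
    return re.findall(r"[a-z0-9]+", line.lower())


def main(input_path):
    # Imported here so that tokenize() can be tried without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
    try:
        # RDD abstraction: one element per line of the input text.
        lines = spark.sparkContext.textFile(input_path)
        counts = (
            lines.flatMap(tokenize)
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
        )
        # DataFrame abstraction: the same result with named columns.
        df = counts.toDF(["word", "count"])
        df.orderBy("count", ascending=False).show(10)
    finally:
        spark.stop()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Usage: spark-submit wordcount_sketch.py <input-path>",
              file=sys.stderr)
```

Assuming the code is saved under the hypothetical name above, it could be run with `$SPARK_HOME/bin/spark-submit wordcount_sketch.py <input-path>`. The same statements inside `main` can also be typed one by one into the interactive shell, where a `spark` session is already available.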
------
# [Glossary: big data, MapReduce, Spark](https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/slides/glossary/README.md)
------
# [Basics of PySpark with Examples](./howto/README.md)
------
# PySpark Examples and Tutorials
* [PySpark Examples: RDDs](./tutorial/pyspark-examples/rdds/)
* [PySpark Examples: DataFrames](./tutorial/pyspark-examples/dataframes/)
* [DNA Base Counting](./tutorial/dna-basecount/README.md)
* [Classic Word Count](./tutorial/wordcount)
* [Find Frequency of Bigrams](./tutorial/bigrams)
* [Join of Two Relations R(K, V1), S(K, V2)](./tutorial/basic-join)
* [Basic Mapping of RDD Elements](./tutorial/basic-map)
* [How to add all RDD elements together](./tutorial/basic-sum)
* [How to multiply all RDD elements together](./tutorial/basic-multiply)
* [Find Top-N and Bottom-N](./tutorial/top-N)
* [Find average by using combineByKey()](./tutorial/combine-by-key)
* [How to filter RDD elements](./tutorial/basic-filter)
* [How to find average](./tutorial/basic-average)
* [Cartesian Product: rdd1.cartesian(rdd2)](./tutorial/cartesian)
* [Sort By Key: sortByKey() ascending/descending](./tutorial/basic-sort)
* [How to Add Indices](./tutorial/add-indices)
* [Map Partitions: mapPartitions() by Examples](./tutorial/map-partitions/README.md)
* [Monoid: Design Principle](https://github.com/mahmoudparsian/data-algorithms-with-spark/blob/master/wiki-spark/docs/monoid/README.md)
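
To give a flavor of the examples listed above, here is a minimal sketch of finding an average per key with `combineByKey()`. It is not the code from the linked tutorial. All function names are hypothetical, and `average_by_key_local` is only a pure-Python stand-in that mimics how Spark applies the three functions within and across partitions. The accumulator is a `(sum, count)` pair, because sums and counts can be merged safely across partitions, whereas averages cannot.

```python
def create_combiner(value):
    """Start a (sum, count) accumulator from the first value of a key."""
    return (value, 1)


def merge_value(acc, value):
    """Fold one more value into an accumulator within a partition."""
    total, count = acc
    return (total + value, count + 1)


def merge_combiners(left, right):
    """Merge two accumulators that came from different partitions."""
    return (left[0] + right[0], left[1] + right[1])


def average_by_key_local(partitions):
    """Pure-Python stand-in: apply the three functions to lists of pairs."""
    merged = {}
    for partition in partitions:
        combined = {}
        for key, value in partition:
            if key in combined:
                combined[key] = merge_value(combined[key], value)
            else:
                combined[key] = create_combiner(value)
        for key, acc in combined.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return {key: total / count for key, (total, count) in merged.items()}


def average_by_key_spark(sc, pairs):
    """The same computation on an RDD; `sc` is an existing SparkContext."""
    sums_counts = sc.parallelize(pairs).combineByKey(
        create_combiner, merge_value, merge_combiners
    )
    return sums_counts.mapValues(lambda acc: acc[0] / acc[1]).collectAsMap()
```

For instance, `average_by_key_local([[("a", 1), ("b", 4)], [("a", 3), ("b", 6)]])` combines the two partitions into per-key sums and counts before dividing. In the PySpark shell, `average_by_key_spark(sc, [("a", 1), ("a", 3)])` would run the same three functions on the cluster.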
------
# Books
### [Data Algorithms with Spark](https://github.com/mahmoudparsian/data-algorithms-with-spark/)
### [Data Algorithms](https://github.com/mahmoudparsian/data-algorithms-book/)
### [PySpark Algorithms](https://github.com/mahmoudparsian/pyspark-algorithms/)
-----
# Miscellaneous
### [Download, Install Spark and Run PySpark](./howto/download_install_run_spark.md)
### [How to Minimize the Verbosity of Spark](./howto/minimize_verbosity.md)
-------
# PySpark Tutorials and References
* [Getting started with PySpark - Part 1](http://www.mccarroll.net/blog/pyspark/)
* [Getting started with PySpark - Part 2](http://www.mccarroll.net/blog/pyspark2/index.html)
* [A really really fast introduction to PySpark](http://www.slideshare.net/hkarau/a-really-really-fast-introduction-to-py-spark-lightning-fast-cluster-computing-with-python-1)
* [PySpark](http://www.slideshare.net/thegiivee/pysaprk?qid=81cf1b31-8b19-4570-89a5-21d03cad6ecd&v=default&b=&from_search=9)
* [Basic Big Data Manipulation with PySpark](http://bigdatasciencebootcamp.com/posts/Part_3/basic_big_data.html)
* [Working in Pyspark: Basics of Working with Data and RDDs](http://www.learnbymarketing.com/618/pyspark-rdd-basics-examples/)
-------
# Questions/Comments
* [View Mahmoud Parsian's profile on LinkedIn](http://www.linkedin.com/in/mahmoudparsian)
* Please send me an email: mahmoud.parsian@yahoo.com
* [Twitter: @mahmoudparsian](http://twitter.com/mahmoudparsian)
Thank you!
````
best regards,
Mahmoud Parsian
````
-----
[//]: # (metadata:)
[//]: # (Spark, PySpark, Python)
[//]: # (MapReduce, Distributed Algorithms, mappers, reducers, partitioners)
[//]: # (Transformations, Actions, RDDs, DataFrames, SQL)