# PySpark Tutorial

* PySpark is the Python API for Spark.
* The purpose of this PySpark tutorial is to present basic distributed algorithms using PySpark.
* PySpark supports two types of data abstractions (both are illustrated in the sketches after the Miscellaneous section below):
	* RDDs
	* DataFrames
* **PySpark Interactive Mode**: PySpark has an interactive shell (`$SPARK_HOME/bin/pyspark`) for basic testing and debugging; it is not intended for production use.
* **PySpark Batch Mode**: use the `$SPARK_HOME/bin/spark-submit` command to run PySpark programs (suitable for both testing and production environments).

------

# [Glossary: big data, MapReduce, Spark](https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/slides/glossary/README.md)

------

# [Basics of PySpark with Examples](./howto/README.md)

------

# PySpark Examples and Tutorials

* [PySpark Examples: RDDs](./tutorial/pyspark-examples/rdds/)
* [PySpark Examples: DataFrames](./tutorial/pyspark-examples/dataframes/)
* [DNA Base Counting](./tutorial/dna-basecount/README.md)
* [Classic Word Count](./tutorial/wordcount)
* [Find Frequency of Bigrams](./tutorial/bigrams)
* [Join of Two Relations R(K, V1), S(K, V2)](./tutorial/basic-join)
* [Basic Mapping of RDD Elements](./tutorial/basic-map)
* [How to add all RDD elements together](./tutorial/basic-sum)
* [How to multiply all RDD elements together](./tutorial/basic-multiply)
* [Find Top-N and Bottom-N](./tutorial/top-N)
* [Find average by using combineByKey()](./tutorial/combine-by-key) (a sketch appears after the Miscellaneous section below)
* [How to filter RDD elements](./tutorial/basic-filter)
* [How to find average](./tutorial/basic-average)
* [Cartesian Product: rdd1.cartesian(rdd2)](./tutorial/cartesian)
* [Sort By Key: sortByKey() ascending/descending](./tutorial/basic-sort)
* [How to Add Indices](./tutorial/add-indices)
* [Map Partitions: mapPartitions() by Examples](./tutorial/map-partitions/README.md)
* [Monoid: Design Principle](https://github.com/mahmoudparsian/data-algorithms-with-spark/blob/master/wiki-spark/docs/monoid/README.md)

------

# Books

### [Data Algorithms with Spark](https://github.com/mahmoudparsian/data-algorithms-with-spark/)

### [Data Algorithms](https://github.com/mahmoudparsian/data-algorithms-book/)

### [PySpark Algorithms](https://github.com/mahmoudparsian/pyspark-algorithms/)

-----

# Miscellaneous

### [Download, Install Spark and Run PySpark](./howto/download_install_run_spark.md)

### [How to Minimize the Verbosity of Spark](./howto/minimize_verbosity.md)
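------

# Quick PySpark Sketches

The following word count is a minimal sketch, not one of this repository's tutorial solutions. It touches both data abstractions (RDDs and DataFrames) and can be run in either mode described at the top of this README; the input file `sample.txt` is a hypothetical placeholder.

```python
# wordcount.py -- a minimal sketch; "sample.txt" is a hypothetical input file.
#
# Interactive mode: paste these lines into $SPARK_HOME/bin/pyspark
# (where a SparkSession named `spark` is already defined).
# Batch mode:
#   $SPARK_HOME/bin/spark-submit wordcount.py

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# RDD abstraction: classic word count with flatMap/map/reduceByKey
counts = (sc.textFile("sample.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.collect())

# DataFrame abstraction: view the same (word, count) pairs as a table
df = spark.createDataFrame(counts, ["word", "count"])
df.show()

spark.stop()
```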
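The `combineByKey()` tutorial linked above finds the average per key; a minimal sketch of that technique, using made-up sample data rather than the tutorial's own input, looks like this:

```python
# avg_by_key.py -- a sketch of per-key averages via combineByKey()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avg-by-key").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 10), ("b", 20), ("b", 30)])

# For each key, build a (sum, count) pair, then divide to get the average.
sum_count = pairs.combineByKey(
    lambda v: (v, 1),                               # createCombiner
    lambda C, v: (C[0] + v, C[1] + 1),              # mergeValue
    lambda C1, C2: (C1[0] + C2[0], C1[1] + C2[1])   # mergeCombiners
)
averages = sum_count.mapValues(lambda p: p[0] / p[1])
print(averages.collect())  # e.g. [('a', 2.0), ('b', 20.0)] -- order may vary

spark.stop()
```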
------

# PySpark Tutorial and References

* [Getting started with PySpark - Part 1](http://www.mccarroll.net/blog/pyspark/)
* [Getting started with PySpark - Part 2](http://www.mccarroll.net/blog/pyspark2/index.html)
* [A really really fast introduction to PySpark](http://www.slideshare.net/hkarau/a-really-really-fast-introduction-to-py-spark-lightning-fast-cluster-computing-with-python-1)
* [PySpark](http://www.slideshare.net/thegiivee/pysaprk?qid=81cf1b31-8b19-4570-89a5-21d03cad6ecd&v=default&b=&from_search=9)
* [Basic Big Data Manipulation with PySpark](http://bigdatasciencebootcamp.com/posts/Part_3/basic_big_data.html)
* [Working in Pyspark: Basics of Working with Data and RDDs](http://www.learnbymarketing.com/618/pyspark-rdd-basics-examples/)

-------

# Questions/Comments

* [View Mahmoud Parsian's profile on LinkedIn](http://www.linkedin.com/in/mahmoudparsian)
* Please send me an email: mahmoud.parsian@yahoo.com
* [Twitter: @mahmoudparsian](http://twitter.com/mahmoudparsian)

Thank you!

````
best regards,
Mahmoud Parsian
````

------

[//]: # (metadata:)
[//]: # (Spark, PySpark, Python)
[//]: # (MapReduce, Distributed Algorithms, mappers, reducers, partitioners)
[//]: # (Transformations, Actions, RDDs, DataFrames, SQL)