# mrjob

**Repository Path**: mirrors_Asana/mrjob

## Basic Information

- **Project Name**: mrjob
- **Description**: Asana fork of mrjob with master at v0.4.2 with small changes
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-09-24
- **Last Updated**: 2025-12-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

mrjob
=====

.. image:: http://github.com/yelp/mrjob/raw/master/docs/logos/logo_medium.png

mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

`Stable version (v0.4.2) documentation `_

`Development version documentation `_

.. image:: https://travis-ci.org/Yelp/mrjob.png
    :target: https://travis-ci.org/Yelp/mrjob

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you
to buy time on a Hadoop cluster on an hourly basis. It also works with your own
Hadoop cluster.

Some important features:

* Run jobs on EMR, your own Hadoop cluster, or locally (for testing).
* Write multi-step jobs (one map-reduce step feeds into the next).
* Duplicate your production environment inside Hadoop:

  * Upload your source tree and put it in your job's ``$PYTHONPATH``
  * Run make and other setup scripts
  * Set environment variables (e.g. ``$TZ``)
  * Easily install Python packages from tarballs (EMR only)
  * Setup handled transparently by the ``mrjob.conf`` config file

* Automatically interpret error logs from EMR.
* SSH tunnel to the Hadoop job tracker on EMR.
* Minimal setup:

  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``.
  * To run on your Hadoop cluster, install ``simplejson`` and make sure ``$HADOOP_HOME`` is set.

Installation
------------

From PyPI: ``pip install mrjob``

From source: ``python setup.py install``

A Simple Map Reduce Job
-----------------------

Code for this example and more live in ``mrjob/examples``.

.. code:: python

    """The classic MapReduce job: count the frequency of words."""
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner(self, word, counts):
            yield (word, sum(counts))

        def reducer(self, word, counts):
            yield (word, sum(counts))


    if __name__ == '__main__':
        MRWordFreqCount.run()

Try It Out!
-----------

::

    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts

    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts

    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts

Setting up EMR on Amazon
------------------------

* Create an `Amazon Web Services account `_.
* Sign up for `Elastic MapReduce `_.
* Get your access and secret keys (click "Security Credentials" on `your account page `_).
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY`` accordingly.

Advanced Configuration
----------------------

To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob
looks for its conf file in:

* The contents of ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See `the mrjob.conf documentation `_ for more information.
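As a rough illustration, a minimal ``mrjob.conf`` might look like the sketch
below. The option names shown (``aws_region``, ``ec2_instance_type``,
``num_ec2_instances``, ``cmdenv``) are assumptions based on mrjob 0.4.x, and
the values are placeholders; check both against the mrjob.conf documentation
for your version before relying on them.

.. code:: yaml

    # Hypothetical ~/.mrjob.conf -- verify option names against the
    # mrjob.conf documentation for your mrjob version.
    runners:
      emr:
        aws_region: us-west-1          # run in a non-default AWS region
        ec2_instance_type: m1.small    # instance type for the cluster nodes
        num_ec2_instances: 4           # number of instances to spin up
        cmdenv:
          TZ: America/Los_Angeles      # environment variables for your tasks
      hadoop:
        cmdenv:
          TZ: America/Los_Angeles

With a file like this in place, running a job with ``-r emr`` or ``-r hadoop``
picks up the options from the corresponding ``runners`` section automatically.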
Project Links
-------------

* `Source code `_
* `Documentation `_
* `Discussion group `_

Reference
---------

* `Hadoop Streaming `_
* `Elastic MapReduce `_

More Information
----------------

* `PyCon 2011 mrjob overview `_
* `Introduction to Recommendations and MapReduce with mrjob `_ (`source code `_)
* `Social Graph Analysis Using Elastic MapReduce and PyPy `_

Thanks to `Greg Killion `_ (`blind-works.net `_) for the logo.