# mrjob
**Repository Path**: mirrors_Asana/mrjob
## Basic Information
- **Project Name**: mrjob
- **Description**: Asana fork of mrjob with master at v0.4.2 with small changes
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-09-24
- **Last Updated**: 2025-12-27
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
mrjob
=====
.. image:: http://github.com/yelp/mrjob/raw/master/docs/logos/logo_medium.png
mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming
jobs.
`Stable version (v0.4.2) documentation `_
`Development version documentation `_
.. image:: https://travis-ci.org/Yelp/mrjob.png
   :target: https://travis-ci.org/Yelp/mrjob
mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you
to buy time on a Hadoop cluster on an hourly basis. It also works with your own
Hadoop cluster.
Some important features:

* Run jobs on EMR, your own Hadoop cluster, or locally (for testing).
* Write multi-step jobs (one map-reduce step feeds into the next)
* Duplicate your production environment inside Hadoop

  * Upload your source tree and put it in your job's ``$PYTHONPATH``
  * Run make and other setup scripts
  * Set environment variables (e.g. ``$TZ``)
  * Easily install Python packages from tarballs (EMR only)
  * Setup handled transparently by ``mrjob.conf`` config file

* Automatically interpret error logs from EMR
* SSH tunnel to Hadoop job tracker on EMR
* Minimal setup

  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``
  * To run on your Hadoop cluster, install ``simplejson`` and make sure
    ``$HADOOP_HOME`` is set.

Installation
------------
From PyPI:
``pip install mrjob``
From source:
``python setup.py install``
A Simple Map Reduce Job
-----------------------
Code for this example and more live in ``mrjob/examples``.
.. code:: python

    """The classic MapReduce job: count the frequency of words.
    """
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner(self, word, counts):
            yield (word, sum(counts))

        def reducer(self, word, counts):
            yield (word, sum(counts))


    if __name__ == '__main__':
        MRWordFreqCount.run()

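To get a feel for what the mapper and reducer do before installing mrjob or Hadoop, the data flow can be imitated in plain Python. This is only a rough sketch: ``run_word_count`` is a hypothetical helper, not part of mrjob, and it leaves out the combiner along with everything Hadoop Streaming handles (serialization, sorting, distributing work across machines).

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit (word, 1) for every word on the line, like MRWordFreqCount.mapper."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts gathered for one word, like MRWordFreqCount.reducer."""
    yield word, sum(counts)


def run_word_count(lines):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)

    results = {}
    for word in sorted(grouped):
        for key, total in reducer(word, grouped[word]):
            results[key] = total
    return results


if __name__ == '__main__':
    print(run_word_count(["the quick brown fox", "The lazy dog's bone"]))
```

The grouping step in the middle is the part a real run hands off to Hadoop; the mapper and reducer bodies match the ones in the job above.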
Try It Out!
-----------
::

    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts

    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts

    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts

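Each of these commands writes its results to ``counts``. Assuming the job uses mrjob's default output format, where every line holds a JSON-encoded key and a JSON-encoded value separated by a tab, a sketch like the following could load the results back into Python. ``parse_counts`` is a hypothetical helper, not part of mrjob, and a job that sets a different output protocol would need different parsing.

```python
import json


def parse_counts(lines):
    """Turn tab-separated JSON key/value lines into a {word: count} dict.

    Assumes one key/value pair per line (an assumption about the
    output format, not something this helper verifies).
    """
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        counts[json.loads(key)] = json.loads(value)
    return counts


if __name__ == '__main__':
    sample = ['"hadoop"\t3\n', '"mrjob"\t7\n']
    print(parse_counts(sample))
```

For a real run, pass an open file instead of the sample list, for example ``parse_counts(open('counts'))``.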
Setting up EMR on Amazon
------------------------
* Create an `Amazon Web Services account `_
* Sign up for `Elastic MapReduce `_
* Get your access and secret keys (click "Security Credentials" on
  `your account page `_)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and
  ``$AWS_SECRET_ACCESS_KEY`` accordingly
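Before launching a job with ``-r emr``, it can help to confirm that both variables are actually visible to Python. The check below is a small sketch; ``missing_aws_vars`` is a hypothetical helper, not part of mrjob, and it only tests that the variables are non-empty, not that the keys are valid.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == '__main__':
    missing = missing_aws_vars()
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        print("AWS credentials found in the environment.")
```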
Advanced Configuration
----------------------
To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob looks
for its conf file in:
* The path stored in ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``
See `the mrjob.conf documentation
`_ for more information.
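To check which conf file would be picked up on a given machine, the search can be approximated in a few lines. This sketch assumes the three locations are tried in the order listed and that the first file that exists wins; ``find_conf`` is a hypothetical helper written for illustration, not mrjob's own lookup code.

```python
import os


def find_conf(environ=None, exists=os.path.exists):
    """Return the first existing candidate path, or None if there is none.

    Hypothetical helper that follows the search order listed above;
    the `exists` parameter is only there to make testing easy.
    """
    if environ is None:
        environ = os.environ

    candidates = []
    if environ.get("MRJOB_CONF"):
        candidates.append(os.path.expanduser(environ["MRJOB_CONF"]))
    candidates.append(os.path.expanduser("~/.mrjob.conf"))
    candidates.append("/etc/mrjob.conf")

    for path in candidates:
        if exists(path):
            return path
    return None


if __name__ == '__main__':
    print(find_conf() or "no mrjob.conf found")
```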
Project Links
-------------
* `Source code `_
* `Documentation `_
* `Discussion group `_
Reference
---------
* `Hadoop Streaming `_
* `Elastic MapReduce `_
More Information
----------------
* `PyCon 2011 mrjob overview `_
* `Introduction to Recommendations and MapReduce with mrjob `_
  (`source code `_)
* `Social Graph Analysis Using Elastic MapReduce and PyPy `_

Thanks to `Greg Killion `_
(`blind-works.net `_) for the logo.