# cerebro

**Repository Path**: mirrors_twilio/cerebro

## Basic Information

- **Project Name**: cerebro
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: GPL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-10-22
- **Last Updated**: 2025-11-30

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

Cerebro
=======

Cerebro finds secrets such as passwords, tokens, private keys and more in a Git repo.

## Tech Stack

Cerebro requires:

* Python 3.5
* SQLite

## Getting Started

### Configure Target Repositories

Populate the `targets.yaml` file in the config directory using the example:

`$ cp config/targets.example.yaml config/targets.yaml`

### Local Usage

Clone this repo and export the following environment variables:

* CEREBRO_DATABASE_URL - full/path/to/sqlite/database/file

If you wish to receive Cerebro results in Slack, also configure:

* SLACK_API_URL - Incoming Webhooks endpoint from Slack
* SLACK_CHANNEL_OR_USER - the @user or #channel to send scan notifications to

Set up the environment:

`$ make local-install`

Execute (or set up a cron job for the following command):

`$ python cerebro.py` or `$ make local-run`

Run the tests:

`$ make local-test`

### Docker-compose Usage

Copy the env-example file and edit it appropriately:

`$ cp env-example .env`

Build the Docker environment (it will use Ubuntu latest):

`$ make docker-build`

Run the tests:

`$ make docker-tests`

Run cerebro:

`$ make docker-run`

## Tests

Execute

```
$ pytest -sv tests/
```

## Reporting

A summary of results is provided in JSON format by default, or can be delivered via Slack; detailed results can be reviewed directly in SQLite or [Todo - Add the url of the *cerebro* dashboard once we have a box configured for it].

## Definitions

These definitions describe how raw data is processed and stored:

* BLOCK_SIZE - the length of any contiguous run of characters (i.e. BASE64 or HEXADECIMAL) whose entropy is measured during a scan; the default is 20 (see the sketch after this list)
* TOKENS - a BLOCK_SIZE run of characters that was matched during the scan process
* BLOBS - the portions of a file that contain a TOKEN
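The entropy measure behind these definitions is Shannon entropy computed over a fixed character set. The sketch below is not Cerebro's actual code; the function name, example strings and approximate printed values are illustrative only, and the 4.5 BASE64 threshold it mentions comes from the Design Notes further down.

```python
import math

# BASE64 alphabet, matching the character class described in the Definitions.
BASE64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="


def shannon_entropy(block, charset):
    """Shannon entropy of `block`, measured over the characters in `charset`.

    Higher values mean the characters are spread more evenly, which is typical
    of randomly generated secrets rather than ordinary source code.
    """
    if not block:
        return 0.0
    entropy = 0.0
    for char in charset:
        p = block.count(char) / len(block)
        if p > 0:
            entropy -= p * math.log(p, 2)
    return entropy


# A random-looking 40-character credential scores roughly 5.3 bits per
# character, above the 4.5 figure quoted in the Design Notes below.
print(shannon_entropy("tP9qJ4xY2mKfLbWz8RnV0cGdQhUj6aEoB1sIpC5N", BASE64_CHARS))

# Repetitive, English-like text scores roughly 2.7 and would not be flagged.
print(shannon_entropy("passwordpasswordpass", BASE64_CHARS))
```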
## Design Notes

There are three high-level components involved in the operation of *cerebro*:

* Git Level Operations:
    * Pulling the latest commit of the `master` branch for each repo listed in `targets.yaml`, checking for diffs if the repo has been scanned (i.e. pulled) before, and creating sub-directories with the "diffed" content (stored in `workspace/diffs`) for subsequent scanning.
* Operating System Level Operations:
    * `targets.yaml`: a list of repos for *cerebro* to scan.
    * `bad_patterns.txt`: a list of regexes used by `egrep`.
    * `egrep`: performs recursive regex grepping over each repo from `targets.yaml` using the patterns from `bad_patterns.txt`.
* Python Level Operations (see the sketch after this list):
    * Each matched string is tested for entropy using [Shannon's algorithm](http://blog.dkbza.org/2007/05/scanning-data-for-entropy-anomalies.html); the basic idea is that a BLOCK_SIZE run of BASE64 characters with an entropy greater than 4.5, or a BLOCK_SIZE run of HEXADECIMAL characters with an entropy greater than 3.0, is flagged as a TOKEN.
    * For config files, however (i.e. .conf, .yaml, .ini, .erb, .rb), we set the BLOCK_SIZE to 6, which ensures that smaller chunks of tokens with sufficient entropy are matched.
    * These results are then further filtered by options set in the `main.yaml` configuration file, e.g. excluding test or 3rd-party library/framework directories and/or specific files from the search.
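To show how the pieces fit together, here is a rough sketch of how the Python level might consume a line that `egrep` has already flagged. None of the names below (`block_size_for`, `find_tokens`) exist in Cerebro; the thresholds and the config-file BLOCK_SIZE of 6 are taken from the notes above, and treating BLOCK_SIZE as a minimum candidate length is a simplifying assumption of this sketch.

```python
import math
import os
import re

BASE64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
HEX_CHARS = "1234567890abcdefABCDEF"
CONFIG_EXTENSIONS = {".conf", ".yaml", ".ini", ".erb", ".rb"}


def shannon_entropy(block, charset):
    """Shannon entropy of `block` over the characters in `charset` (as above)."""
    entropy = 0.0
    for char in charset:
        p = block.count(char) / len(block)
        if p > 0:
            entropy -= p * math.log(p, 2)
    return entropy


def block_size_for(path):
    """Config files get the smaller BLOCK_SIZE described in the Design Notes."""
    return 6 if os.path.splitext(path)[1] in CONFIG_EXTENSIONS else 20


def find_tokens(path, matched_line):
    """Yield candidate TOKENs from a line that egrep already flagged.

    BLOCK_SIZE is treated here as a minimum candidate length, which is a
    simplification for illustration purposes.
    """
    block_size = block_size_for(path)
    for charset, threshold, pattern in (
        (BASE64_CHARS, 4.5, r"[A-Za-z0-9+/=]+"),
        (HEX_CHARS, 3.0, r"[0-9a-fA-F]+"),
    ):
        for candidate in re.findall(pattern, matched_line):
            if len(candidate) >= block_size and shannon_entropy(candidate, charset) > threshold:
                yield candidate


# Example: a line that a bad_patterns.txt regex might have surfaced.
line = 'aws_secret_access_key = "tP9qJ4xY2mKfLbWz8RnV0cGdQhUj6aEoB1sIpC5N"'
print(list(find_tokens("settings.yaml", line)))
```

Running the example prints only the 40-character high-entropy string, while the ordinary identifiers on the same line fall below the thresholds and are skipped.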