# ades

**Repository Path**: mirrors_cloudera/ades

## Basic Information

- **Project Name**: ades
- **Description**: An analysis of adverse drug event data using Hadoop, R, and Gephi
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-08
- **Last Updated**: 2026-05-23

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Adverse Drug Event Analysis with Hadoop, R, and Gephi

## Introduction

This project contains code for running an analysis of adverse drug events using the
 Multi-Item Gamma Poisson Shrinker (MGPS) model described in
[Empirical bayes screening for multi-item associations](http://dl.acm.org/citation.cfm?id=502526).

## Prerequistes

### Code

This analysis is designed to be small enough that you can run it on a single machine if you
do not have access to a Hadoop cluster. You will need to have a version of [CDH3](https://ccp.cloudera.com/display/CDHDOC/CDH4+Installation)
on your local machine, along with the version of Pig that is compatible with that version.

You will need to have Maven for compiling the Pig user-defined functions, and may also want to have a
copy of [R](http://www.r-project.org/) and [Gephi](http://gephi.org/) for certain phases of the analysis.

### Data
The input data for this analysis may be downloaded from the FDA's [AERS website](http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm083765.htm). You'll need to get the ASCII version of the data files for as many quarters as you would like to run over. For my own analysis, I used the data from 2008 through 2010.

The Pig scripts below assume that the input data is stored in three HDFS directories under
the user's home directory: aers/drugs, aers/demos, and aers/reactions. All of the DRUG\*.TXT
files from the AERS website should go into aers/drugs, all of the DEMO\*.TXT files should go
into aers/demos, and all of the REAC\*.TXT files should go into aers/reactions.

## Running the Pipeline

If you have not done so already, load the input data into the Hadoop cluster:

	hdfs dfs -mkdir aers
	hdfs dfs -mkdir aers/drugs
	hdfs dfs -put DRUG*.TXT aers/drugs
	hdfs dfs -mkdir aers/demos
	hdfs dfs -put DEMO*.TXT aers/demos
	hdfs dfs -mkdir aers/reactions
	hdfs dfs -put REAC*.TXT aers/reactions

Each of these commands should be run from the project's top-level directory,
i.e., the directory that contains this README file.

	mvn package  # Builds the Pig UDFs
	pig -f src/main/pig/step1_join_drugs_reactions.pig
	pig -f src/main/pig/step2_generate_drug_reaction_counts.pig
	pig -f src/main/pig/step3_generate_squashed_distribution.pig

At this point, you can optionally run the R code to solve the MGPS
optimization problem. You will need to install the _BB_ library in your
local version of R using _install.packages("BB")_ if you do not have it already.

	hadoop fs -getmerge aers/drugs2_reacs_stats d2r_stats.csv
	Rscript src/main/R/ebgm.R d2r_stats.csv

The output from the optimization run may be plugged into the Pig script
that scores the tuples, or you can just use the default parameters that
are there now:

	pig -f src/main/pig/step4_apply_ebgm.pig

The final output will be in *aers/scored_drugs2_reacs*. To generate the GEXF
file of drug-drug interactions to load into Gephi, run:

	hadoop fs -getmerge aers/scored_drugs2_reacs scored_d2r.csv
	./src/main/python/gephi.py scored_d2r.csv > drugs.gexf