# Multimodal-RecSys

**Repository Path**: ppandaer/Multimodal-RecSys

## Basic Information

- **Project Name**: Multimodal-RecSys
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: GPL-3.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-06-03
- **Last Updated**: 2025-06-03

## README

## Datasets

### MOOCCubeX
MOOCCubeX is a comprehensive dataset from XuetangX, containing:
- 4,216 courses
- 230,263 videos
- 358,265 exercises
- 637,572 concepts
- Behavioral data from 3,330,294 students

### Citation Network (DBLPv12)
DBLPv12 includes:
- 4,894,081 papers
- 45,564,149 citation relationships

## Data Processing and Evaluation

### Rating Estimation

#### MOOCCubeX
- **Behavioral-based ratings**: Derived from course completion rates.
- **Binary ratings**: Based on course enrollment, preferred for larger user coverage.

#### DBLPv12
- **Binary ratings**: Based on citation presence between papers.

### Negative Ratings
- Generated by random sampling of non-interacting user-item pairs.
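
The sampling step can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `sample_negatives`, the `n_per_positive` ratio, and the `(user, item, label)` tuple layout are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_negatives(positives, items, n_per_positive=1, seed=0):
    """Draw random (user, item, 0) triples for pairs with no observed interaction.

    positives: list of (user, item) pairs that were observed.
    items: all candidate item ids (assumed unique and to include every positive item).
    n_per_positive: hypothetical ratio of negatives drawn per positive.
    """
    rng = random.Random(seed)
    items = list(items)

    # Items each user has interacted with (or was already assigned as a negative).
    seen = defaultdict(set)
    for user, item in positives:
        seen[user].add(item)

    negatives = []
    for user, _ in positives:
        # Never ask for more negatives than there are unused items for this user.
        remaining = len(items) - len(seen[user])
        for _ in range(min(n_per_positive, remaining)):
            while True:
                candidate = rng.choice(items)
                if candidate not in seen[user]:
                    break
            seen[user].add(candidate)  # avoid drawing the same negative twice
            negatives.append((user, candidate, 0))
    return negatives


positives = [("u1", "a"), ("u1", "b"), ("u2", "a")]
items = ["a", "b", "c", "d"]
negatives = sample_negatives(positives, items, n_per_positive=1)
```

Rejection sampling like this is cheap when each user has touched only a small fraction of the catalogue, which is the usual case for enrollment and citation data.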

### Text Processing

#### MOOCCubeX
- Translated course information from Chinese to English.
- Concatenated course fields into a single text document.

#### DBLPv12
- Concatenated paper titles, venues, abstracts, and fields of study.
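
The concatenation step for both datasets can be sketched as below. The helper `build_document` and the field names (`title`, `venue`, `abstract`, `fos`) are hypothetical; the actual column names depend on the processed data files.

```python
def build_document(record, fields, sep=" "):
    """Join selected fields of a record into one text document.

    Missing or empty fields are skipped, list-valued fields are joined
    with commas, and runs of whitespace are collapsed.
    """
    parts = []
    for name in fields:
        value = record.get(name)
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        text = " ".join(str(value).split())  # collapse whitespace and newlines
        if text:
            parts.append(text)
    return sep.join(parts)


# Hypothetical paper record; field names are assumptions for illustration.
paper = {
    "title": "Graph  Embeddings",
    "venue": "KDD",
    "abstract": None,
    "fos": ["Machine learning", "Graphs"],
}
document = build_document(paper, ["title", "venue", "abstract", "fos"])
```

The resulting string is what a text encoder such as BERT would receive as a single input document.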

### Graph Construction

#### MOOCCubeX
- Nodes: 694,528 students and 4,700 courses.
- Edges: 6,683,574 relations.

#### DBLPv12
- Nodes: 2,794,154 papers.
- Edges: 28,393,696 citations.

### Evaluation Metrics
- **HR@K**: Hit Rate at top K recommendations.
- **NDCG@K**: Normalized Discounted Cumulative Gain at top K recommendations.
- **MRR**: Mean Reciprocal Rank.
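
A minimal sketch of the three metrics, assuming a leave-one-out protocol in which each user has exactly one held-out positive item that is ranked against other candidates. That protocol is an assumption of this example and may differ from what the metrics scripts do. With a single relevant item the ideal DCG is 1, so NDCG reduces to `1 / log2(rank + 1)`.

```python
import math


def hit_rate_at_k(ranked, target, k):
    """1.0 if the held-out item appears in the top k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """NDCG@K for a single relevant item (ideal DCG is 1)."""
    if target in ranked[:k]:
        rank = ranked.index(target) + 1  # 1-based position
        return 1.0 / math.log2(rank + 1)
    return 0.0


def reciprocal_rank(ranked, target):
    """1 / rank of the held-out item, or 0.0 if it is not in the list."""
    if target in ranked:
        return 1.0 / (ranked.index(target) + 1)
    return 0.0


def evaluate(cases, k):
    """Average the metrics over (ranked_items, held_out_item) pairs."""
    n = len(cases)
    return {
        "HR@K": sum(hit_rate_at_k(r, t, k) for r, t in cases) / n,
        "NDCG@K": sum(ndcg_at_k(r, t, k) for r, t in cases) / n,
        "MRR": sum(reciprocal_rank(r, t) for r, t in cases) / n,
    }


# Two hypothetical users: the first has its target at rank 2, the second misses the top 2.
cases = [(["c", "a", "b"], "a"), (["x", "y", "z"], "z")]
scores = evaluate(cases, k=2)
```

Here MRR is computed over the full ranked list rather than the top K, which is one common convention.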

## Baseline Model

### SVD Matrix Factorization
- Implemented using TuriCreate with 32 latent factors and 50 iterations.

## Experiments

### NeuMF (Neural Matrix Factorization)
- Combines GMF and MLP to predict ratings.
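
The way the two branches combine can be illustrated with a forward pass in NumPy. This is a sketch with random, untrained weights and made-up sizes, not the training code from this repository: GMF multiplies the user and item embeddings element-wise, the MLP processes their concatenation, and a final layer maps both outputs to a score in (0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 5, 7, 4  # illustrative sizes only

# Separate embedding tables for the GMF and MLP branches.
P_gmf, Q_gmf = rng.normal(size=(n_users, dim)), rng.normal(size=(n_items, dim))
P_mlp, Q_mlp = rng.normal(size=(n_users, dim)), rng.normal(size=(n_items, dim))

# One hidden layer for the MLP branch and one output layer over both branches.
W1, b1 = rng.normal(size=(2 * dim, dim)), np.zeros(dim)
w_out = rng.normal(size=2 * dim)


def neumf_score(user, item):
    """Predicted interaction probability for one (user, item) pair."""
    gmf = P_gmf[user] * Q_gmf[item]                      # element-wise product
    mlp_in = np.concatenate([P_mlp[user], Q_mlp[item]])  # concatenated embeddings
    mlp = np.maximum(0.0, mlp_in @ W1 + b1)              # ReLU hidden layer
    logit = np.concatenate([gmf, mlp]) @ w_out           # fuse both branches
    return 1.0 / (1.0 + np.exp(-logit))                  # sigmoid


score = neumf_score(0, 1)
```

In practice all of these arrays would be learned with a binary cross-entropy loss over the positive and sampled negative pairs.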

### BERTMF
- Incorporates BERT embeddings for text data into the NeuMF model.

### GraphMF
- Uses Geometric Laplacian Eigenmap Embeddings (GLEE) for graph data.
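
A small dense sketch of the embedding step, assuming the GLEE formulation in which the graph Laplacian `L = D - A` is eigendecomposed and the eigenvectors of the `dim` largest eigenvalues are scaled by the square roots of those eigenvalues. The function name `glee_embedding` is hypothetical, and a dense solver is used only for illustration; graphs with millions of nodes would need a sparse eigensolver.

```python
import numpy as np


def glee_embedding(adjacency, dim):
    """Embed nodes using the top `dim` eigenpairs of the graph Laplacian."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian D - A
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]  # indices of the largest eigenvalues
    # Clip tiny negative values caused by floating-point error before the sqrt.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Triangle graph: every node is connected to the other two.
triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
embedding = glee_embedding(triangle, dim=2)
laplacian = np.diag(np.sum(triangle, axis=1)) - np.array(triangle)
```

For this triangle the only discarded eigenvalue is 0, so the dot products between the embedded nodes reproduce the Laplacian entries exactly. In general they only approximate them.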

### MultiMF
- Combines BERT embeddings and graph node embeddings for enhanced predictions.

## Running the Models

To run the models, first run the pre-processing script, which downloads and processes the data.
This can take a couple of hours, because translating the documents, constructing the graph,
and computing the embeddings are computationally costly.

After that, train the models with `train_{model}.py` for MOOCCubeX
and `train_{model}_dblp.py` for the citation network.

To compute the metrics, use `metrics_{model}.py` for MOOCCubeX
and `metrics_{model}_dblp.py` for the citation network.