# hackthon

**Repository Path**: osgood001/hackthon

## Basic Information

- **Project Name**: hackthon
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2025-08-22
- **Last Updated**: 2025-09-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Train an MCP Thinker

Agentic learning with thinking tools is a path toward better physicists and domain experts.

## 💡 Motivation

Recent work on evaluating LLM capabilities, such as [PhySense](https://arxiv.org/abs/2505.24823), has shown that SOTA models like Gemini 2.5 Pro and Claude 3.7 Thinking fail to solve physics problems that require physical intuition (symmetry, conservation laws, dimensional analysis), producing large errors at a high token cost, while a human expert can solve them on the back of an envelope. Providing "hints" about these ideas alone has been shown to be insufficient to boost AI performance, and more training data is generally expected to be the fix.

However, in a [blog post](https://www.anthropic.com/engineering/claude-think-tool) by Anthropic (the developer of the Claude models), they propose a "Think Tool", an MCP tool based on a pre-prompted LLM. With a proper prompt and this think tool, the same agent reportedly reached nearly a 100% performance boost in some under-represented domains, such as airline checks. This suggests a recipe:

> Reflect (Think) + Prompt = Stronger AI in a Smaller Field

In a similar spirit, an [arXiv paper](https://arxiv.org/abs/2507.15855v1) argues that _Gemini 2.5 Pro is capable of winning gold at IMO 2025_. Its methodology is a workflow of self-criticism and reflection driven by carefully designed prompts, plus a small amount of human expert hinting:

```
(For Problem 1) Use induction
(For Problem 3) Use analytic geometry
```

While these hints are short, they appear to be necessary for the project's success, and the IMO board has pointed out that such simple hints can dramatically influence how different LLM models are evaluated. Nevertheless, the paper shows that, using the recipe above, current models can reach _IMO gold medal_ performance.

## 🎯 Project Overview

> Domain knowledge needs AI in order to evolve; we evolved it, it worked, and that makes this direction promising.

An LLM that can learn and evolve its own domain knowledge is a promising direction, so this project builds a self-evolving agent based on the Model Context Protocol (MCP). We evolve it on several math and physics problems, and the resulting "trained" agent achieves better performance than the original agent.

Below is an example think tool that the agent designed for itself after learning from mistakes in handling translation and rotation:

```python
@mcp.tool()
async def reorient(input: str) -> str:
    """Use this function whenever you need to move an object to the origin, apply a rotation or
    other transformation, and then move it back while checking each step. It separates the three
    phases, requires you to state the arithmetic operation in words, and verifies the intermediate
    results before proceeding. This disciplined approach prevents errors caused by mixing
    translation and rotation."""
    prompt = """When solving a problem that requires moving something and then changing its
orientation, always follow three distinct phases. First move the object so that the reference
point becomes the origin. Second apply the rotation or other transformation while the object is
at the origin. Third move the object back to its original location. Keep each phase separate and
state clearly what you are adding, subtracting, or multiplying in words. After each phase pause
and check the intermediate result against what you expect for that step. Do not skip the initial
move or combine the steps without verification. Always respect the order of operations and
confirm the final answer by comparing it with an independent check or by reversing the steps
mentally. If any part does not match, return to the most recent phase and recompute before
proceeding. This disciplined sequence and verification routine will prevent the kind of error
that arises from mishandling translation or rotation. Please follow the instruction.
The input is: """ + input
    return await ask_llm_with_retries([{"role": "user", "content": prompt}])
```

Check out `logs/` to see how the model arrives at tools like this.
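The `reorient` tool above delegates to a helper called `ask_llm_with_retries`, which is defined elsewhere in the repository. Below is a minimal sketch of what such a helper might look like, assuming an OpenAI-compatible endpoint configured through the `URL`, `API_KEY`, and `MODEL` environment variables described in the setup section; the signature, retry policy, and use of the `openai` client here are illustrative assumptions, not the repository's actual implementation.

```python
from __future__ import annotations

import asyncio
import os

from openai import AsyncOpenAI  # any OpenAI-compatible endpoint works via base_url

# Hypothetical client configured from the same environment variables used by solver.py.
client = AsyncOpenAI(base_url=os.getenv("URL"), api_key=os.getenv("API_KEY"))


async def ask_llm_with_retries(messages: list[dict], max_retries: int = 3) -> str:
    """Send a chat request to the configured model, retrying on transient failures."""
    last_error = None
    for attempt in range(max_retries):
        try:
            response = await client.chat.completions.create(
                model=os.getenv("MODEL"),
                messages=messages,
            )
            return response.choices[0].message.content
        except Exception as error:  # e.g. rate limits or network hiccups
            last_error = error
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"LLM call failed after {max_retries} attempts") from last_error
```

The exponential backoff is just one reasonable choice; the point is that tool calls made by the agent should not abort a whole training run on a transient API error.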
## 🏗️ Architecture

> The agent solves problems, makes mistakes, self-corrects, and compares its results with the reference answers.
> Once that is done, it reflects on the experience and distills lessons that can be learned from it.

### Core Components

#### Train

- `solver.py`: The main entry point that supports training and evaluating the agent
- `original_process.py`: A simple problem-solving process that can be trained
- `self_reflect_process_entry.py`: A self-reflection framework that can be trained
- `agent_process.py`: A problem-solving process in which the agent can access the tools defined in `MCP/experience.py`
- `MCP/create.py`: A tool that can create new MCP tools on the fly

#### Evaluate

- `MCP/experience.py`: The MCP server that the agent modifies to add new tools
- `extract_boxed.py`: A tool that extracts the final answer from LLM output (see the sketch below)
- `equation_equivilancy.py`: A tool that checks whether two equations are equivalent, used to verify answers (see the sketch below)
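For illustration, here is a minimal sketch of the kind of logic `extract_boxed.py` might implement: pulling the contents of the last `\boxed{...}` expression out of a model's response while respecting nested braces. The function below is an assumption for illustration, not the repository's actual code.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, handling nested braces."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed answer found
    i = start + len(marker)
    begin, depth = i, 1
    # Walk forward until the brace opened by \boxed{ is closed again.
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    if depth != 0:
        return None  # unbalanced braces
    return text[begin:i - 1]


print(extract_boxed(r"Thus the result is \boxed{\frac{1}{2}}."))  # -> \frac{1}{2}
```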
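In the same spirit, here is a hedged sketch of how `equation_equivilancy.py` might verify answers symbolically, assuming `sympy` is available and both answers parse as expressions; the repository's actual checker may handle more cases (LaTeX input, numerical tolerances, units).

```python
import sympy as sp


def equations_equivalent(expr_a: str, expr_b: str) -> bool:
    """Return True if the two expressions simplify to the same symbolic value."""
    try:
        a, b = sp.sympify(expr_a), sp.sympify(expr_b)
    except (sp.SympifyError, SyntaxError, TypeError):
        return False
    # Equivalent expressions have a difference that simplifies to zero.
    return sp.simplify(a - b) == 0


print(equations_equivalent("sin(x)**2 + cos(x)**2", "1"))  # True
print(equations_equivalent("2*x + 1", "2*(x + 1)"))        # False
```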
## 🚀 Key Features

> After training, the agent has accumulated a long list of tools, and using these tools improves its performance.

### Analysis Tools

- `exactify`: Precise data formatting and validation
- `validate`: Rigorous solution checking
- `sanitycheck`: Unit and magnitude verification

### Quantum Mechanics

- `angularize`: Angular momentum analysis
- `harmonicoscillator`: Complete quantum harmonic oscillator solutions
- `two_state_evolution`: Two-level system dynamics
- `quantumguard`: Quantum constant validation and dimensional analysis
- `symmetryChecklist`: Symmetry analysis for quantum states

### Mathematical Methods

- `asymptotic`: Asymptotic analysis and limits
- `perturbSolve`: Perturbation theory applications
- `perturbationAudit`: Rigorous perturbation theory verification
- `deriveObservable`: Observable quantity derivation from first principles

### Statistical Physics

- `statweight`: Statistical weight calculations
- `calcavg`: Average value computations
- `scalePredict`: Scaling law predictions

_Detailed benchmarking results will be provided later._

## 🛠️ Setup Instructions

### Prerequisites

- Python 3.8 or higher
- UV package manager (`pip install uv`)
- An OpenAI-compatible API key and base URL (set in `.env`)

### Environment Setup

1. Clone the repository
2. Install dependencies:

   ```bash
   uv pip install -r requirements.txt
   ```

3. Set up environment variables (see `.env` for details):

   ```bash
   export URL="your-openai-api-url"
   export API_KEY="your-openai-api-key"
   export MODEL="your-model-name"
   ```

### Edit `solver.py` for different modes

At the top of the file, set:

```py
process_type = "origin"  # or "agent" or "self" for different modes
train = True             # whether to train the MCP (new tools are appended based on experience)
subject = "math"         # or "physics"; each subject uses its own MCP tool server, which keeps any
                         # single server from accumulating too many tools and avoids cross-domain interference
```

At the end of the file, set:

```py
llm = os.getenv("MODEL")

# training and evaluation runs are written to different directories
if train:
    base_output_dir = f"{process_type}_output/{llm}_output_train"
else:
    base_output_dir = f"{process_type}_output/{llm}_output"

input_jsonl_list = [
    # the datasets used to train the model
    # "datasets/atomic_dataset.jsonl",
    # "datasets/electro_dataset.jsonl",
    # "datasets/mechanics_dataset.jsonl",
    # "datasets/optics_dataset.jsonl",
    # "datasets/quantum_dataset.jsonl",
    # "datasets/statistics_dataset.jsonl"
    "MATH500/test.jsonl"
]

max_lines = 50  # the maximum number of problems the model should process

main(llm, base_output_dir, input_jsonl_list, max_lines)
```

### Download the datasets used to train/evaluate the model

This project uses PHYSICS and MATH500 for training; download them and place them in the corresponding directories. If a new dataset is used, edit the `process_entry` function to handle its format.

### Running the System

```bash
python solver.py
```

## 🎓 Applications

This agentic learning framework enables domain-knowledge discovery through **reflection** on **experience**, and the resulting knowledge can be interpreted and shared as **MCP** tools, which makes it promising for educational and research applications. The system is well suited to:

- **Education**: Demonstrating problem-solving methodologies and how to reflect on the learning process
- **RL**: A precursor to reinforcement learning or fine-tuning that makes LLMs better
- **Benchmarking**: Generating, filtering, evaluating, and improving the datasets used to train LLMs
- **Sharing**: Knowledge can be edited, shared, and recombined between experts and users, forming a community that values experience and ideas

> Reflect (Think) + Prompt = Stronger AI in a Smaller Field

## 🤝 Contributing

This is a demo project, so it may undergo rapid development.

## 📄 License

This project was developed as part of a hackathon hosted at Tsinghua University. Please check with the author for licensing details.

## 🙏 Acknowledgments

- The sponsors and organizers of this hackathon