# ScrapeServ

**Repository Path**: mirrors_daimajia/ScrapeServ

## Basic Information

- **Project Name**: ScrapeServ
- **Description**: A self-hosted API that takes a URL and returns a file with browser screenshots.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-02-16
- **Last Updated**: 2025-11-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ScrapeServ: Simple URL 🌐 to screenshots 📷 server

You run the API on your machine, you send it a URL, and you get back the website data as a file plus screenshots of the site. Simple as.

This project was made to support [Abbey](https://github.com/US-Artificial-Intelligence/abbey), an AI platform. Its author is [Gordon Kamer](https://x.com/gkamer8).

Some highlights:

- Scrolls through the page and takes screenshots of different sections
- Runs in a Docker container
- Browser-based (will run websites' JavaScript)
- Gives you the HTTP status code and headers from the first request
- Automatically handles 302 redirects
- Handles download links properly
- Tasks are processed in a queue with configurable memory allocation
- Blocking API
- Zero state or other complexity

This web scraper is resource intensive, but it is higher quality than many alternatives. Websites are scraped using Playwright, which launches a Firefox browser context for each job.

## Setup

You should have Docker and `docker compose` installed.

1. Clone this repo
2. Run `docker compose up` (a `docker-compose.yml` file is provided for your use)

...and the service will be available at `http://localhost:5006`. See the Usage section below for details on how to interact with it.

### API Keys

You may set an API key using a `.env` file inside the `/scraper` folder (at the same level as `app.py`). You can set as many API keys as you'd like; any variable whose name starts with `SCRAPER_API_KEY` is treated as an allowed key. For example, here is a `.env` file that provides three usable keys:

```
SCRAPER_API_KEY=should-be-secret
SCRAPER_API_KEY_OTHER=can-also-be-used
SCRAPER_API_KEY_3=works-too
```

API keys are sent to the service using the Authorization Bearer scheme.

## Usage

The root path `/` returns status 200 if the service is online, plus some Gilbert and Sullivan lyrics (you can visit it in your browser to check that it's up). The only other path is `/scrape`, to which you send a JSON-formatted POST request and (if all goes well) receive a `multipart/mixed` response. The response will be either:

- Status 200: a `multipart/mixed` response where the first part is of type `application/json` with information about the request; the second part is the website data (usually `text/html`); and the remaining parts are up to 5 screenshots.
- Not status 200: an `application/json` response with an error message under the `error` key.
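Before wiring up a full client, you can confirm the service is reachable and see how a key is attached. A minimal sketch (the key value is a placeholder; substitute any key from your `.env`):

```
import requests

BASE_URL = "http://localhost:5006"

# Liveness check: the root path returns status 200 (plus some lyrics) when the service is online.
resp = requests.get(f"{BASE_URL}/", timeout=5)
print("online" if resp.ok else f"unexpected status: {resp.status_code}")

# If you've configured API keys, send one as a Bearer token with your /scrape requests:
auth_headers = {"Authorization": "Bearer should-be-secret"}
```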
Here's a sample cURL request:

```
curl -X POST "http://localhost:5006/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://us.ai"}'
```

Here is a code example using Python and the `requests_toolbelt` library to interact with the API properly:

```
import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder
import sys
import json

data = {
    'url': "https://us.ai"
}

# Only needed if you're using an API key
headers = {
    'Authorization': 'Bearer Your-API-Key'
}

response = requests.post('http://localhost:5006/scrape', json=data, headers=headers, timeout=30)

if response.status_code != 200:
    my_json = response.json()
    message = my_json['error']
    print(f"Error scraping: {message}", file=sys.stderr)
else:
    decoder = MultipartDecoder.from_response(response)
    for i, part in enumerate(decoder.parts):
        if i == 0:
            # The first part is JSON with information about the request
            json_part = json.loads(part.content)
            req_status = json_part['status']  # An integer
            req_headers = json_part['headers']  # Headers from the request made to your URL
            metadata = json_part['metadata']  # Info like the number of screenshots and their compressed / uncompressed sizes
            # ...
        elif i == 1:
            # The second part is the actual content of the page
            content = part.content
            content_headers = part.headers  # Will contain info about the content (text/html, application/pdf, etc.)
            # ...
        else:
            # The remaining parts are screenshots, if they exist
            img = part.content
            img_headers = part.headers  # Will tell you the image format
            # ...
```

## Security Considerations

Navigating to untrusted websites is a serious security issue. Risks are somewhat mitigated in the following ways:

- Runs as an isolated Docker container (container isolation)
- Each website is scraped in a new browser context (process isolation)
- Strict memory limits and timeouts apply to each task
- Checks the URL to make sure it's not suspicious (loopback addresses, non-HTTP schemes, etc.)

You may take additional precautions depending on your needs, like:

- Only giving the API trusted URLs (or otherwise screening URLs)
- Running this API on isolated VMs (hardware isolation)
- Using one API instance per user
- Not making any secret files or keys available inside the container (besides the API key for the scraper itself)

**If you'd like to make sure that this API is up to your security standards, please examine the code and open issues! It's not a big repo.**

## Other Configuration

You can control memory limits and other variables at the top of `scraper/worker.py`. Here are the defaults:

```
MEM_LIMIT_MB = 4_000  # 4 GB memory threshold for child scraping process
MAX_SCREENSHOTS = 5
SCREENSHOT_JPEG_QUALITY = 85
BROWSER_HEIGHT = 2000
BROWSER_WIDTH = 1280
USER_AGENT = "Mozilla/5.0 (compatible; Abbey/1.0; +https://github.com/US-Artificial-Intelligence/scraper)"
```
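With these defaults, each response can include up to five JPEG screenshots. Building on the decoding example in the Usage section, here is a minimal sketch of saving the screenshot parts to disk (the filenames and the extension fallback are illustrative assumptions, not part of the API):

```
import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder

response = requests.post('http://localhost:5006/scrape', json={'url': 'https://us.ai'}, timeout=30)
response.raise_for_status()  # kept short here; see the Usage example for proper error handling

decoder = MultipartDecoder.from_response(response)
for i, part in enumerate(decoder.parts):
    if i < 2:
        continue  # part 0 is the request JSON, part 1 is the page content
    # Part headers from requests_toolbelt are byte strings; Content-Type indicates the image format.
    ctype = part.headers.get(b'Content-Type', b'').decode('utf-8', 'ignore')
    ext = 'png' if 'png' in ctype else 'jpg'  # assumption: JPEG by default, per SCREENSHOT_JPEG_QUALITY above
    with open(f'screenshot_{i - 1}.{ext}', 'wb') as f:
        f.write(part.content)
```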