# portia

**Repository Path**: mirrors_leecade/portia

## Basic Information

- **Project Name**: portia
- **Description**: Visual scraping for Scrapy
- **Primary Language**: Unknown
- **License**: BSD-3-Clause
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-09
- **Last Updated**: 2026-02-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

portia
======
Visual scraping for Scrapy.


Overview
========

Portia is a tool for visually scraping web sites without any programming knowledge. Just annotate web pages with a point and click editor to indicate what data you want to extract, and portia will learn how to scrape similar pages
from the site.

Portia has a web based UI served by a [Twisted] server, so you can install it on almost any modern platform.

Requirements
============

* Python 2.7
* Works on Linux, Windows, Mac OSX, BSD
* Supported browsers: Latest versions of Chrome (recommended) or Firefox


Repository structure
====================

There are two main components in this repository, __slyd__ and __slybot__:

###slyd

The visual editor used to create your scraping projects.

###slybot

The Python web crawler that performs the actual site scraping. It's implemented on top of the [Scrapy] web crawling
framework and the [Scrapely] extraction library. It uses projects created with __slyd__ as input.


How to install portia
=============================

The recommended way to install dependencies is to use __virtualenv__ and then do:

	cd slyd
	pip install -r requirements.txt

As __slybot__ is a __slyd__ dependency, it will also get installed.

Running portia
==============

First, you need to start the ui and create a project. Run __slyd__ using:

	cd slyd
	twistd -n slyd

and point your browser to: `http://localhost:9001/static/main.html`

Choose the site you want to scrape and create a project. Every project is created with a default spider named after the domain of the site you are scraping. When you are ready, you can run your project with __slybot__ to do the actual crawling/extraction.

Projects created with __slyd__ can be found at:

	slyd/data/projects

To run one of those projects use:

	portiacrawl project_path spidername

Where `spidername` should be one of the project spiders. If you don't remember the name of the spider, just use:

	portiacrawl project_path

and you will get the list of spiders for that project.


[Twisted]: https://twistedmatrix.com
[Scrapely]: https://github.com/scrapy/scrapely
[Scrapy]: http://scrapy.org