# portia **Repository Path**: mirrors_sudoconf/portia ## Basic Information - **Project Name**: portia - **Description**: Visual scraping for Scrapy - **Primary Language**: Unknown - **License**: BSD-3-Clause - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2020-09-26 - **Last Updated**: 2026-04-05 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README portia ====== Visual scraping for Scrapy. Overview ======== Portia is a tool for visually scraping web sites without any programming knowledge. Just annotate web pages with a point and click editor to indicate what data you want to extract, and portia will learn how to scrape similar pages from the site. Portia has a web based UI served by a [Twisted] server, so you can install it on almost any modern platform. Requirements ============ * Python 2.7 * Works on Linux, Windows, Mac OSX, BSD * Supported browsers: Latest versions of Chrome (recommended) or Firefox Prerequisites ============= You might need to run the following commands to install the required tools & libraries before building portia: apt-get install python-pip python-dev libxml2-dev libxslt1-dev libffi-dev libssl-dev pip install virtualenv Installation ============ The recommended way to install dependencies is to use __virtualenv__: virtualenv YOUR_ENV_NAME --no-site-packages and then do: source YOUR_ENV_NAME/bin/activate cd slyd pip install -r requirements.txt As `slybot` is a `slyd` dependency, it will also get installed. **Note:** you may need to use `sudo` or `pip --user` if you get permissions problems while installing. Running portia ============== First, you need to start the ui and create a project. Run __slyd__ using: cd slyd twistd -n slyd and point your browser to: `http://localhost:9001/static/main.html` Choose the site you want to scrape and create a project. Every project is created with a default spider named after the domain of the site you are scraping. When you are ready, you can run your project with __slybot__ to do the actual crawling/extraction. Projects created with __slyd__ can be found at: slyd/data/projects To run one of those projects use: portiacrawl project_path spidername Where `spidername` should be one of the project spiders. If you don't remember the name of the spider, just use: portiacrawl project_path and you will get the list of spiders for that project. Portia spiders are ultimately [Scrapy] spiders. You can pass __scrapy__ spider arguments when running them with ```portiacrawl``` by using the ```-a``` command line option. A custom settings module may also be specified using the ```--settings``` command line option. Please refer to the [scrapy documentation] for details on arguments and settings. Running portia with [vagrant] ============================= This is probably the easiest way to install and run portia. First, you need to get: * Vagrant: http://www.vagrantup.com/downloads.html * VirtualBox: https://www.virtualbox.org/wiki/Downloads After that ```cd``` into the repo directory and run: vagrant up This will setup and start an ubuntu virtual machine, build portia and launch the ```slyd``` server for you. Just point your browser to `http://localhost:8000/static/main.html` after vagrant has finished the whole process (you should see ```default: slyd start/running, process XXXX``` in your console) and you can start using portia. You can stop the server with ```vagrant suspend``` or ```vagrant halt```. The repository directory is shared with the VM, so you don't need to do anything special to keep it in sync. You can __ssh__ into the virtual machine by running ```vagrant ssh```. The repo dir will be mounted at ```/vagrant``` in the VM. Please note that you __need to ssh into the VM to run the ```portiacrawl``` script__. Repository structure ==================== There are two main components in this repository, __slyd__ and __slybot__: ###slyd The visual editor used to create your scraping projects. ###slybot The Python web crawler that performs the actual site scraping. It's implemented on top of the [Scrapy] web crawling framework and the [Scrapely] extraction library. It uses projects created with __slyd__ as input. [Twisted]: https://twistedmatrix.com [Scrapely]: https://github.com/scrapy/scrapely [Scrapy]: http://scrapy.org [scrapy documentation]: http://doc.scrapy.org/en/latest [vagrant]: http://www.vagrantup.com