formalsec/webcap

WebCap

*A flexible tool for building web scrapers*




About

WebCap delivers a flexible tool for building and executing web scrapers, offering two key modes:

  • Run Mode: Execute scraping tasks against target websites (currently imovirtual, idealista, supercasa, and casasapo). You can define search parameters (location, pages, max results) and choose a browser GUI.

  • Dev Mode: Rapidly develop new scrapers. You specify the elements to extract in natural language, and a selected LLM (currently gemini, gpt, deepseek, and claude) automatically generates the precise XPath queries, producing a fully functional scraper.

Key Features

  • LLM-Powered Development: Translates natural language into functional XPath logic.

  • Flexible Configuration: Supports command-line options, environment variables, and YAML configuration files for simplified usage.

  • Proxy Integration: Built-in support for ProxyScrape authentication and regional settings.

  • VSCode Browser: Option to run the browser instance directly within a VSCode tab for enhanced development.



Prerequisites

Install virtualenvwrapper (recommended)

Follow the official installation guide:
https://virtualenvwrapper.readthedocs.io/en/latest/install.html



Installation

To install all dependencies and configure the environment, run:

source install.sh

Formatting

For consistency, autopep8 formatting rules are strongly encouraged. To format all Python code, use:

cd src; ./format.sh

Execution Modes

Run Mode

You can perform a web scraping run on supercasa using:

webcap run \
    supercasa \
    --target lisboa \
    --pages 2 \
    --results 60 \
    --db houses.db \
    --gui browser

Parameters:

  • run: Run mode. (Options: run, dev, gen.)
  • supercasa: The target website. (Use --help for all options.)
  • --target: The target location or category to search for (e.g., lisboa).
  • --pages: The number of pages to scrape.
  • --results: The maximum number of results to store.
  • --db: The database name in which to store the results.
  • --gui: Visually open the navigation browser. (Current options: browser, vscode.)

The previous command will output a directory .webcap containing two files:

  1. houses.db - The SQLite database containing the scraping results.
  2. data.json - The scraping results in JSON format.

The location of the generated .webcap directory can also be configured using the --out (-o) flag.


Dev Mode

You can interactively develop a demo scraper for supercasa using:

webcap dev supercasa --target lisboa --llm deepseek 

Parameters:

  • dev: Development mode.
  • --llm: LLM used for XPath query generation. (Current options: claude, deepseek, gemini, gpt).

In development mode you will be prompted to specify, in natural language, each element to extract from the webpage.

Each requested element is initially registered through a placeholder call:

webcap.fetch('#1')

After the development session completes, these placeholder calls are automatically replaced with generated XPath queries:

webcap.fetch('#1') → webcap.fetch('#1', <generated-query>)

This produces a fully functional scraper with all extraction queries filled in.
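The substitution step can be pictured as a plain text rewrite over the generated scraper source. The sketch below is purely illustrative (it is not webcap's actual implementation); the regex, the `queries` mapping, and the XPath value are all assumptions:

```python
import re

def fill_placeholders(source: str, queries: dict) -> str:
    """Replace webcap.fetch('#N') placeholders with the generated
    XPath query for that element id. `queries` maps '#1' -> XPath."""
    def repl(match):
        element_id = match.group(1)
        xpath = queries[element_id]
        return f"webcap.fetch('{element_id}', '{xpath}')"
    # Match calls of the form webcap.fetch('#1') with no second argument.
    return re.sub(r"webcap\.fetch\('(#\d+)'\)", repl, source)

# Example: one placeholder filled with its generated query.
src = "price = webcap.fetch('#1')"
out = fill_placeholders(src, {"#1": "//h1/text()"})
# out == "price = webcap.fetch('#1', '//h1/text()')"
```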

Gen Mode

One can generate a scraper for a given website using:

webcap gen <custom-scraper> <url>

This command prompts the user for a description of what to scrape and how to extract the data. Based on this input, it generates a Python file named <custom-scraper>.py, containing a scraper configured for the website specified by <url>.

Developing

To generate or refine the XPath queries for each fetch() call in the synthesised scraper, one can run the scraper in development mode using:

python3 custom-scraper.py dev

This mode behaves similarly to the dev mode described above.

Running

Once the scraper is configured, i.e., all fetch() calls contain a valid XPath expression, one can run it using:

python3 custom-scraper.py run

This run mode also supports the options described in the run mode section above.



Configurations

Proxy Configuration

If you wish to use a ProxyScrape proxy, you’ll need authentication credentials (username and password). These can be provided either through command-line options or the environment variable PROXY_AUTH.

Using the CLI:

webcap run supercasa -proxy --auth=<username>:<password>

Using an Env Variable:

PROXY_AUTH=<username>:<password> webcap run supercasa -proxy

Setting the Environment Variable Permanently

You can set the PROXY_AUTH variable in your shell configuration file so that it’s automatically loaded in future sessions. Example:

echo "export PROXY_AUTH=<username>:<password>" >> ~/.zshrc
source ~/.zshrc
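Credentials in PROXY_AUTH follow the `<username>:<password>` shape. A minimal sketch of how such a value can be split (illustrative only; the function name is an assumption, not webcap's actual parsing code):

```python
import os

def parse_proxy_auth(value: str):
    """Split a '<username>:<password>' string into its two parts.
    Passwords may themselves contain ':', so split only on the first."""
    username, _, password = value.partition(":")
    if not username or not password:
        raise ValueError("expected PROXY_AUTH in the form user:pass")
    return username, password

# Typically the value would come from the environment, e.g.:
#   user, pwd = parse_proxy_auth(os.environ["PROXY_AUTH"])
user, pwd = parse_proxy_auth("alice:s3cret")
# user == "alice", pwd == "s3cret"
```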

Proxy Settings

You can optionally configure the proxy used during scraping. These settings allow you to control aspects such as the region and authentication.

Region Selection

By default, the proxy region is chosen randomly. To select a specific region, pass the region option through the --proxy-config flag:

webcap run supercasa --proxy-config region=pt

In this example:

  • region=pt selects Portugal as the proxy region.

  • The proxy authentication is provided via the PROXY_AUTH environment variable.

Databases

By default, the database is saved as .webcap/webcap.db. You can customize this behavior with the --out and --db options in the following ways:

  1. Specify the database folder: --out <folder>;

  2. Specify a custom database name: --db file=<name>;

  3. Create a new database, overwriting an existing one:

--out <existing_folder> --db file=<existing_name> overwrite=true

Convert Database to JSON

You can convert a SQLite database to JSON using the helper script db2json.py.
To dump all tables from a database, run:

./db2json.py databases/webcap.db

For a full list of available options, use: ./db2json.py --help
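Conceptually, the conversion walks every table in the SQLite file and serialises its rows. The self-contained sketch below shows that idea; the output shape (a `{table: [row, ...]}` mapping) is an assumption for illustration, not necessarily db2json.py's actual behaviour:

```python
import json
import os
import sqlite3
import tempfile

def dump_db(path: str) -> dict:
    """Return {table_name: [row_dict, ...]} for every table in the database."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # rows become name-addressable
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    dump = {t: [dict(row) for row in conn.execute(f"SELECT * FROM {t}")]
            for t in tables}
    conn.close()
    return dump

# Demo: build a tiny database on disk, then dump it to JSON.
tmp = tempfile.NamedTemporaryFile(suffix=".db", delete=False)
tmp.close()
conn = sqlite3.connect(tmp.name)
conn.execute("CREATE TABLE houses (id INTEGER, price INTEGER)")
conn.execute("INSERT INTO houses VALUES (1, 250000)")
conn.commit()
conn.close()
result = dump_db(tmp.name)
as_json = json.dumps(result, indent=2)
os.unlink(tmp.name)
```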

Using a config file

To simplify usage, you can provide a YAML configuration file containing all the settings that would be passed as command-line options.

For example, a config.yaml file might look like this:

target: lisboa
out: supercasa
pages: 20
proxy:
  auth:
    username: <user>
    password: <pass>
  region: pt
db:
  file: houses.db
  overwrite: true
logging:
  level: info
  file: out.log

You can then run the scraper using:

webcap run supercasa --config config.yaml

This is equivalent to passing the same options directly via the command line:

webcap run supercasa \
        --target lisboa \
        --out supercasa \
        --auth <user>:<pass> \
        --proxy region=pt \
        --db file=houses.db overwrite=true \
        --logging level=info file=out.log
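A config file is just a nested mapping that gets flattened into the same options. The sketch below shows one such flattening for a dict like the YAML above (illustrative only; the exact flag mapping is an assumption about the CLI, and loading the YAML itself, e.g. with PyYAML's yaml.safe_load, is omitted to keep the snippet dependency-free):

```python
def config_to_args(config: dict) -> list:
    """Flatten a nested config mapping into webcap-style CLI arguments.
    Nested sections become '--section key=value ...' groups, mirroring
    the README example; this is a sketch, not a definitive mapping."""
    args = []
    for key, value in config.items():
        if isinstance(value, dict):
            args.append(f"--{key}")
            args.extend(f"{k}={v}" for k, v in value.items())
        else:
            args.extend([f"--{key}", str(value)])
    return args

# Note: YAML booleans surface in Python as True/False.
config = {"target": "lisboa", "pages": 20,
          "db": {"file": "houses.db", "overwrite": True}}
args = config_to_args(config)
# args == ['--target', 'lisboa', '--pages', '20',
#          '--db', 'file=houses.db', 'overwrite=True']
```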


Plugins

VSCode Browser

Instead of using Playwright’s default browser window, you can run the browser inside a VSCode tab. To do this:

  1. Open the repository in VSCode.
  2. Press F5 in the repository’s root directory, or go to Run and Debug and select Run VSCode Browser.

This will open a new VSCode window. In this window, you can start a browser instance using the command palette:

  1. Open the command palette (Ctrl+Shift+P or Cmd+Shift+P on macOS).
  2. Run the command Browser: Start Server.
  3. Run webcap normally as shown before.

Visual Guide

Screenshots illustrating steps 2 and 4 above (images not reproduced here).


License

This project is licensed under the GPL-3.0 License; see LICENSE for details.
