*A flexible tool for building web scrapers*
About • Installation • Execution • Plugins • License
WebCap delivers a flexible tool for building and executing web scrapers, offering two key modes:
- Run Mode: Execute scraping tasks against target websites (currently `imovirtual`, `idealista`, `supercasa`, and `casasapo`). You can define search parameters (location, pages, max results) and choose a browser GUI.
- Dev Mode: Rapidly develop new scrapers. You specify the elements to extract in natural language, and a selected LLM (currently `gemini`, `gpt`, `deepseek`, and `claude`) automatically generates the precise XPath queries for a fully functional result.
- LLM-Powered Development: Translates natural language into functional XPath logic.
- Flexible Configuration: Supports command-line options, environment variables, and YAML configuration files for simplified usage.
- Proxy Integration: Built-in support for ProxyScrape authentication and regional settings.
- VSCode Browser: Option to run the browser instance directly within a VSCode tab for enhanced development.
Follow the official virtualenvwrapper installation guide:
https://virtualenvwrapper.readthedocs.io/en/latest/install.html
To install all dependencies and configure the environment, run:
```shell
source install.sh
```

For clarity, autopep8 formatting rules are strongly encouraged.
To format all Python code, use:

```shell
cd src; ./format.sh
```

You can perform a web scraping run on supercasa using:
```shell
webcap run \
    supercasa \
    --target lisboa \
    --pages 2 \
    --results 60 \
    --db houses.db \
    --gui browser
```

- `run`: Run mode. (Options: `run`, `dev`.)
- `supercasa`: The target website. (Use `--help` for all options.)
- `--target`: The target location or category to search for (e.g., `lisboa`).
- `--pages`: The number of pages to scrape.
- `--results`: The maximum number of results to store.
- `--db`: The database name in which to store the results.
- `--gui`: Visually open the navigation browser. (Current options: `browser` and `vscode`.)
The previous command will output a directory `.webcap` containing two files:

- `houses.db`: The actual SQLite database containing the scraping results.
- `data.json`: The scraping results in JSON format.

The generated `.webcap` directory can also be configured using the `--out` (`-o`) flag.
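As a quick sanity check on the generated `houses.db`, you can inspect it from Python. A minimal sketch using only the standard library (the table layout inside the database is not documented here, so this just lists whatever tables exist):

```python
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

# Example: list_tables(".webcap/houses.db")
```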
You can interactively develop a demo scraper for supercasa using:
```shell
webcap dev supercasa --target lisboa --llm deepseek
```

- `dev`: Development mode.
- `--llm`: LLM used for XPath query generation. (Current options: `claude`, `deepseek`, `gemini`, `gpt`.)
In development mode you will be prompted to specify, in natural language, each element to extract from the webpage.
Each requested element is initially registered through a placeholder call:
```python
webcap.fetch('#1')
```

After the development session completes, these placeholder calls are automatically replaced with generated XPath queries:

```
webcap.fetch('#1') → webcap.fetch('#1', <generated-query>)
```

This produces a fully functional scraper with all extraction queries filled in.
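The substitution step can be pictured as a simple textual rewrite. A hypothetical sketch, not WebCap's actual implementation, of replacing each placeholder with its generated query:

```python
import re

def fill_placeholders(source, queries):
    """Replace webcap.fetch('#n') placeholders with generated XPath queries.

    `queries` maps placeholder ids (e.g. '#1') to XPath strings.
    """
    def substitute(match):
        placeholder = match.group(1)
        xpath = queries[placeholder]
        return f"webcap.fetch('{placeholder}', '{xpath}')"

    return re.sub(r"webcap\.fetch\('(#\d+)'\)", substitute, source)

print(fill_placeholders("price = webcap.fetch('#1')", {"#1": "//h1/text()"}))
# price = webcap.fetch('#1', '//h1/text()')
```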
You can generate a scraper for a given website using:

```shell
webcap gen <custom-scraper> <url>
```

This command prompts the user for a description of what to scrape and how to extract the data. Based on this input, it generates a Python file named `<custom-scraper>.py`, containing a scraper configured for the website specified by `<url>`.
To generate or refine the XPath queries for each fetch() call in the synthesised scraper, one can run the scraper in development mode using:
```shell
python3 custom-scraper.py dev
```

This mode behaves similarly to the dev mode described above.
Once the scraper is configured, i.e., all the `fetch` calls contain a valid XPath expression, you can run it using:

```shell
python3 custom-scraper.py run
```

This run mode also supports the options described in the run mode section above.
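Purely as an illustration of the two entry points above, a generated `<custom-scraper>.py` can be imagined dispatching on its first argument like this (the file actually produced by `webcap gen` may look quite different):

```python
import sys

VALID_MODES = ("dev", "run")

def pick_mode(argv):
    """Return the requested mode ('dev' or 'run'), defaulting to 'run'."""
    mode = argv[1] if len(argv) > 1 else "run"
    if mode not in VALID_MODES:
        raise SystemExit(f"usage: {argv[0]} [dev|run]")
    return mode

if __name__ == "__main__":
    mode = pick_mode(sys.argv)
    # 'dev' would prompt for element descriptions and generate XPath queries;
    # 'run' would execute the scraper with the filled-in queries.
```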
If you wish to use a ProxyScrape proxy, you’ll need authentication credentials (username and password).
These can be provided either through command-line options or the environment variable PROXY_AUTH.
```shell
webcap run supercasa -proxy --auth=<username>:<password>
PROXY_AUTH=<username>:<password> webcap run supercasa -proxy
```

You can set the PROXY_AUTH variable in your shell configuration file so that it’s automatically loaded in future sessions. Example:

```shell
echo "export PROXY_AUTH=<username>:<password>" >> ~/.zshrc
source ~/.zshrc
```

You can optionally configure the proxy used during scraping. These settings allow you to control aspects such as the region and authentication.
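If you script around webcap, the same credential format is easy to parse yourself. A small sketch whose only assumption is the `username:password` layout described above:

```python
import os

def proxy_credentials(env=os.environ):
    """Split PROXY_AUTH ('username:password') into a (user, password) pair.

    Returns None when the variable is unset.
    """
    raw = env.get("PROXY_AUTH")
    if raw is None:
        return None
    username, _, password = raw.partition(":")
    return username, password

# Example: with PROXY_AUTH=alice:s3cret set, returns ("alice", "s3cret")
```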
By default, the proxy region is chosen randomly. To select a specific region, pass the region option through the --proxy-config flag:
```shell
webcap run supercasa --proxy-config region=pt
```

In this example:

- `region=pt` selects Portugal as the proxy region.
- The proxy authentication is provided via the `PROXY_AUTH` environment variable.
By default, the database is saved as `.webcap/webcap.db`. You can customize this behavior with the `--out` and `--db` options:

- Specify the database folder: `--out <folder>`;
- Specify a custom database name: `--db file=<name>`;
- Create a new database, overwriting an existing one: `--out <existing_folder> --db file=<existing_name> overwrite=true`.

You can convert a SQLite database to JSON using the helper script db2json.py.
To dump all tables from a database, run:

```shell
./db2json.py databases/webcap.db
```

For a full list of available options, use `./db2json.py --help`.
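The conversion that db2json.py performs can be approximated in a few lines of standard-library Python. A sketch, independent of the script's actual options, that dumps every table into a JSON-ready dict:

```python
import json
import sqlite3

def db_to_dict(db_path):
    """Dump all tables of a SQLite database into {table: [row dicts]}."""
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        tables = [
            name for (name,) in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table'"
            )
        ]
        # Table names come from sqlite_master itself, so the
        # string interpolation below is safe.
        return {
            table: [dict(row) for row in conn.execute(f"SELECT * FROM {table}")]
            for table in tables
        }

# Example: print(json.dumps(db_to_dict(".webcap/webcap.db"), indent=2))
```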
To simplify usage, you can provide a YAML configuration file containing all the settings that would be passed as command-line options.
For example, a config.yaml file might look like this:
```yaml
target: lisboa
out: supercasa
pages: 20
proxy:
  auth:
    username: <user>
    password: <pass>
  region: pt
db:
  file: houses.db
  overwrite: true
logging:
  level: info
  file: out.log
```

You can then run the scraper using:

```shell
webcap run supercasa --config config.yaml
```

This is equivalent to passing the same options directly via the command line:
```shell
webcap run supercasa \
    --target lisboa \
    --out supercasa \
    --auth <user>:<pass> \
    --proxy region=pt \
    --db file=houses.db overwrite=true \
    --logging level=info file=out.log
```

Instead of using Playwright’s default browser window, you can run the browser inside a VSCode tab. To do this:
- Open the repository in VSCode.
- Press F5 in the repository’s root directory, or go to Run and Debug and select `Run VSCode Browser`.
This will open a new VSCode window. In this window, you can start a browser instance using the command palette:
- Open the command palette (Ctrl+Shift+P, or Cmd+Shift+P on macOS).
- Run the command `Browser: Start Server`.
- Run `webcap` normally as shown before.
This project is licensed under the GPL-3.0 License -- see LICENSE for details.


