As noted in the root README, this repository's overall structure is a work in progress: the contents of this directory are still being integrated into the larger tool package.
The content_dedup/ directory contains the content-based deduplication code used to generate perceptual hashes of test-set images so they could be checked against our training data for potential overlap. It also contains early explorations of the method (in content_dedup/experiments/).
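For illustration, here is a minimal sketch of this kind of perceptual-hash comparison using the imagehash library; the directory names and the Hamming-distance threshold are placeholders, not the pipeline's actual values:

```python
from pathlib import Path

import imagehash
from PIL import Image

# Hypothetical directories standing in for the training data and a test set.
train_hashes = [
    (imagehash.phash(Image.open(p)), p)
    for p in sorted(Path("train_images").glob("*.jpg"))
]

MAX_DIST = 8  # illustrative Hamming-distance threshold; tune for your data

for test_path in sorted(Path("test_images").glob("*.jpg")):
    test_hash = imagehash.phash(Image.open(test_path))
    for train_hash, train_path in train_hashes:
        # Subtracting two ImageHash objects yields their Hamming distance.
        if test_hash - train_hash <= MAX_DIST:
            print(f"possible overlap: {test_path} ~ {train_path}")
            break
```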
The data/ directory contains the support set embeddings used for museum specimen image filtering (as detailed in Appendix I.2.1 of our paper).
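As a rough sketch of how such embeddings can drive filtering, the snippet below assigns an image the label of its most similar support embedding; the file names, array shapes, and nearest-neighbor decision rule here are assumptions for illustration, not the exact procedure from the paper:

```python
import numpy as np

# Hypothetical file names; assumes the support embeddings are L2-normalized
# rows with one class label per row.
support = np.load("data/support_set_embeddings.npy")  # (n_support, d)
labels = np.load("data/support_set_labels.npy")       # (n_support,)

def classify(image_emb: np.ndarray) -> int:
    """Label an image by its most similar support-set embedding (cosine similarity)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    sims = support @ image_emb  # cosine similarity, given normalized rows
    return int(labels[np.argmax(sims)])
```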
This folder contains a series of files named requirements_<processing-step>.txt, each listing the packages required for the corresponding processing step (e.g., requirements_batch_camera_trap.txt for camera trap image processing).
The notebooks/ directory houses the notebooks used to perform the respective processing steps (most, if not all, of which were later adapted into modules run from slurm scripts in the scripts/ directory).
Be sure to set the appropriate BASE_PATH value at the top of each notebook.
Be sure to set the appropriate BASE_DIR variables in the scripts/mongo/ and scripts/processing/ directories; the latter also requires replacing YOUR_ACCOUNT with the appropriate account code.
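For example, the notebook-side configuration is a single assignment near the top of the notebook (the path below is a placeholder); the BASE_DIR and YOUR_ACCOUNT edits in the scripts are analogous find-and-replace changes:

```python
# Hypothetical value; point this at the root of your local copy of the data.
BASE_PATH = "/fs/scratch/your_project/TreeOfLife"
```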
BioCLIP 2 text embeddings of TreeOfLife-200M were generated with make_txt_embedding.py, using txt_emb_species.json to provide the species names. More information about the JSON is provided in the TreeOfLife-200M embeddings/README.
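A minimal sketch of what generating such text embeddings looks like with the open_clip API, assuming the BioCLIP 2 weights are pulled from the Hugging Face Hub under the identifier shown and that the JSON holds a flat list of species-name strings (both are assumptions; make_txt_embedding.py is the authoritative implementation):

```python
import json

import open_clip
import torch

# Assumption: BioCLIP 2 is available on the Hugging Face Hub under this name.
model, _, _ = open_clip.create_model_and_transforms("hf-hub:imageomics/bioclip-2")
tokenizer = open_clip.get_tokenizer("hf-hub:imageomics/bioclip-2")
model.eval()

with open("txt_emb_species.json") as f:
    species = json.load(f)  # assumed here: a flat list of species-name strings

with torch.no_grad():
    tokens = tokenizer(species)           # (N, context_len) token ids; batch in practice
    emb = model.encode_text(tokens)       # (N, D) text embeddings
    emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize, as in CLIP retrieval

torch.save(emb, "txt_emb_species.pt")
```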
The requirements and config files for converting the TreeOfLife structured dataset to webdataset format are requirements_tol2webdataset.txt and tol2webdataset_full_224.yaml, respectively. This conversion is run through scripts/t2w_submit.sh, using the tol2webdataset scripts and modules.
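For orientation, here is a minimal sketch of writing an image dataset into webdataset shards with the webdataset library's ShardWriter; the paths, shard pattern, resize behavior, and metadata fields are placeholders, as the real run is configured by tol2webdataset_full_224.yaml:

```python
import io
import json
from pathlib import Path

import webdataset as wds
from PIL import Image

SRC = Path("structured_dataset")  # hypothetical root of the structured dataset
PATTERN = "tol-224-%06d.tar"      # placeholder shard-name pattern

with wds.ShardWriter(PATTERN, maxcount=10_000) as sink:
    for img_path in sorted(SRC.glob("**/*.jpg")):
        # Resize to 224x224, mirroring the _full_224 config name.
        img = Image.open(img_path).convert("RGB").resize((224, 224))
        buf = io.BytesIO()
        img.save(buf, format="JPEG")
        sink.write({
            "__key__": img_path.stem,                              # key shared by a sample's files
            "jpg": buf.getvalue(),                                 # encoded image bytes
            "json": json.dumps({"path": str(img_path)}).encode(),  # stand-in metadata record
        })
```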