A Rust implementation of the WebGraph framework for graph compression.
WebGraph is a framework for graph compression aimed at studying web graphs, but currently being applied to several other types of graphs. It provides simple ways to manage very large graphs, exploiting modern compression techniques. More precisely, it is currently made of:
-
Algorithms for compressing web graphs that exploit gap compression and differential compression (à la LINK), intervalization, and ζ codes to provide a high compression ratio (see our datasets). The algorithms are controlled by several parameters, which provide different tradeoffs between access speed and compression ratio.
-
Algorithms for accessing a compressed graph without actually decompressing it, using lazy techniques that delay the decompression until it is actually necessary.
-
Algorithms for analyzing very large graphs, such as HyperBall, which has been used to show that Facebook has just four degrees of separation.
-
A Java implementation of the algorithms above, now in maintenance mode.
-
This crate, providing a complete, documented implementation of the algorithms above in Rust. It is free software distributed under either the GNU Lesser General Public License 2.1+ or the Apache Software License 2.0.
-
Data sets for large graphs (e.g., billions of links).
You are welcome to use and improve WebGraph for your research work! If you find our software useful for research, please cite the following papers in your own:
-
"WebGraph: The Next Generation (Is in Rust)", by Tommaso Fontana, Sebastiano Vigna, and Stefano Zacchiroli, in WWW '24: Companion Proceedings of the ACM on Web Conference 2024, pages 686–689. DOI 10.1145/3589335.3651581.
-
"The WebGraph Framework I: Compression Techniques", by Paolo Boldi and Sebastiano Vigna, in Proc. of the 13th international conference on World Wide Web, WWW 2004, pages 595–602, ACM. DOI 10.1145/988672.988752.
A graph in BV format consists of a BASENAME.graph file (the compressed
bitstream), a BASENAME.properties file (metadata), and a BASENAME.offsets
file (pointers into the bitstream). You can download graphs from the LAW web
site or the Common Crawl web site.
For random access to the successors of a node, you also need a BASENAME.ef
file containing an Elias–Fano representation of the offsets. You can build it
with the command-line interface (webgraph build ef BASENAME) or
programmatically with store_ef.
To load a graph with random access:
let graph = BvGraph::with_basename("BASENAME").load()?;BvGraph::with_basename returns a LoadConfig that can be further
customized, selecting endianness, type of memory access (memory mapping, full
loading, etc.), and so on. By default you will get big endianness, memory
mapping, and dynamic code dispatch.
If you only need sequential access (e.g., scanning all arcs), use
BvGraphSeq, which does not require the .ef file:
let graph = BvGraphSeq::with_basename("BASENAME").load()?;You can retrieve the successors of a node or iterate on the whole
graph. In particular, using the handy for_ macro:
for_![(src, succ) in graph {
for dst in succ {
[do something with the arc src -> dst]
}
}];Note that on Windows memory mapping requires that the length of the graph file
is a multiple of the internal bit buffer. You can use the CLI command run pad u32 to ensure that your graph file is properly padded.
Several structures are available for building graphs in memory:
-
VecGraphandLabeledVecGraph: mutable graphs backed by vectors; arcs must be added in increasing successor order. Serializable with ε-serde. -
BTreeGraphandLabeledBTreeGraph: mutable graphs backed by B-tree maps; arcs can be added in any order. -
CsrGraph: a classical immutable Compressed Sparse Row representation, useful for algorithms that need fast random access without compression overhead. Serializable with ε-serde.
All these types (except ArcListGraph) can also be serialized with serde
using the feature gate serde.
The command-line interface provides a from arcs subcommand that reads
tab-separated arcs from standard input and compresses them directly into BV
format.
From code, you have several options depending on how your data is organized:
-
If you can generate arcs in sorted order (by source, then by target), wrap your iterator in an
ArcListGraphand pass it directly toBvCompConfig::comp_graph; no intermediate storage is needed. -
If your arcs are unsorted,
ParSortedGraph::from_pairsaccepts an iterator on(usize, usize)pairs, sorts them, and returns a graph ready for parallel compression. -
If you can produce arcs in parallel as a Rayon parallel iterator,
ParSortedGraph::par_from_pairsdoes the same with parallel sorting.
In all cases the result implements IntoParLenders, so it can be passed to
BvCompConfig::par_comp for parallel compression. For full control over
deduplication, memory budget, and progress logging, use the
ParSortedGraphConf builder obtained via ParSortedGraph::config().
All of the above also works for labeled graphs: just use ((usize, usize), L)
pairs instead of (usize, usize) and the corresponding
ParSortedLabeledGraph methods.
The entry point for compression is BvComp::with_basename, which returns a
BvCompConfig builder. Compression parameters (window size, codes for
different components, etc.) are controlled by CompFlags. compression
paramters are described in detail in the bvgraph module documentation.
To compress sequentially a graph that implements SequentialGraph:
BvComp::with_basename("BASENAME")
.comp_graph::<BE>(graph)?;For better compression ratios at the cost of a longer compression time, use
BvCompConfig::with_bvgraphz, which enables Zuckerli-like
dynamic-programming reference selection.
Any graph implementing IntoParLenders can be compressed in parallel:
BvComp::with_basename("BASENAME")
.par_comp::<BE, _>(&graph)?;All common graph types implement IntoParLenders automatically via
SplitLabeling. Transparent wrappers can adjust the splitting:
ParGraph: overrides the number of parallel parts.ParDcfGraph: uses a precomputed degree cumulative function to balance work by arcs rather than by nodes.
Graphs with labels can be compressed with
BvCompConfig::comp_labeled_graph (sequential) or
BvCompConfig::par_comp_labeled (parallel). Label storage is controlled by a
StoreLabelsConfig implementation; BitStreamStoreLabelsConfig provides
bitstream-based storage with optional Zstandard compression.
The transform module provides common graph operations, each available in
sequential and parallel (SplitLabeling-based) variants:
- Transpose: reverse all arcs
(
transpose_splitfor parallel,transpose_labeledfor labeled graphs). - Symmetrize: add missing reverse arcs, optionally removing
self-loops (
symmetrize_splitfor parallel). - Permute: renumber nodes according to a permutation
(
permute_splitfor parallel). - Map: renumber nodes through an arbitrary function, with
deduplication (
map_splitfor parallel).
Lightweight wrappers combine or modify data without copying:
PermutedGraph: applies a node permutation lazily.UnionGraph: merges arcs from two graphs.NoSelfLoopsGraph: filters out self-loops.JavaPermutation: reads Java WebGraph permutation files.ArcListGraph: a lazy sequential graph backed by an iterator of arcs, useful for feeding data directly into the compressor without materializing the graph in memory.
Graphs are a special case of labelings: a SequentialLabeling assigns a
sequence of labels to each node, and a graph is simply a labeling whose labels
are usize successors. You can attach additional labels to a graph by
zipping it with a labeling.
Bitstream-based labelings are provided by BitStreamLabeling (random access)
and BitStreamLabelingSeq (sequential). Custom label (de)serializers
implement the BitDeserializer/BitSerializer traits. The Left and
Right projections extract one component from a zipped labeling.
The visits module provides breadth-first and depth-first traversals, both
sequential and parallel:
- BFS:
breadth_first::Seq,breadth_first::ParFair(fair division of work) andbreadth_first::ParLowMem(memory-efficient). - DFS:
depth_first::SeqIter.
Visits use a callback-based API with event types (EventPred/EventNoPred)
that carry predecessor/parent information when requested.
BfsOrder and DfsOrder wrap the visit machinery into standard iterators
that yield nodes in BFS or DFS order. BfsOrderFromRoots starts the visit
from a given set of roots.
ErdosRenyi generates Erdős–Rényi random graphs with a given number of nodes
and edge probability, useful for testing and benchmarking.
par_map_fold: parallel map-fold over iterators, an alternative to Rayon'sParallelBridgethat avoids some of its bottlenecks.graph::eq,graph::eq_labeled,labels::eq_sorted: equality checks for graphs and labelings, useful in tests.labels::check_impl: verifies that the sequential and random-access implementations of a labeling are consistent.
We provide a command-line interface to perform various operations on unlabeled graphs (compression, transformation, analysis, etc.).
This software has been partially supported by project SERICS (PE00000014) under the NRRP MUR program funded by the EU - NGEU, and by project ANR COREGRAPHIE, grant ANR-20-CE23-0002 of the French Agence Nationale de la Recherche. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the Italian MUR. Neither the European Union nor the Italian MUR can be held responsible for them.