Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

WebGraph

crates.io badge docs.rs badge rustc badge CI badge license badge downloads badge coveralls badge

A Rust implementation of the WebGraph framework for graph compression.

WebGraph is a framework for graph compression aimed at studying web graphs, but currently being applied to several other types of graphs. It provides simple ways to manage very large graphs, exploiting modern compression techniques. More precisely, it is currently made of:

  • Algorithms for compressing web graphs that exploit gap compression and differential compression (à la LINK), intervalization, and ζ codes to provide a high compression ratio (see our datasets). The algorithms are controlled by several parameters, which provide different tradeoffs between access speed and compression ratio.

  • Algorithms for accessing a compressed graph without actually decompressing it, using lazy techniques that delay the decompression until it is actually necessary.

  • Algorithms for analyzing very large graphs, such as HyperBall, which has been used to show that Facebook has just four degrees of separation.

  • A Java implementation of the algorithms above, now in maintenance mode.

  • This crate, providing a complete, documented implementation of the algorithms above in Rust. It is free software distributed under either the GNU Lesser General Public License 2.1+ or the Apache Software License 2.0.

  • Data sets for large graphs (e.g., billions of links).

Users of WebGraph

        

Citation

You are welcome to use and improve WebGraph for your research work! If you find our software useful for research, please cite the following papers in your own:

Loading a compressed graph

A graph in BV format consists of a BASENAME.graph file (the compressed bitstream), a BASENAME.properties file (metadata), and a BASENAME.offsets file (pointers into the bitstream). You can download graphs from the LAW web site or the Common Crawl web site.

For random access to the successors of a node, you also need a BASENAME.ef file containing an Elias–Fano representation of the offsets. You can build it with the command-line interface (webgraph build ef BASENAME) or programmatically with store_ef.

To load a graph with random access:

let graph = BvGraph::with_basename("BASENAME").load()?;

BvGraph::with_basename returns a LoadConfig that can be further customized, selecting endianness, type of memory access (memory mapping, full loading, etc.), and so on. By default you will get big endianness, memory mapping, and dynamic code dispatch.

If you only need sequential access (e.g., scanning all arcs), use BvGraphSeq, which does not require the .ef file:

let graph = BvGraphSeq::with_basename("BASENAME").load()?;

You can retrieve the successors of a node or iterate on the whole graph. In particular, using the handy for_ macro:

for_![(src, succ) in graph {
    for dst in succ {
        [do something with the arc src -> dst]
    }
}];

Note that on Windows memory mapping requires that the length of the graph file is a multiple of the internal bit buffer. You can use the CLI command run pad u32 to ensure that your graph file is properly padded.

In-memory graph representations

Several structures are available for building graphs in memory:

  • VecGraph and LabeledVecGraph: mutable graphs backed by vectors; arcs must be added in increasing successor order. Serializable with ε-serde.

  • BTreeGraph and LabeledBTreeGraph: mutable graphs backed by B-tree maps; arcs can be added in any order.

  • CsrGraph: a classical immutable Compressed Sparse Row representation, useful for algorithms that need fast random access without compression overhead. Serializable with ε-serde.

All these types (except ArcListGraph) can also be serialized with serde using the feature gate serde.

Importing your data

The command-line interface provides a from arcs subcommand that reads tab-separated arcs from standard input and compresses them directly into BV format.

From code, you have several options depending on how your data is organized:

  • If you can generate arcs in sorted order (by source, then by target), wrap your iterator in an ArcListGraph and pass it directly to BvCompConfig::comp_graph; no intermediate storage is needed.

  • If your arcs are unsorted, ParSortedGraph::from_pairs accepts an iterator on (usize, usize) pairs, sorts them, and returns a graph ready for parallel compression.

  • If you can produce arcs in parallel as a Rayon parallel iterator, ParSortedGraph::par_from_pairs does the same with parallel sorting.

In all cases the result implements IntoParLenders, so it can be passed to BvCompConfig::par_comp for parallel compression. For full control over deduplication, memory budget, and progress logging, use the ParSortedGraphConf builder obtained via ParSortedGraph::config().

All of the above also works for labeled graphs: just use ((usize, usize), L) pairs instead of (usize, usize) and the corresponding ParSortedLabeledGraph methods.

Compressing graphs

The entry point for compression is BvComp::with_basename, which returns a BvCompConfig builder. Compression parameters (window size, codes for different components, etc.) are controlled by CompFlags. compression paramters are described in detail in the bvgraph module documentation.

To compress sequentially a graph that implements SequentialGraph:

BvComp::with_basename("BASENAME")
    .comp_graph::<BE>(graph)?;

For better compression ratios at the cost of a longer compression time, use BvCompConfig::with_bvgraphz, which enables Zuckerli-like dynamic-programming reference selection.

Parallel compression

Any graph implementing IntoParLenders can be compressed in parallel:

BvComp::with_basename("BASENAME")
    .par_comp::<BE, _>(&graph)?;

All common graph types implement IntoParLenders automatically via SplitLabeling. Transparent wrappers can adjust the splitting:

Labeled compression

Graphs with labels can be compressed with BvCompConfig::comp_labeled_graph (sequential) or BvCompConfig::par_comp_labeled (parallel). Label storage is controlled by a StoreLabelsConfig implementation; BitStreamStoreLabelsConfig provides bitstream-based storage with optional Zstandard compression.

Graph transforms

The transform module provides common graph operations, each available in sequential and parallel (SplitLabeling-based) variants:

Graph and data wrappers

Lightweight wrappers combine or modify data without copying:

  • PermutedGraph: applies a node permutation lazily.
  • UnionGraph: merges arcs from two graphs.
  • NoSelfLoopsGraph: filters out self-loops.
  • JavaPermutation: reads Java WebGraph permutation files.
  • ArcListGraph: a lazy sequential graph backed by an iterator of arcs, useful for feeding data directly into the compressor without materializing the graph in memory.

Labels

Graphs are a special case of labelings: a SequentialLabeling assigns a sequence of labels to each node, and a graph is simply a labeling whose labels are usize successors. You can attach additional labels to a graph by zipping it with a labeling.

Bitstream-based labelings are provided by BitStreamLabeling (random access) and BitStreamLabelingSeq (sequential). Custom label (de)serializers implement the BitDeserializer/BitSerializer traits. The Left and Right projections extract one component from a zipped labeling.

Graph traversals

The visits module provides breadth-first and depth-first traversals, both sequential and parallel:

Visits use a callback-based API with event types (EventPred/EventNoPred) that carry predecessor/parent information when requested.

Graph iterators

BfsOrder and DfsOrder wrap the visit machinery into standard iterators that yield nodes in BFS or DFS order. BfsOrderFromRoots starts the visit from a given set of roots.

Random graphs

ErdosRenyi generates Erdős–Rényi random graphs with a given number of nodes and edge probability, useful for testing and benchmarking.

Utilities

Command-line interface

We provide a command-line interface to perform various operations on unlabeled graphs (compression, transformation, analysis, etc.).

Acknowledgments

This software has been partially supported by project SERICS (PE00000014) under the NRRP MUR program funded by the EU - NGEU, and by project ANR COREGRAPHIE, grant ANR-20-CE23-0002 of the French Agence Nationale de la Recherche. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the Italian MUR. Neither the European Union nor the Italian MUR can be held responsible for them.