Package patricia-tree
This library contains a single module: PatriciaTree
.
This is version 0.10.0
of the library. It is known to work with OCaml versions ranging from 4.14
to 5.2
.
This is an OCaml library that implements sets and maps as Patricia Trees, as described in Okasaki and Gill's 1998 paper Fast mergeable integer maps. It is a space-efficient prefix trie over the big-endian representation of the key's integer identifier.
The source code of this library is available on Github under an LGPL-2.1 license.
This library was written by Matthieu Lemerre, then further improved by Dorian Lesbre, as part of the Codex semantics library, developed at CEA List.
Installation
This library can be installed with opam:
opam install patricia-tree
Alternatively, you can clone the source repository and install with dune:
git clone git@github.com:codex-semantics-library/patricia-tree.git
cd patricia-tree
opan install . --deps-only
dune build -p patricia-tree
dune install
# To build documentation
opam install . --deps-only --with-doc
dune build @doc
See the examples to jump right into using this library.
Features
- Similar to OCaml's
Map
andSet
, using the same function names when possible and the same convention for order of arguments. This should allow switching to and from Patricia Tree with minimal effort. - The functor parameters (
KEY
module) requires an injectiveto_int : t -> int
function instead of acompare
function.KEY.to_int
should be fast, and injective. This works well with hash-consed types. The Patricia Tree representation is stable, contrary to maps, inserting nodes in any order will return the same shape. This allows different versions of a map to share more subtrees in memory, and the operations over two maps to benefit from this sharing. The functions in this library attempt to maximally preserve sharing and benefit from sharing, allowing very important improvements in complexity and running time when combining maps or sets is a frequent operation.
To do so, these functions often have extra requirements on their argument (e.g.
inter f m1 m2
can be optimized by not inspecting common subtrees whenf
is idempotent). To avoid accidental errors, they are renamed (e.g. toidempotent_inter
for the efficient version andnonidempotent_inter_no_share
for the general one)Since our Patricia Tree use big-endian order on keys, the maps and sets are sorted in increasing unsigned order of keys. This means negative keys are sorted above positive keys, with
-1
being the largest possible key, and0
the smallest. This also avoids a bug in Okasaki's paper discussed in QuickChecking Patricia Trees by Jan Mitgaard.It also affects functions like
unsigned_min_binding
andpop_unsigned_minimum
. They will return the smallest positive integer of both positive and negative keys are present; and not the smallest negative, as one might expect.- Supports generic maps and sets: a
'm map
that maps'k key
to('k, 'm) value
. This is especially useful when using GADTs for the type of keys. This is also sometimes called a dependent map. - Allows easy and fast operations across different types of maps and set which have the same type of keys (e.g. an intersection between a map and a set).
- Multiple choices for internal representation (
NODE
), which allows for efficient storage (no need to store a value for sets), or using weak nodes only (values removed from the tree if no other pointer to it exists). This system can also be extended to store size information in nodes if needed. - Exposes a common interface (
PatriciaTree.NODE.view
) to allow users to write their own pattern matching on the tree structure without depending on theNODE
being used. - Additionally, hashconsed versions of heterogeneous/homogeneous maps/sets are available. These provide constant time equality and comparison, and ensure maps/set with the same constants are always physically equal. It comes at the cost of a constant overhead in memory usage (at worst, as hash-consing may allow memory gains) and constant time overhead when calling constructors.
Quick overview
Functors
This library contains a single module, PatriciaTree
. The functors used to build maps and sets are the following:
For homogeneous (non-generic) maps and sets:
MakeMap
andMakeSet
. These are similar to the standard library's maps and sets.module MakeMap(Key: KEY) : MAP with type key = Key.t module MakeSet(Key: KEY) : SET with type elt = Key.t
For Heterogeneous (generic) maps and sets:
MakeHeterogeneousMap
andMakeHeterogeneousSet
.module MakeHeterogeneousMap(Key: HETEROGENEOUS_KEY)(Value: HETEROGENEOUS_VALUE) : HETEROGENEOUS_MAP with type 'a key = 'a Key.t and type ('k,'m) value = ('k,'m) Value.t module MakeHeterogeneousSet(Key: HETEROGENEOUS_KEY) : HETEROGENEOUS_SET with type 'a elt = 'a Key.t
There are also hash-consed versions of these four functors:
MakeHashconsedMap
,MakeHashconsedSet
,MakeHashconsedHeterogeneousMap
andMakeHashconsedHeterogeneousSet
. These uniquely number their nodes, which means:equal
andcompare
become constant time operations;- two maps with the same bindings (where keys are compared by
KEY.to_int
and values byHASHED_VALUE.polyeq
) will always be physically equal; - functions that benefit from sharing will see improved performance;
- constructors are slightly slower, as they now require a hash-table lookup;
- memory usage is increased: nodes store their tags inside themselves, and a global hash-table of all built nodes must be maintained;
- hash-consed maps assume their values are immutable;
- WARNING: when using physical equality as
HASHED_VALUE.polyeq
, some maps of different types may be given the same identifier. See the end of the documentation ofHASHED_VALUE.polyeq
for details. Note that this is the case in the default implementationsHashedValue
andHeterogeneousHashedValue
. - All hash-consing functors are generative, since each functor call will create a new hash-table to store the created nodes. Calling a functor twice with same arguments will lead to two numbering systems for identifiers, and thus the types should not be considered compatible.
Interfaces
Here is a brief overview of the various module types of our library:
BASE_MAP
: the underlying module type of all our trees (maps end sets). It represents a'b map
binding'a key
to('a,'b) value
, as well as all functions needed to manipulate them.It can be accessed from any of the more specific maps types, thus providing a unified representation, useful for cross map operations. However, for practical purposes, it is often best to use the more specific interfaces:
HETEROGENEOUS_MAP
for heterogeneous maps (this is justBASE_MAP
with aWithForeign
functor).MAP
for homogeneous maps, this interface is close toStdlib.Map.S
.HETEROGENEOUS_SET
for heterogeneous sets (sets of'a elt
). These are just maps tounit
, but with a custom node representation to avoid storingunit
in nodes.SET
for homogeneous sets, this interface is close toStdlib.Set.S
.
The parameter of our functor are either
KEY
orHETEROGENEOUS_KEY
. These just consist of a type, a (polymorphic) equality function, and an injectiveto_int
coercion.The heterogeneous map functor also has a
HETEROGENEOUS_VALUE
parameter to specify the('a, 'b) value
type.The internal representations of our tree can be customized to use different internal
NODE
. Each node come with its own private constructors and destructors, as well as a cast to a uniformNODE.view
type used for pattern matching.A number of implementations are provided:
SimpleNode
: exactly theNODE.view
type;WeakNode
: only store weak pointer to its elements;NodeWithId
: node which contains a unique identifier;SetNode
: optimized for sets, doesn't store theunit
value;WeakSetNode
: both aWeakNode
and aSetNode
HashconsedNode
: performs hash-consing (it also stores a unique identifier, but checks when building a new node whether a node with similar content already exists);HashconsedSetNode
: both aHashconsedNode
and aSetNode
.
Use the functors
MakeCustomMap
andMakeCustomSet
(or their heterogeneous versionsMakeCustomHeterogeneousMap
andMakeCustomHeterogeneousSet
) to build maps using these nodes, or any other custom nodes.
Examples
To use this library, install it and add the following to your dune files:
(executable ; or library
...
(libraries patricia-tree ...)
)
Homogeneous map
Here is a small example of a non-generic map:
Start by creating a key module:
module IntKey : PatriciaTree.KEY with type t = int = struct type t = int let to_int x = x end
Use it to instanciate the map/set functors:
module IMap : PatriciaTree.MAP with type key = int = PatriciaTree.MakeMap(IntKey);; module ISet : PatriciaTree.SET with type elt = int = PatriciaTree.MakeSet(IntKey);;
You can now use it as you would any other map:
# let map = IMap.empty |> IMap.add 1 "hello" |> IMap.add 2 "world" |> IMap.add 3 "how do you do?";; val map : string IMap.t = <abstr>
(We also have
of_list
andof_seq
functions for quick initialization)# IMap.find 1 map;; - : string = "hello" # IMap.cardinal map;; - : int = 3
The strength of Patricia Tree is the speedup of operations on multiple maps with common subtrees. For example, in the following, the
idempotent_inter_filter
function will skip recursive calls to physically equal subtrees (kept as-is in the intersection). This allows faster thanO(n)
intersections.# let map2 = IMap.idempotent_inter_filter (fun _key _l _r -> None) (IMap.add 4 "something" map) (IMap.add 5 "something else" map);; val map2 : string IMap.t = <abstr> # map == map2;; - : bool = true
Physical equality is preserved as much as possible, although some intersections may need to build new nodes and won't be fully physically equal, they will still share some subtrees.
# let str = IMap.find 1 map;; val str : string = "hello" # IMap.add 1 str map == map (* already present *);; - : bool = true # IMap.add 1 "hello" map == map (* new string copy isn't physically equal to the old one *);; - : bool = false
Note that physical equality isn't preserved when creating new copies of values (the newly created string
"hello"
isn't physically equal tostr
). It can also fail when maps have the same bindings but were created differently:# let map3 = IMap.remove 2 map;; val map3 : string IMap.t = <abstr> # IMap.add 2 (IMap.find 2 map) map3 == map;; - : bool = false
If you want to maintain full physical equality (and thus get cheap equality test between maps), use the provided hash-consed maps and sets.
Our library also allows cross map/set operations through the
WithForeign
functors:module CrossOperations = IMap.WithForeign(ISet.BaseMap)
For example, you can only keep the bindings of
map
whose keys are in a given set:# let set = ISet.of_list [1; 3];; val set : ISet.t = <abstr> # let restricted_map = CrossOperations.nonidempotent_inter { f = fun _key value () -> value } map set;; val restricted_map : string IMap.t = <abstr> # IMap.to_list map;; - : (int * string) list = [(1, "hello"); (2, "world"); (3, "how do you do?")] # IMap.to_list restricted_map;; - : (int * string) list = [(1, "hello"); (3, "how do you do?")]
Heterogeneous map
Heterogeneous maps work very similarly to homogeneous ones, but come with extra liberty of having a generic type as a key.
Here is a GADT example to use for our keys: a small typed expression language.
type 'a expr = | G_Const_Int : int -> int expr | G_Const_Bool : bool -> bool expr | G_Addition : int expr * int expr -> int expr | G_Equal : 'a expr * 'a expr -> bool expr
We can create our
HETEROGENEOUS_KEY
functor parameter using this type has follows:module Expr : PatriciaTree.HETEROGENEOUS_KEY with type 'a t = 'a expr = struct type 'a t = 'a expr (** Injective, so long as expressions are small enough (encodes the constructor discriminant in two lowest bits). Ideally, use a hash-consed type, to_int needs to be fast *) let rec to_int : type a. a expr -> int = function | G_Const_Int i -> 0 + 4*i | G_Const_Bool b -> 1 + 4*(if b then 1 else 0) | G_Addition(l,r) -> 2 + 4*(to_int l mod 10000 + 10000*(to_int r)) | G_Equal(l,r) -> 3 + 4*(to_int l mod 10000 + 10000*(to_int r)) (** Full polymorphic equality *) let rec polyeq : type a b. a expr -> b expr -> (a, b) PatriciaTree.cmp = fun l r -> match l, r with | G_Const_Int l, G_Const_Int r -> if l = r then Eq else Diff | G_Const_Bool l, G_Const_Bool r -> if l = r then Eq else Diff | G_Addition(ll, lr), G_Addition(rl, rr) -> ( match polyeq ll rl with | Eq -> polyeq lr rr | Diff -> Diff) | G_Equal(ll, lr), G_Equal(rl, rr) -> ( match polyeq ll rl with | Eq -> (match polyeq lr rr with Eq -> Eq | Diff -> Diff) (* Match required by typechecker *) | Diff -> Diff) | _ -> Diff end
We can now instanciate our map functor. Note that in the heterogeneous case, we must also specify the value type (second functor argument) and how it depends on the key type (first parameter) and the map type (second parameter). Here the value only depends on the type of the key, not that of the map
module EMap = PatriciaTree.MakeHeterogeneousMap(Expr)(struct type ('a, _) t = 'a end)
You can now use this as you would any other dependent map:
# let map : unit EMap.t = EMap.empty |> EMap.add (G_Const_Bool false) false |> EMap.add (G_Const_Int 5) 5 |> EMap.add (G_Addition (G_Const_Int 3, G_Const_Int 6)) 9 |> EMap.add (G_Equal (G_Const_Bool false, G_Equal (G_Const_Int 5, G_Const_Int 7))) true val map : unit EMap.t = <abstr> # EMap.find (G_Const_Bool false) map;; - : bool = false # EMap.find (G_Const_Int 5) map;; - : int = 5 # EMap.cardinal map;; - : int = 4
Physical equality preservation allows fast operations on multiple maps with common ancestors. In the heterogeneous case, these functions are a bit more complex since OCaml requires that first-order polymorphic functions be wrapped in records:
# let map2 = EMap.idempotent_inter_filter { f = fun _key _l _r -> None } (* polymorphic 1rst order functions are wrapped in records *) (EMap.add (G_Const_Int 0) 8 map) (EMap.add (G_Const_Int 0) 9 map) val map2 : unit EMap.t = <abstr>
Even though
map
andmap2
have the same elements, they may not always be physically equal:# map == map2;; - : bool = false
This is because they were created through different processes. They will still share subtrees. If you want to maintain full physical equality (and thus get cheap equality test between maps), use the provided hash-consed maps and sets.
Release status
This should be close to a stable release. It is already being used as part of a larger project successfully, and this usage as helped us mature the interface. As is, we believe the project is usable, and we don't anticipate any major change before 1.0.0. We didn't commit to a stable release straight away as we would like a bit more time using this library before doing so.
Known issues
There is a bug in the OCaml typechecker which prevents us from directly defining non-generic maps as instances of generic maps. To avoid this, non-generic maps use a separate value type ('a, 'b) snd
(instead of just using 'b
)
type (_, 'b) snd = Snd of 'b [@@unboxed]
It should not incur any extra performance cost as it is unboxed, but can appear when manipulating non-generic maps.
For more details about this issue, see the OCaml discourse discussion.
Comparison to other OCaml libraries
ptmap and ptset
There are other implementations of Patricia Tree in OCaml, namely ptmap and ptset, both by J.C. Filliatre. These are smaller and closer to OCaml's built-in Map
and Set
, however:
- Our library allows using any type
key
that comes with an injectiveto_int
function, instead of requiringkey = int
. - We support generic types for keys/elements.
- We support operations between sets and maps of different types.
- We use a big-endian representation, allowing easy access to min/max elements of maps and trees.
- Our interface and implementation tries to maximize the sharing between different versions of the tree, and to benefit from this memory sharing. Theirs do not.
- These libraries work with older version of OCaml (
>= 4.05
I believe), whereas ours requires OCaml>= 4.14
(for the new interface ofEphemeron
used inWeakNode
).
dmap
Additionally, there is a dependent map library: dmap, which gave us the idea of making our PatriciaTree dependent. It allows creating type safe dependent maps similar to our heterogeneous maps. However, its maps aren't Patricia trees. They are binary trees build using a (polymorphic) comparison function, similarly to the maps of the standard library.
Another difference is that the type of values in the map is independent from the type of the keys, allowing keys to be associated with different values in different maps. i.e. we map 'a key
to any ('a, 'b) value
type, whereas dmap only maps 'a key
to 'a
or 'a value
.
dmap
also works with OCaml >= 4.12
, whereas we require OCaml >= 4.14
.
Contributions and bug reports
Any contributions are welcome!
You can report any bug, issues, or desired features using the Github issue tracker. Please include OCaml, dune, and library version information in you bug reports.
If you want to contribute code, feel free to fork the repository on Github and open a pull request. By doing so you agree to release your code under this project's license (LGPL-2.1).
There is no imposed coding style for this repository, here are just a few guidelines and conventions:
- Module type names should use
SCREAMING_SNAKE_CASE
. - Module and functor names use
PascalCase
, functors names start withMake
. - Even though the library implements homogeneous maps as a specialization of heterogeneous ones, the naming convention is that no prefix means homogeneous, and all heterogeneous objects are prefixed with
heterogeneous
. - Please document any new functions in the interface, using ocamldoc style comments.
- Please consider adding test for new features/fixed bugs if at all possible. This library uses a QuickCheck framework for tests.