From 7b3095ac12a38f4031148778298535b03481e821 Mon Sep 17 00:00:00 2001
From: Merope Riddle <merope07@protonmail.com>
Date: Sat, 5 Nov 2016 20:30:40 +0000
Subject: [PATCH] Add design document for Merkle tree structures

---
 doc/merkle.md | 157 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 157 insertions(+)
 create mode 100644 doc/merkle.md

diff --git a/doc/merkle.md b/doc/merkle.md
new file mode 100644
index 000000000..e424a2890
--- /dev/null
+++ b/doc/merkle.md
@@ -0,0 +1,157 @@
+# Merkle Structures
+
+MimbleWimble is designed for users to verify the state of the system given
+only pruned data. To achieve this goal, all transaction data is committed
+to the blockchain by means of Merkle trees which should support efficient
+updates and serialization even when pruned.
+
+Also, almost all transaction data (inputs, outputs, excesses and excess
+proofs) can be summed in some way, so it makes sense to treat Merkle sum
+trees as the default option, and address the sums here.
+
+A design goal of Grin is that all structures be as easy to implement and
+as simple as possible. MimbleWimble introduces a lot of new cryptography
+so it should be made as easy to understand as possible. Its validation rules
+are simple to specify (no scripts) and Grin is written in a language with
+very explicit semantics, so simplicity is also good to achieve well-understood
+consensus rules.
+
+## Merkle Trees
+
+There are four Merkle trees committed to by each block:
+
+### Total Output Set
+
+Each object is one of two things: a commitment indicating an unspent output
+or a NULL marker indicating a spent one. It is a sum-tree over all unspent
+outputs (spent ones contribute nothing to the sum). The output set should
+reflect the state of the chain *after* the current block has taken effect.
+
+The root sum should be equal to the sum of all excesses since the genesis.
+
+Design requirements:
+
+1. Efficient additions and updating from unspent to spent.
+2. Efficient proofs that a specific output was spent.
+3. Efficient storage of diffs between UTXO roots.
+4. Efficient tree storage even with missing data, even with millions of entries.
+5. If a node commits to NULL, it has no unspent children and it should
+   eventually be possible to drop its data forever.
+6. Support serialization and efficient merging of pruned trees from partial
+   archival nodes.
+
+### Output Witnesses
+
+This tree mirrors the total output set but has rangeproofs in place of commitments.
+It is never updated, only appended to, and does not sum over anything. When an
+output is spent it is sufficient to prune its rangeproof from the tree rather
+than deleting it.
+
+Design requirements:
+
+1. Support serialization and efficient merging of pruned trees from partial
+   archival nodes.
+
+### Inputs and Outputs
+
+Each object is one of two things: an input (unambiguous reference to an old
+transaction output), or an output (a (commitment, rangeproof) pair). It is
+a sum-tree over the commitments of outputs, and the negatives of the commitments
+of inputs.
+
+Input references are hashes of old commitments. It is a consensus rule that
+there are never two identical unspent outputs.
+
+The root sum should be equal to the sum of excesses for this block. See the
+next section.
+
+In general, validators will see either 100% of this Merkle tree or 0% of it,
+so it is compatible with any design. Design requirements:
+
+1. Efficient inclusion proofs, for proof-of-publication.
+
+### Excesses
+
+Each object is of the form (excess, signature). It is a sum tree over the
+excesses.
+
+In general, validators will always see 100% of this tree, so it is not even
+necessary to have a Merkle structure at all. However, to support partial
+archival nodes in the future we want to support efficient pruning.
+
+Design requirements:
+
+1. Support serialization and efficient merging of pruned trees from partial
+   archival nodes.
+
+
+## Proposed Merkle Structure
+
+**The following design is proposed for all trees: a sum-MMR where every node
+sums a count of its children _as well as_ the data it is supposed to sum.
+The result is that every node commits to the count of all its children.**
+
+For background, see [MMRs, or Merkle Mountain Ranges](https://github.com/opentimestamps/opentimestamps-server/blob/master/doc/merkle-mountain-range.md).
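+
+As a rough illustration of the proposal above, the following Rust sketch shows
+how a parent node might combine its two children. The names `Summary`, `leaf`,
+`parent` and `toy_hash` are hypothetical, the integer `sum` stands in for a
+summed commitment, each leaf is assumed to count as one, and `DefaultHasher`
+is only a placeholder for a real cryptographic hash.
+
+```rust
+use std::collections::hash_map::DefaultHasher;
+use std::hash::{Hash, Hasher};
+
+/// Hypothetical node summary: a hash plus the two quantities every node sums.
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+pub struct Summary {
+    pub hash: u64,  // stand-in for a real cryptographic hash
+    pub sum: i64,   // stand-in for a summed commitment
+    pub count: u64, // number of leaves beneath this node
+}
+
+/// Placeholder hash; DefaultHasher is NOT cryptographically secure.
+fn toy_hash<T: Hash>(value: &T) -> u64 {
+    let mut hasher = DefaultHasher::new();
+    value.hash(&mut hasher);
+    hasher.finish()
+}
+
+/// A leaf contributes its own value to the sum and (by assumption) a count of one.
+pub fn leaf(value: i64) -> Summary {
+    Summary { hash: toy_hash(&("leaf", value)), sum: value, count: 1 }
+}
+
+/// A parent commits to both child hashes, the combined sum and the combined count.
+pub fn parent(left: &Summary, right: &Summary) -> Summary {
+    let sum = left.sum + right.sum;
+    let count = left.count + right.count;
+    Summary { hash: toy_hash(&(left.hash, right.hash, sum, count)), sum, count }
+}
+```
+
+Because the count is part of the hashed data in this sketch, a node cannot
+claim a different number of children without changing its hash.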
+
+The six design criteria for the output set are addressed as follows:
+
+### Efficient insert/updates
+
+Immediate (as is proof-of-inclusion). This is true for any balanced Merkle
+tree design.
+
+### Efficient proof-of-spentness
+
+Grin itself does not need proof-of-spentness but it is a good thing to support
+in the future for SPV clients.
+
+The children-counts imply an index of each object in the tree, which does not
+change because insertions happen only at the far right of the tree.
+
+This allows permanent proof-of-spentness, even if an identical output is later
+added to the tree, and prevents false proofs even for identical outputs. These
+properties are hard to achieve for a non-insertion-ordered tree.
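+
+The index implied by the children-counts can be sketched as follows, assuming
+(hypothetically) that a proof records, for every step down from the root,
+whether it went left or right and, when going right, the count committed to by
+the left sibling. `Step` and `leaf_index` are illustrative names, and the
+sketch treats the structure as a single binary tree rather than a set of MMR
+peaks.
+
+```rust
+/// One step of a root-to-leaf path (hypothetical representation).
+pub enum Step {
+    /// Descend into the left child; no leaves are skipped.
+    Left,
+    /// Descend into the right child, skipping every leaf under the left sibling.
+    Right { left_count: u64 },
+}
+
+/// Zero-based insertion index of the leaf reached by following `path`.
+pub fn leaf_index(path: &[Step]) -> u64 {
+    path.iter()
+        .map(|step| match step {
+            Step::Left => 0,
+            Step::Right { left_count } => *left_count,
+        })
+        .sum()
+}
+```
+
+In a tree of eight leaves, going right (skipping 4), left, then right
+(skipping 1) reaches the leaf at index 5.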
+
+### Efficient storage of diffs
+
+Storing complete blocks should be sufficient for this. Updates are obviously
+as easy to undo as they are to do, and since blocks are always processed in
+order, rewinding them during reorgs is as simple as removing a contiguous
+set of outputs from the right of the tree. (This should be even faster than
+repeated deletions in a tree designed to support deletions.)
+
+### Efficient tree storage even with missing data
+
+To update the root hash when random outputs are spent, we should not have to
+store or compute the entire tree. Instead we can store only the hashes at
+depth 20, say, of which there will be at most about a million (2^20). Then each
+update only needs to recompute hashes deeper than this (Bitcoin has fewer than
+2^29 outputs in its history, so this means computing a tree of size 2^9 = 512
+for each update), and after all updates are done, the root hash can be recomputed.
+
+This depth is configurable and may be changed as the output set grows, or
+depending on available disk space.
+
+This is doable for any Merkle tree but may be complicated by PATRICIA trees or
+other prefix trees, depending on how depth is computed.
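+
+The arithmetic in this section can be checked with a small sketch. The function
+names are illustrative, and the tree is assumed to be a perfect binary tree,
+which an MMR only approximates.
+
+```rust
+/// Number of hashes kept when one full level at `cache_depth` is stored.
+pub fn cached_hashes(cache_depth: u32) -> u64 {
+    1u64 << cache_depth
+}
+
+/// Number of leaves under a single stored hash in a tree of `tree_height`.
+/// If the cache level is at or below the leaves, each stored hash covers one leaf.
+pub fn subtree_leaves(tree_height: u32, cache_depth: u32) -> u64 {
+    1u64 << tree_height.saturating_sub(cache_depth)
+}
+```
+
+With the figures used above, `cached_hashes(20)` is 1048576 and
+`subtree_leaves(29, 20)` is 512.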
+
+### Dropping spent coins
+
+Since coins never go from spent to unspent, the data on spent coins is not needed
+for any more updates or lookups.
+
+### Efficient serialization of pruned trees
+
+Since every node has a count of its children, validators can determine the
+structure of the tree without needing all the hashes, and can determine which
+nodes are siblings, and so on.
+
+In the output set each node also commits to a sum of its unspent children, so
+a validator knows if it is missing data on unspent coins, by checking whether
+this sum on a pruned node is zero or not.
+
+
+## Algorithms
+
+(To appear alongside an implementation.)
+