Smart Contract Scalability With Block-STM’s Fast Blockchain Commit Process | Chainlink’s Parallel Execution Engine Integration With Meta’s Diem

Chainlink recently announced a new technology, Block-STM: a robust and stable implementation that accelerates smart-contract execution and emanates from the Diem project. Block-STM is compatible with existing blockchains and requires no modification or adoption by miners.

What is Block-STM?

Block-STM is a parallel execution engine for smart contracts, built around the principles of Software Transactional Memory. Transactions are grouped in blocks, each block containing a pre-ordered sequence of transactions TX1, TX2, …, TXn. Transactions consist of smart-contract code that reads and writes to shared memory, and their execution produces a read-set and a write-set: the read-set consists of pairs of a memory location and the transaction that wrote it, and the write-set consists of pairs of a memory location and the value the transaction would record if it were committed.
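
To make this bookkeeping concrete, here is a minimal Python sketch of how a read-set and write-set might be represented. The type and field names are illustrative assumptions for this post, not the actual Diem/Block-STM data structures.

    from dataclasses import dataclass, field
    from typing import List, Optional

    # Illustrative sketch: a read-set entry records which transaction's write was
    # observed at a memory location; a write-set entry records the value the
    # transaction would record if it were committed.

    @dataclass(frozen=True)
    class ReadEntry:
        location: str              # memory location that was read
        writer_txn: Optional[int]  # transaction whose write was observed
                                   # (None if the pre-block value was read)

    @dataclass(frozen=True)
    class WriteEntry:
        location: str              # memory location to be written
        value: object              # value recorded if the transaction commits

    @dataclass
    class ExecutionResult:
        txn_index: int
        read_set: List[ReadEntry] = field(default_factory=list)
        write_set: List[WriteEntry] = field(default_factory=list)

    # Example: TX2 observed TX1's write at "balances/alice" and would itself
    # write "balances/bob" if committed.
    result = ExecutionResult(
        txn_index=2,
        read_set=[ReadEntry("balances/alice", writer_txn=1)],
        write_set=[WriteEntry("balances/bob", value=50)],
    )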

Block-STM Origin: Parallel Execution & Software Transactional Memory Implementation In Databases

An approach pioneered in the Calvin 2012 and Bohm 2014 projects in the context of distributed databases is the foundation of much of what follows. The insightful idea in those projects is to simplify concurrency management by disseminating pre-ordered batches (akin to blocks) of transactions along with pre-estimates of their read- and write-sets. Every database partition can then autonomously execute transactions according to the block pre-order, each transaction waiting only for read dependencies on earlier transactions in the block. The first DiemVM parallel executor implements this approach, but it relies on a static transaction analyzer to pre-estimate read/write-sets, which is time-consuming and can be inexact.

Another work, by Dickerson et al. 2017, provides a link from traditional database concurrency to smart-contract parallelism. In that work, a consensus leader (or miner) pre-computes a parallel execution serialization by harnessing optimistic software transactional memory (STM) and disseminates the pre-execution scheduling guidelines to all validator nodes. Later works, including ParBlockchain and OptSmart 2021, add read/write-set dependency tracking during pre-execution and disseminate this information to increase parallelism. These approaches remove the reliance on static transaction analysis but require a leader to pre-execute blocks.

The Block-STM parallel executor combines the pre-ordered block idea with optimistic STM to enforce the block pre-order of transactions on the fly, completely removing the need to pre-disseminate an execution schedule or pre-compute transaction dependencies, while guaranteeing repeatability.

Technical overview

Block pre-order

A parallel execution of the block must yield the same deterministic outcome as a sequential execution that follows the block pre-order; in particular, it must produce exactly the same read/write-sets. If, in a sequential execution, TXk reads a value that TXj wrote (i.e., TXj is the highest transaction preceding TXk that writes to this particular memory location), we denote this by:

   TXj → TXk 

Running example

The following scenario will be used throughout this post to illustrate parallel execution strategies and their effects.

   A block B consisting of ten transactions, TX1, TX2, ..., TX10, 
    with the following read/write dependencies:       
    TX1 → TX2 → TX3 → TX4                
    TX3 → TX6      
    TX3 → TX9

To illustrate execution timelines, we will assume a system with four parallel threads, and for simplicity, that each transaction takes exactly one time-unit to process.

If we knew the above block dependencies in advance, we could schedule an ideal execution of block B along the following time-steps:

  1. Parallel execution of TX1, TX5, TX7, TX8  
  2. Parallel execution of TX2, TX10  
  3. Parallel execution of TX3     
  4. Parallel execution of TX4, TX6, TX9
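
With the dependencies known in advance, the ideal schedule above amounts to greedy scheduling over the dependency graph with a four-thread cap. The short Python sketch below is purely illustrative (the deps map, n, and threads simply encode the running example) and reproduces the time-steps listed above.

    # Greedy schedule of a block with known dependencies over a fixed number of
    # threads, each transaction taking one time unit (illustrative sketch only).
    deps = {2: {1}, 3: {2}, 4: {3}, 6: {3}, 9: {3}}  # TXk -> {TXj : TXj -> TXk}
    n, threads = 10, 4

    done, step = set(), 0
    while len(done) < n:
        # transactions whose dependencies have all finished and are not yet done
        ready = [k for k in range(1, n + 1)
                 if k not in done and deps.get(k, set()) <= done]
        batch = ready[:threads]       # at most `threads` transactions per time unit
        done |= set(batch)
        step += 1
        print(f"step {step}: execute {batch}")

Running it prints the same four batches as above: (1, 5, 7, 8), (2, 10), (3), and (4, 6, 9).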

Optimistic Scheduling

What if dependencies are not known in advance? A correct parallel execution must guarantee that all transactions read values that adhere to the block dependencies. That is, when TXk reads from memory, it must obtain the value written by TXj, where TXj → TXk, if such a dependency exists, or the initial value at that memory location when the block execution started, if none exists.

Block-STM ensures this by employing an optimistic “speculate-validate-redo” approach, executing transactions greedily and optimistically in parallel and then validating, repeating if validation fails. Validation of TXk re-reads the read-set of TXk and compares against the original read-set that TXk obtained in its latest execution. If the comparison fails, TXk aborts and re-executes.
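
The comparison logic of that validation step can be sketched as follows. The currently_visible_writer helper is a hypothetical stand-in for the lookup into the multi-version memory described later; this is not the Diem implementation, only the check described above.

    from collections import namedtuple

    # A captured read-set entry: which transaction's write TXk observed at a location.
    ReadDescriptor = namedtuple("ReadDescriptor", ["location", "writer_txn"])

    def validate(k, captured_read_set, currently_visible_writer):
        # Re-read every location in TXk's read-set and check that the same writer
        # is still the one visible to TXk; if not, TXk must abort and re-execute.
        for entry in captured_read_set:
            if currently_visible_writer(entry.location, reader=k) != entry.writer_txn:
                return False   # a lower transaction's (re-)execution changed what TXk read
        return True            # TXk's speculative execution is still consistent

    # Example: TX4 recorded that it read TX3's write to "x"; validation passes as
    # long as TX3's write is still the one visible to TX4.
    visible = {"x": 3}
    assert validate(4, [ReadDescriptor("x", 3)], lambda loc, reader: visible[loc])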

The key insight is that pre-ordering greatly simplifies optimism!

  • If validation fails, only higher transactions necessitate revalidation (VALIDAFTER)
  • Reads obtain the value written by the highest preceding transaction, not the last written value (READHIGH)

With this insight, we first consider a straightforward strawman approach. The strawman algorithm uses a centralized dispatcher that coordinates work by parallel threads.

Strawman (S-1) Algorithm

   // Phase 1:      
    dispatch all TXs for execution in parallel; wait for completion 
 
   // Phase 2: 
    repeat {
        dispatch all TXs higher than the last failed for read-set validation in parallel; wait for completion
    } until all read-set validations pass

    read-set validation of TXj {
        re-read TXj read-set 
        if read-set differs from original read-set of the latest TXj execution 
            re-execute TXj
    }

    execution of TXj {
        (re-)process TXj, generating a read-set and write-set
    }

S-1 operates in two master-coordinated phases. Phase 1 executes all transactions optimistically in parallel. Phase 2 repeatedly validates, optimistically and in parallel, all transactions higher than the last one that failed, re-executing those that fail, until there are no more read-set validation failures.
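
The control flow of S-1 can be sketched in Python as follows. This is a simplified stand-in rather than the algorithm's actual code: execute and validate are caller-supplied callbacks with hypothetical signatures (validate here only reports success or failure), failed transactions are re-executed sequentially for brevity, and the parallel dispatch is approximated with a thread pool.

    from concurrent.futures import ThreadPoolExecutor

    def strawman_s1(n, execute, validate, workers=4):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Phase 1: optimistically execute every transaction in parallel.
            list(pool.map(execute, range(1, n + 1)))

            # Phase 2: repeatedly validate; re-execute transactions that fail.
            first = 1                        # validate everything on the first pass
            while True:
                results = list(pool.map(validate, range(first, n + 1)))
                failed = [first + i for i, ok in enumerate(results) if not ok]
                if not failed:
                    return                   # no more read-set validation failures
                for j in failed:             # re-execute failures before revalidating
                    execute(j)
                # only transactions higher than the lowest failure need revalidation
                first = failed[0] + 1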

Recall our running example, a block B with dependencies TX1 → TX2 → TX3 → {TX4, TX6, TX9}, running with four threads, each transaction taking one time unit (and neglecting all other computation time). A possible execution of S-1 over Block B would proceed along the following time-steps:

  1. Phase 1 starts. parallel execution of TX1, TX2, TX3, TX4   
  2. Parallel execution of TX5, TX6, TX7, TX8   
  3. Parallel execution of TX9, TX10   
  4. Phase 2 starts. In the first loop iteration, parallel read-set validation of all transactions in which TX2, TX3, TX4, TX6 fail and re-execute   
  5. Continued parallel read-set validation of all transactions in which TX9 fails and re-executes   
  6. In the next phase-2 loop iteration, parallel read-set validations of all transactions in which TX3, TX4, TX6 fail and re-execute   
  7. In the next phase-2 loop iteration, parallel read-set validations of all transactions in which TX4, TX6, TX9 fail and re-execute   
  8. In the last phase-2 loop iteration, parallel read-set validations of all transactions succeed

The Block-STM Algorithm

In the full Block-STM solution, two fundamental improvements are introduced over the above strawman.

Task stealing. The first improvement is to replace both phases with parallel task-stealing by threads. It follows the insight from S-1, distinguishing between a preliminary execution (corresponding to phase 1) and re-execution (following a validation abort). Stealing is coordinated via two synchronization counters, one per task type, nextPrelimExecution (initially 1) and nextValidation (initially n+1). 

Stealing enables validation of higher transactions to start immediately upon completion or when lower transactions fail validation. Every (re-)execution of TXj guarantees that read-set validation of all higher transactions will be dispatched by decreasing nextValidation back to j+1.

Primitive dependency tracking. The second improvement is an extremely simple dependency-tracking mechanism (no graphs or partial orders) that considerably reduces aborts. When TXj aborts, the write-set of its latest invocation is marked ABORTED. Higher transactions simply suspend upon reading an ABORTED mark, waiting until a re-execution of TXj overwrites it. This is another demonstration of the transaction pre-order simplifying matters substantially.
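
A minimal sketch of this suspension mechanism is shown below, assuming one condition-variable-guarded entry per written location; the class and method names are illustrative, not the Diem code.

    import threading

    ABORTED = object()   # sentinel stored in place of a real value

    class WriteSetEntry:
        """A write-set entry of TXj that higher transactions may read."""

        def __init__(self, value):
            self._value = value
            self._cond = threading.Condition()

        def mark_aborted(self):
            # called when the writing transaction fails validation
            with self._cond:
                self._value = ABORTED

        def write(self, value):
            # called when the writing transaction re-executes
            with self._cond:
                self._value = value
                self._cond.notify_all()   # wake readers suspended below

        def read(self):
            # called by a higher transaction; suspends while the entry is ABORTED
            with self._cond:
                while self._value is ABORTED:
                    self._cond.wait()
                return self._value

Keeping the aborted entry in place, rather than discarding it, is what lets a dependent transaction discover the dependency and wait instead of speculating again and aborting.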

A high-level thread loop with task stealing is the following:

  • if available, steal next validation task TXj
  • else, steal next prelim-execution task TXj

Supporting dependency management via ABORTED tagging and early re-validation of aborted tasks is done as follows:

  • If validation fails, mark the write-set ABORTED and immediately decrease nextValidation 
  • If a re-execution's write-set differs from the original, decrease nextValidation again

A more precise description is given in (less than 20 lines of) the pseudo code below. For full details, see the whitepaper.

 Block-STM Algorithm

   // per thread main loop: 
    repeat {
        task := "NA"
        // if available, steal next read-set validation task
        atomic { 
            if nextValidation < nextPrelimExecution    
                (task, j) := ("validate", nextValidation)
                nextValidation.increment();
        } 
        if task = "validate"
            validate TXj 

        // if available, steal next execution task
        else atomic { 
            if nextPrelimExecution <= n         
                (task, j) := ("execute", nextPrelimExecution)
                nextPrelimExecution.increment();
        }
        if task = "execute"
            execute TXj 
            validate TXj 
    } until nextPrelimExecution > n, nextValidation > n, and no task is still running

    read-set validation of TXj {
        re-read TXj read-set 

        if read-set differs from original read-set of the latest TXj execution 
            mark the TXj write-set ABORTED
            atomic { nextValidation := min(nextValidation, j+1) }
            re-execute TXj
    }

    execution of TXj {
        (re-)process TXj, generating a read-set and write-set
        resume any TX waiting for TXj's ABORTED write-set
        if the new TXj write-set contains locations not marked ABORTED
            atomic { nextValidation := min(nextValidation, j+1) }
    }
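
As a rough illustration of the scheduler skeleton, the Python sketch below mirrors the two atomic counter blocks and the per-thread loop of the pseudo code. The execute_txn and validate_txn hooks are hypothetical and left to the caller; per the pseudo code, they are responsible for marking write-sets ABORTED and calling schedule_revalidation_from when a validation fails or a re-execution produces a new write-set.

    import threading

    class Scheduler:
        def __init__(self, n):
            self.n = n
            self.lock = threading.Lock()
            self.next_prelim_execution = 1   # nextPrelimExecution in the pseudo code
            self.next_validation = n + 1     # nextValidation in the pseudo code
            self.in_flight = 0               # tasks currently being processed

        def next_task(self):
            # atomically steal the next task: validation first, then execution
            with self.lock:
                if self.next_validation < self.next_prelim_execution:
                    j = self.next_validation
                    self.next_validation += 1
                    self.in_flight += 1
                    return ("validate", j)
                if self.next_prelim_execution <= self.n:
                    j = self.next_prelim_execution
                    self.next_prelim_execution += 1
                    self.in_flight += 1
                    return ("execute", j)
                return (None, None)

        def finish_task(self):
            with self.lock:
                self.in_flight -= 1

        def schedule_revalidation_from(self, j):
            # after TXj aborts or writes new locations, revalidate TXj+1 and above
            with self.lock:
                self.next_validation = min(self.next_validation, j + 1)

        def done(self):
            # mirrors the loop exit condition of the pseudo code
            with self.lock:
                return (self.next_prelim_execution > self.n
                        and self.next_validation > self.n
                        and self.in_flight == 0)

    def worker(sched, execute_txn, validate_txn):
        # per-thread main loop; busy-waits between tasks for simplicity
        while not sched.done():
            kind, j = sched.next_task()
            if kind is None:
                continue
            try:
                if kind == "validate":
                    validate_txn(j)
                else:
                    execute_txn(j)
                    validate_txn(j)
            finally:
                sched.finish_task()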

Block-STM enhances efficiency through simple, on-the-fly dependency management using the ABORTED tag. For our running example of block B, an execution with four threads may avoid several of the re-executions incurred in the strawman scenario by waiting on an ABORTED mark. Despite the high contention in block B, a possible execution of Block-STM may achieve very close to optimal scheduling, as shown below. A possible execution over Block B (recall, TX1 → TX2 → TX3 → {TX4, TX6, TX9}) would proceed along the following time-steps:

  1. Parallel execution of TX1, TX2, TX3, TX4; read-set validations of TX2, TX3, TX4 fail; nextValidation set to 3   
  2. Parallel execution of TX2, TX5, TX7, TX8; executions of TX3, TX4, TX6 are suspended on ABORTED; nextValidation set to 6   
  3. Parallel execution of TX3, TX10; executions of TX4, TX6, TX9 are suspended on ABORTED; all read-set validations succeed (for now)   
  4. Parallel execution of TX4, TX6, TX9; all read-set validations succeed

Correctness

Correct optimism revolves around maintaining two principles:

  • READHIGH(k): Whenever TXk executes (speculatively), a read by TXk obtains the value recorded so far by the highest transaction TXj preceding it, i.e., where j < k. Higher transactions TXl, where l > k, do not interfere with TXk.
  • VALIDAFTER(j, k): For every j,k, such that j < k, a validation of TXk’s read-set is performed every time TXj executes (or re-executes).

Jointly, these two principles suffice to guarantee both safety and liveness, no matter how transactions are scheduled. Briefly, safety follows because TXk gets validated after all TXj, j < k, are finalized. Liveness follows by induction. Initially, TX1 is guaranteed to pass read-set validation successfully and not require re-execution. After all transactions from TX1 to TXj have successfully validated, a (re-)execution of TXj+1 will pass read-set validation successfully and not require re-execution.

READHIGH(k) is implemented in Block-STM via a simple multi-version in-memory data structure that keeps versioned write-sets. A write by TXj is recorded with version j. A read by TXk obtains the value recorded by the latest invocation of TXj with the highest j < k.

The special value ABORTED may be stored at version j when the latest invocation of TXj aborts. If TXk reads this value, it suspends and resumes when the value becomes set.
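
A minimal sketch of this version-selection rule, assuming a per-location map from writer index to value (names are illustrative; the ABORTED handling sketched earlier is omitted here for brevity):

    class MultiVersionMemory:
        def __init__(self, base):
            self.base = dict(base)    # pre-block state
            self.versions = {}        # location -> {writer transaction index: value}

        def write(self, j, location, value):
            # a write by TXj is recorded with version j
            self.versions.setdefault(location, {})[j] = value

        def read(self, k, location):
            # a read by TXk obtains the value recorded by the highest writer j < k,
            # or the pre-block value if no such writer exists
            writers = self.versions.get(location, {})
            lower = [j for j in writers if j < k]
            if not lower:
                return None, self.base.get(location)
            j = max(lower)
            return j, writers[j]

    # Example: TX1 and TX3 both write "x"; TX2 sees TX1's value, TX9 sees TX3's.
    mem = MultiVersionMemory({"x": 0})
    mem.write(1, "x", 10)
    mem.write(3, "x", 30)
    assert mem.read(2, "x") == (1, 10)
    assert mem.read(9, "x") == (3, 30)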

VALIDAFTER(j, k) is implemented by scheduling, for each TXj, a read-set validation of TXk, for every k > j, after TXj completes (re-)executing. The interaction with early re-validation is slightly subtle. Suppose that TXj → TXk. Recall that when TXj fails, Block-STM immediately schedules (re-)validations of TXk, k > j, before TXj completes re-execution. There are two possible cases. If a TXk validation reads an ABORTED value of TXj, it will wait for TXj to complete; and if it reads a value that is not marked ABORTED and the TXj re-execution produces a new write-set, then TXk will be forced to revalidate again.

Benefits of Block-STM

Simplicity is a virtue of Block-STM, not a failing, enabling a robust and stable implementation. Through a careful combination of simple, known techniques, applied to a pre-ordered block of transactions that commits in bulk, Block-STM:

  • enables effective speedup of smart contract parallel processing
  • enables repeatable execution (required)
  • makes STM practical thanks to linear dependencies/validation 
  • commits in bulk per block (rather than per transaction)
  • benefits clients independently 
  • can work with/over existing blockchains without requiring adoption by all/other validator nodes

Block-STM has been integrated within the Diem blockchain core (https://github.com/diem/) and evaluated on synthetic transaction workloads, yielding over 17x speedup on 32 cores under low/modest contention.

Author

Chris Munch

Chris Munch is a professional cryptocurrency and blockchain writer with a background in software businesses who has been involved in marketing within the cryptocurrency space. With a passion for innovation, Chris brings a unique and insightful perspective to the world of crypto and blockchain. He has a deep understanding of the economic, psychological, marketing, and financial forces that drive the crypto market, and has made a number of accurate calls on major shifts in market trends. He is constantly researching and studying the latest trends and technologies, ensuring that he is always up to date on the latest developments in the industry. Chris’ writing is characterized by his ability to explain complex concepts in a clear and concise manner, making them accessible to a wide audience of readers.