Gensyn - A SETI@home-like protocol for distributed foundation model training

Gensyn is a company that was recently funded by a16z, and it caught my attention. I read the litepaper and wanted to explore this a bit further with my good friend ChatGPT. The initial prompt below is just a copy paste from the relevant section in the litepaper.

I hope you enjoy this conversation.

Suhas

Listen, I’m trying to create a distributed world computer, similar to SETI@home, but for training neural networks. I call this Gensyn.

Gensyn Protocol

The Gensyn Protocol is a layer-1 trustless protocol for deep learning computation that directly and immediately rewards supply-side participants for pledging their compute time to the network and performing ML tasks. The protocol requires no administrative overseer or legal enforcement, rather facilitating task distribution and payments programmatically through smart contracts. As described above, the fundamental challenge in building this network is the verification of completed ML work. This is a highly complex problem that sits at the intersection of complexity theory, game theory, cryptography, and optimisation.

A simple solution is to check the honesty of workers by re-doing their work. At a bare minimum, this requires a doubling of the operations required (‘single replication’); however, even with replication, the issue of trust remains unless the verifying party is the actual requestor of the work (in which case, they wouldn’t request the work as they’d simply perform it themselves). Therefore, ensuring the honesty of the verifying party can generate an infinite chain of replication, where each new verifier is required to check the work of the previous verifier. We solve this verification problem by interlocking three key concepts into a robust solution that is more efficient than existing best-practice replication methods; in doing so, it solves the infinite-chain problem.

The key concepts are:

Probabilistic proof-of-learning: Following Jia et al. (2021), we use the metadata from gradient-based optimisation processes to construct certificates of work performed, which can be verified quickly through replication of certain stages.

Graph-based pinpoint protocol: Following Zheng et al. (2021), we use a multi-granular, graph-based pinpoint protocol and cross-evaluator consistent execution to allow verification work to be re-run and compared for consistency, and ultimately confirmed by the chain itself.

Truebit-style incentive game: Following Teutsch and Reitwießner (2019), we use staking and slashing to construct an incentive game ensuring each financially-rational participant behaves honestly and performs their intended tasks.

Can you comment on this, and pair with me to explore further? Perhaps even an alternate version? The key problem as you might’ve realised is how to trust nodes if they’ve really trained honestly or not.

GPT-4

What you’re attempting to achieve with the Gensyn Protocol is fascinating, and indeed, a complex task. It incorporates several advanced concepts, namely probabilistic proof-of-learning, a graph-based pinpoint protocol, and a Truebit-style incentive game. Let’s further delve into these concepts and possibly identify potential improvements or alternate methods.

Probabilistic proof-of-learning:

This is indeed a very smart concept. By creating a ‘certificate of work performed’ from metadata obtained from the gradient-based optimization process, you can provide proof that work has been done without having to replicate the entire process.

However, one potential challenge might arise due to differences in the hardware used by the nodes in the distributed network. It is likely that not all nodes will have similar hardware, and this could introduce variations in the computational results. This challenge could be addressed by adopting a similar approach used by Ethereum’s ETHash or Filecoin’s Proof-of-Replication, where a memory-hard hash function or proof of uniquely encoding the data, respectively, is used to normalize the computation across different hardware.

Graph-based pinpoint protocol:

This approach makes a lot of sense for reducing the burden of verification. By allowing the work to be re-run at various points and then compared for consistency, you avoid having to verify the entire computation. It also adds a level of redundancy that can help ensure that if one node fails, the work can still be completed by other nodes.

As for improvements, in order to further reduce the verification burden, you could consider employing graph partitioning techniques. By dividing the computational graph into subgraphs, you can distribute the verification task among several nodes, thereby speeding up the verification process. Additionally, this could also help make the network more robust by adding additional layers of redundancy.

Truebit-style incentive game:

The use of a staking and slashing incentive scheme is a proven method to ensure network participants behave honestly. By forcing participants to ‘put their money where their mouth is’, so to speak, you incentivize them to perform their tasks correctly.

One potential modification to this scheme would be to consider a tiered staking system, where nodes that consistently prove their honesty and reliability can stake less over time. This would act as a reward for consistently trustworthy nodes and provide an additional incentive for nodes to behave honestly.

Moreover, it’s crucial to find the right balance in terms of the severity of slashing for incorrect results. Too much slashing may discourage participation, while too little may not provide enough of a deterrent for dishonest behaviour.

Alternative methods:

As you mentioned, the critical challenge in this protocol is the verification of completed ML work. One potential solution to explore could be the use of Zero-Knowledge Proofs (ZKPs), which allow one party to prove to another that they know a value (in this case, the correct output of a computation), without conveying any information apart from the fact they know the value.

Another potential direction to explore could be Federated Learning. While this is more commonly used for privacy-preservation, it could also be adapted for your purposes. By having each node contribute a model update (rather than a final model), and then aggregating these updates, you could potentially circumvent the need for full verification of each individual computation.

To conclude, the Gensyn Protocol indeed seems to have a solid foundation with the right key concepts in place. The alternative methods and improvements suggested above are intended to augment your already strong approach. Best of luck with your efforts!

Suhas

Can you come up with a simple algorithm (pseudocode) that would use federated learning techniques and train a transformer model (e.g. an LLM) across 100 nodes?

GPT-4

Here’s a simple pseudocode for Federated Learning with a Transformer model across 100 nodes. For simplicity, let’s assume that the model architecture and training parameters are the same for all nodes.

This algorithm will follow a few steps:

Initialize the model in a central server.
Distribute the model to each node.
Each node trains the model on its own data.
Nodes send their model updates back to the server.
Server aggregates updates and updates the global model.
Repeat steps 2-5 for a given number of rounds.

Initialize global_model on server

for round in total_rounds:

    Send global_model to all nodes

    for node in nodes:
        // Each node trains the model on its own data
        local_model = clone(global_model)
        for epoch in total_epochs:
            for batch in local_data:
                Run forward pass with local_model on batch
                Compute loss
                Run backward pass and compute gradients
                Update local_model parameters with optimizer

        // Calculate the difference between local and global model
        diff_model = local_model - global_model

        // Send the diff_model to the server
        Send diff_model to server

    // Server aggregates the model updates
    for node_diff in received_diffs:
        global_model += learning_rate * node_diff

This is a basic version of Federated Learning algorithm. In real-world applications, several optimizations and additional steps would be required. These might include techniques for secure aggregation of updates, methods to deal with non-IID data distribution, model compression techniques to reduce the communication overhead, and strategies for mitigating the effects of stragglers or dropped nodes.

Suhas

How would you figure out bad actors in this federated approach? What if I’m a bad actor and pollute my local model and send it to the server?

GPT-4

Great question. In federated learning, mitigating the impact of bad actors who might want to corrupt the learning process is a significant challenge. This can be done through a few different techniques:

Model Differential Privacy: This involves adding noise to the model updates or gradients, which makes it difficult for a bad actor to significantly impact the overall model since their malicious updates will likely be drowned out by the noise. This can help ensure that the model isn’t excessively influenced by any one participant.
Secure Aggregation: This process involves aggregating the model updates in a secure manner so that the central server only sees the sum of all updates, and not individual updates. If secure aggregation is done properly, a malicious update would need to be significantly larger than the combined honest updates in order to significantly influence the final model.
Robust Aggregation Methods: Instead of using standard mean for aggregation, one could use robust statistics like the median, which is less sensitive to outliers (i.e., bad actors). There is also recent research on Byzantine-resilient gradient aggregation methods.
Anomaly Detection: This involves continually monitoring the updates from each participant and if an update falls outside the norm (e.g., has too large a magnitude or causes a drastic change in model performance), the update could be rejected, or the participant could be temporarily banned from further updates.

Here’s how the pseudocode might look like with bad actor mitigation (via anomaly detection) included:

Initialize global_model on server

for round in total_rounds:

    Send global_model to all nodes

    for node in nodes:
        local_model = clone(global_model)
        for epoch in total_epochs:
            for batch in local_data:
                Run forward pass with local_model on batch
                Compute loss
                Run backward pass and compute gradients
                Update local_model parameters with optimizer

        diff_model = local_model - global_model

        if IsAnomaly(diff_model):
            HandleBadActor(node)
            continue

        Send diff_model to server

    for node_diff in received_diffs:
        global_model += learning_rate * node_diff

The IsAnomaly function checks if the model update (i.e., diff_model) is an anomaly or not. The HandleBadActor function could be designed to deal with detected anomalies in a variety of ways, such as flagging the node, reducing its future weight in aggregation, or temporarily banning it from the network.

Suhas

Interesting. Let’s go back to the ZKP approach. I’d like you to do the same - give me an algorithm with pseudocode to train an LLM over 100 nodes ensuring node honesty

GPT-4

Incorporating Zero-Knowledge Proofs (ZKPs) into the distributed training process can ensure that all nodes perform the training correctly and honestly, without needing to explicitly trust each node. Below is a simplified pseudocode of how such a system might work:

Initialize global_model on server

for round in total_rounds:

    Send global_model to all nodes

    for node in nodes:
        // Clone the global model
        local_model = clone(global_model)

        // Train the local model
        for epoch in total_epochs:
            for batch in local_data:
                Run forward pass with local_model on batch
                Compute loss
                Run backward pass and compute gradients
                Update local_model parameters with optimizer

        // Calculate the difference between local and global model
        diff_model = local_model - global_model

        // Generate a zero-knowledge proof of correct model training
        zk_proof = GenerateZKProof(local_model, global_model, diff_model)

        // Send the diff_model and zk_proof to the server
        Send(diff_model, zk_proof) to server

    // Server verifies the model updates
    for (diff_model, zk_proof) in received_updates:
        if VerifyZKProof(zk_proof, diff_model):
            global_model += learning_rate * diff_model
        else:
            HandleBadActor(node)

Here, the GenerateZKProof function generates a Zero-Knowledge Proof that verifies that the diff_model was correctly generated by performing the training process on the local_model. The VerifyZKProof function checks this proof. If the proof is invalid (i.e., the node didn’t perform the training process correctly), then the HandleBadActor function is called.

It’s important to note that constructing the ZKP for this task (i.e., ensuring correct model training) and designing a verification process that doesn’t require the verifier to run the entire training process itself is a very nontrivial task, and it would likely require a custom ZKP scheme. It’s also worth noting that this ZKP approach is more computationally intensive than the federated learning approach we discussed earlier. This pseudocode doesn’t address these challenges, but they’d need to be overcome for this approach to be viable in practice.

In addition, in a real-world setting, additional considerations such as privacy, communication efficiency, and robustness to faults (e.g., dropped nodes) would need to be taken into account when designing the distributed training process.

Suhas

How would you design the GenerateZKProof function for an LLM model? Remember that a neural network that’s large has billions of parameters that might not fit on a local machine…

GPT-4

Designing a Zero-Knowledge Proof (ZKP) scheme for validating that a large language model (LLM) has been trained correctly on a remote node is indeed a complex task, and it involves a fair amount of research in the areas of cryptography, complexity theory, and machine learning.

As you correctly pointed out, the size of an LLM poses a significant challenge. Large language models like GPT-3 have billions of parameters, and producing a ZKP that can verify the correct execution of a computation involving so many parameters without running the computation itself is nontrivial.

One possibility could be to leverage techniques from the field of Privacy-Preserving Machine Learning (PPML), which aims to train models on sensitive data while preserving the privacy of the data. Specifically, Secure Multi-Party Computation (SMPC) and Homomorphic Encryption (HE) are cryptographic techniques that allow computations to be performed on encrypted data. The idea would be to use these techniques to encrypt the model parameters and then perform the training computation on the encrypted parameters. This would result in an encrypted model update, which could be sent to the server along with a ZKP that proves that the update was computed correctly.

The pseudocode for the GenerateZKProof function might look something like this:

function GenerateZKProof(local_model, global_model, diff_model):
    // Encrypt the model parameters
    encrypted_local_model = Encrypt(local_model)
    encrypted_global_model = Encrypt(global_model)

    // Perform the training computation on the encrypted parameters
    encrypted_diff_model = PerformEncryptedTraining(encrypted_local_model, encrypted_global_model)

    // Generate a ZKP that proves the computation was done correctly
    zk_proof = GenerateProof(encrypted_diff_model, encrypted_local_model, encrypted_global_model)

    return zk_proof

Note that this is a very high-level and simplified view of what the function might look like. Designing the Encrypt, PerformEncryptedTraining, and GenerateProof functions would likely require significant research and development effort.

Furthermore, the computational and communication overheads associated with these cryptographic techniques can be quite high, and they would need to be taken into account when designing the system. Techniques for reducing these overheads, such as model compression and efficient ZKP schemes, would likely also need to be incorporated into the system.