SOMA Wordmark

Throughout history, humanity has built a shared understanding of the world. The relentless quest for knowledge has driven us towards abundance and prosperity. As AI systems promise to advance our knowledge even further, they must accurately understand the universe.

Embeddings are what AI models use to capture the meaning of data. Embeddings enable AI to search, generate, and understand information, translating what we know into a language that models can understand.

Creating these embeddings requires substantial energy resources, and concentrates power and knowledge in the hands of a few. Yet, the Internet has shown us that it is possible to democratize access to information and resources. For AI to truly elevate humanity, embeddings must be as openly shared as the data on the Internet.

S❍MA is a shared, self-improving pool of embedding models. The system is founded on these principles:

  1. Open: The benefits of AI’s knowledge base should be accessible and fairly shared.
  2. Continual: AI should learn and improve at the speed of the Internet’s growth.
  3. Democratized: Open access to embeddings will foster better research, decentralize power, and encourage a diverse range of people to bring new ideas.

Through a virtuous cycle of more data, better embeddings, and stronger incentives, S❍MA aims to produce abundant knowledge for humanity.

Let’s get started.


Overview

Mission

Soma's mission is to develop a deep understanding of the world by generating the most versatile and effective general-purpose embeddings for a wide range of data types.

Embeddings as Foundations

Embeddings are crucial for many existing machine learning applications. They provide a compact and efficient representation of complex data, enabling tasks such as natural language processing, computer vision, and recommendation systems. By transforming raw data into refined representations, embeddings serve as the building blocks that make more specialized AI tasks possible and effective.

Network Objective

The objective of Soma is to generate the most useful general-purpose embeddings for various data types, creating a deep understanding of the world. While the network supports online inference, its primary goal is to continuously improve a massive multimodal embedding network, such that models distilled from Soma outperform previous versions and set new standards on machine learning benchmarks. By transforming raw high-dimensional data into refined embeddings, Soma makes it possible to fine-tune models with minimal compute and energy, supporting a wide range of use cases.

Decentralization

Soma leverages the inherent advantages of decentralization:

Pooling of Resources: Decentralization allows the network to effectively pool compute power, energy, data, and machine learning techniques from a diverse set of contributors. This collective approach enables the creation of networks that would otherwise be prohibitively expensive to train.

Diverse Perspectives and Reduced Bias: The world is inherently decentralized, with data and insights spread across various sources and regions. Soma's network mirrors this natural distribution, ensuring that embeddings are enriched by a wide array of perspectives and contexts. This diversity reduces bias, as there is no central controlling entity, promoting fairness and inclusivity in data representations.

Continuous Improvement through Competition: The decentralized structure of Soma fosters continuous improvement through economic competition and a "survival of the fittest" dynamic. As the network evolves, it self-improves by encouraging the best contributions to rise to the top. By democratizing access to advanced AI capabilities, Soma lowers barriers to entry, allowing more individuals and organizations to benefit from and contribute to the development of a high-quality world model.

Architecture

Soma is composed of two major systems: the encoders and the validators. Encoders are isolated machine learning models equipped with webhooks that allow validators to call on them to perform tasks. Validators handle maintaining balances, validating transactions, issuing and calculating network rewards, and verifying the work performed by encoders.

This separation of roles is designed to focus the skill sets of network participants, improve maintainability, and increase modularity. Individuals with the expertise and stake necessary to run validators can do so without needing to be top-tier data scientists. Conversely, data scientists can contribute their skills without the additional work and stake requirements of running a validator.

Encoders

Encoders are machine learning models that generate embeddings from a given piece of data. Encoders are organized into groups called shards, which are selected using a stake-weighted random sample. The "best" embedding produced by the encoders in the shard receives a reward for its quality. Although only the highest-performing encoder is rewarded, all embeddings from the shard contribute to the final result. The final embedding returned to the data submitter is a weighted average based on the differential loss of the embeddings. If an embedding is deemed an outlier, the encoder responsible is slashed, and the proposed embedding is disqualified. The lowest-performing encoder models receive a small penalty, which nonetheless creates a strong incentive to improve or leave the network. The reward and slashing system is designed to meritocratically enhance performance and eliminate unfit encoder models.

Validators

Validators form the core infrastructure responsible for handling transactions, managing account balances, performing staking operations, and bridging between encoders and data submitters. They use a directed acyclic graph (DAG) method to achieve high throughput and fast finality, prioritizing speed and efficiency over atomic ordering of events. This results in lower latency between transaction submission and finality. For more detailed information, refer to the Consensus section.

Embedding Process

  1. A new piece of data is submitted to the network for a fee.
  2. The network self-selects a shard of encoders to individually embed the data.
  3. Their embeddings are aggregated with a performance-weighted average that ensembles features from the different encoders while ensuring high semantic integrity.
  4. The individual embedding with the best performance is rewarded for its work. Poor performing encoders are penalized.
  5. Shard selection becomes increasingly intelligent in routing submitted data to specialized encoders that perform best in their respective section of the embedding space.

Embeddings

Embeddings are low-dimensional vector representations of data. Embeddings make it easier to do machine learning on large inputs, including text, images, and videos.

The distance between two embeddings measures the semantic similarity of the underlying data. Small distances indicate similarity, while large distances indicate dissimilarity.

Position (distance and direction) in the space can encode semantics in a good embedding. For example, the following visualizations of real embeddings show geometrical relationships that capture semantic relations.

Embedding Relations

An embedding can be learned and reused across models, transferring knowledge from one model to another. Each dimension of the embedding expresses a unique feature about the data that can be used for a variety of applications.

In search, the goal is to retrieve semantically similar data given an input. A variety of applications can be supported in this way such as semantic search, answering questions, or summarization.

To perform a search, we embed the query, which may be natural language. Then we calculate cosine similarity between the resulting query embedding and each of the other embeddings. The highest cosine similarity results are most relevant.
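As a minimal sketch (assuming a hypothetical embedding step has already produced NumPy vectors for the query and the corpus), the search reduces to a few lines:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding: np.ndarray, corpus_embeddings: list[np.ndarray], top_k: int = 5):
    # Rank corpus items by cosine similarity to the query embedding; highest first.
    scores = [cosine_similarity(query_embedding, e) for e in corpus_embeddings]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)[:top_k]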

Multimodal embeddings can be used to search across different data domains. The text, image, and video embedding vectors are in the same semantic space with the same dimensionality. Therefore, these vectors can be used interchangeably for use cases like searching images by text, or searching video by image.

Embeddings for Classification

Embeddings present an elegant way of predicting a numerical value or a class for a piece of data. Because embeddings carry rich semantic information, predictions can be reasonable even with very few samples.

They can be used to cluster similar data points together. This can make it easy to find data points in a given category or to identify outliers.

We can even use embeddings for zero shot classification without any labeled training data. For each class, we embed the class name or a short description of the class. To classify some new text in a zero-shot manner, we compare its embedding to all class embeddings and predict the class with the highest similarity.
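A minimal sketch of this idea, assuming the class descriptions have already been embedded into a dictionary of NumPy vectors (names and shapes are illustrative):

import numpy as np

def zero_shot_classify(text_embedding: np.ndarray, class_embeddings: dict[str, np.ndarray]) -> str:
    # Predict the class whose description embedding is most similar to the text embedding.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(class_embeddings, key=lambda name: cos(text_embedding, class_embeddings[name]))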

Embeddings for Recommendations

Because shorter distances between embedding vectors represent greater similarity, embeddings can be useful for recommendation. These recommendations can be based on user behavior, content, or both, for items or advertisements.

Why are embeddings important?

Embeddings enable deep-learning models to understand real-world data more effectively. They simplify how real-world data is represented while preserving the semantic relationships.

The Problem with High-Dimensional Data

  1. Size of Network: Huge input vectors mean a huge number of weights for a neural network. A large number of weights causes further problems:
    • Amount of data. The more weights in your model, the more data you need to train effectively.
    • Amount of computation. The more weights, the more computation required to train and use the model. It's easy to exceed the capabilities of your hardware.
  2. Lack of Meaningful Relations Between Vectors: High-dimensional vectors don't have any inherent structure. The distance between two data vectors doesn't tell you anything about the similarity of the items they represent. This makes it hard for a model to learn about the relationships between items.

The solution to these problems is to use embeddings, which translate large vectors into a lower-dimensional space that preserves semantic relationships.

Reducing Data Dimensionality

Embeddings are used to represent high-dimensional data in a low-dimensional space. Dimensions are typically features or attributes of the data. For example, an image can be considered high-dimensional data because each pixel color value is a separate dimension.

While we want enough dimensions to encode rich semantic relations, we also want an embedding space that is small enough to allow us to train our system more quickly. A useful embedding may be on the order of hundreds of dimensions.

Models require more computational power and time to learn, analyze, and infer from high-dimensional data. Embeddings reduce the number of dimensions by identifying commonalities and patterns between various features. This reduces the computing resources and time required to process raw data, and often helps improve a model's performance.

Training large models

Embeddings improve the quality of training data for models such as large language models (LLMs). Pre-trained models can be repurposed by adding new embeddings to transfer new knowledge, without retraining the entire model. This allows a model to be fine-tuned with custom datasets for specific applications.

Creating Embeddings

To create embeddings, a neural network is trained to predict some property of the data. For example, you can train a neural network to predict the next word in a sentence. The weights learned in the network's embedding layer are the embeddings.

When learning a d-dimensional embedding, each item is mapped to a point in a d-dimensional space such that similar items are nearby. This embedding layer can be combined with any other features and hidden layers. As in any deep neural network, the final layer is the loss that is being optimized.
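For instance, a minimal embedding layer in PyTorch (a sketch for illustration, not part of the network itself): each of the num_items inputs is mapped to a learnable d-dimensional point that downstream layers can consume.

import torch
import torch.nn as nn

d = 64              # embedding dimensionality
num_items = 10_000  # size of the item vocabulary

# Each row of the layer's weight matrix is the learned d-dimensional point for one item.
embedding = nn.Embedding(num_items, d)

item_ids = torch.tensor([3, 17, 42])
vectors = embedding(item_ids)  # shape: (3, d)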

Movie DNN

For example, let's say we're using a neural network to predict a movie a user would be interested in based on the movies other users have watched. Here the input layer would be the movies each user has watched, and the output layer would be the movie the user is likely to watch next. The embedding layer would be the d-dimensional space where each movie is represented as a point. The network learns these movie embeddings as it trains to make the prediction.

Embedding Layer

The image above helps to illustrate the relationship between the weights learned in the embedding layer and the geometric view. The edge weights between an input node and the nodes in the d-dimensional embedding layer correspond to the coordinate values for each of the d axes.

Shared World Model

The ability for humans to learn and understand the world is currently far beyond the capabilities of machines. We can learn quickly, generalize to new situations, and continuously update our understanding of the world with very little guidance. Machines on the other hand need to be trained on vast amounts of data to learn even the simplest of tasks, and struggle to adapt to new situations without retraining.

Our intelligence may in great part be a result of our capacity to learn world models. We build representations of the world in order to make predictions and plan actions. We constantly update this model as we observe new things. What we call "common sense" is the result of this continuous learning process, and it tells us what is likely, plausible, and impossible in a given situation. Using world models, humans learn new skills with very few trials in a task-independent, unsupervised manner.

World Models

World Model

The role of the world model is:

  1. To estimate missing information about the state of the world not provided by perception
  2. To predict future states of the world

In order to do this, the world model must have an abstract representation space, or embedding space, that contains all the information necessary to make these predictions. It should also be self-aware of its own limitations and uncertainties, and be able to seek the information it needs to improve its understanding.

Barriers

Large language models have showcased some of these zero-shot and few-shot learning capabilities, but they are still far from the general intelligence of humans.

These models are trained on vast amounts of data and require enormous computational resources to pre-train. Once trained, their model of the world remains static, and queries are required to fit within context windows to update the system with any information it may not have in the moment. Furthermore, they are trained on data that is biased towards a single party's collection and alignment criteria, rather than a more global understanding of the world.

Better World Model

Thus, resource constraints and representation bias are two of the main barriers to creating the best representations for an accurate world model.

A Better World Model

Continuous Learning. A world model should be able to continuously learn and update its understanding of the world. This requires a network that can continuously update its embeddings with new data.

Incentivized Resource Pooling. Given its intrinsic value, a world model should acquire resources in proportion to the demand for its representations. Networks such as Bitcoin have shown that by incentivizing participants to contribute resources, a decentralized network can be built that accumulates more resources than a centralized one. Given that computational resources are the main bottleneck in training large models, a decentralized network can provide more resources for training a world model than any single entity.

Decentralized Knowledge Acquisition. A world model should be able to acquire knowledge from a diverse set of sources. This requires a network that can accept data from a variety of sources and ensure that the data is not biased towards a single party's collection criteria. The network should also be able to incentivize filling gaps in the world model's understanding.

A Shared Multimodal Embedding Space

By building a network of smaller, diversified, and competing encoders, embeddings can be produced in a way that continuously updates a shared world model. A shared embedding space is not only useful for a variety of tasks, but is also continuously updated with new information and self-improving representations via market demand.

Shared Embedding Space

Training Encoders

Introduction to Encoders and Embeddings

Embeddings represent data in lower-dimensional spaces while preserving essential properties and relationships. Encoders are neural network architectures designed to generate these embeddings. The journey of training encoders has seen significant innovations, each improving the quality and utility of the generated embeddings.

Early Approaches: Autoencoders

Basic Autoencoders

Autoencoders are the earliest and simplest form of encoders. They consist of two parts: an encoder that compresses the input data into a latent space representation, and a decoder that reconstructs the original data from this representation.

Simple Autoencoder

Architecture: Typically, autoencoders have a symmetrical architecture where the encoder and decoder are mirror images of each other.

Training Objective: The primary goal is to minimize the reconstruction loss, usually measured by Mean Squared Error (MSE) or Binary Cross-Entropy (BCE).
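A minimal autoencoder sketch in PyTorch with a mirrored encoder/decoder and an MSE reconstruction objective (the dimensions are illustrative assumptions):

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # The encoder compresses the input into a latent representation...
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # ...and the decoder mirrors it to reconstruct the original input.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(16, 784)
loss = nn.functional.mse_loss(model(x), x)  # reconstruction loss to minimize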

Denoising Autoencoders

Denoising autoencoders (DAEs) extend the basic autoencoder by introducing noise to the input data, forcing the encoder to learn more robust features.

Denoising Autoencoder

Noise Introduction: Common techniques include adding Gaussian noise or masking random portions of the input.

Objective: The network learns to reconstruct the original, noise-free data from the corrupted input, enhancing robustness.

Variational Autoencoders (VAEs)

VAEs introduce probabilistic elements to the encoder, allowing it to learn a distribution over the latent space rather than a fixed vector.

Latent Space Regularization: VAEs use Kullback-Leibler divergence to regularize the latent space, ensuring that the learned distribution approximates a prior distribution (e.g., Gaussian).

Applications: VAEs are particularly useful in generative tasks and anomaly detection.

Advanced Techniques

Contrastive Learning

Contrastive learning has gained popularity for learning high-quality embeddings by contrasting positive and negative samples.

Contrastive Pairs

Triplet Loss: One of the earliest methods in this category, triplet loss involves training the encoder to bring an anchor and positive sample closer while pushing the anchor and negative sample apart.

InfoNCE Loss: An improvement over triplet loss, InfoNCE is used in methods like SimCLR and MoCo, leveraging larger batches and multiple negatives for better performance.
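For reference, the standard formulations of these two losses (not specific to this network) can be written as:

\[ \mathcal{L}_{\text{triplet}} = \max\left(0,\; \lVert f(a) - f(p) \rVert^2 - \lVert f(a) - f(n) \rVert^2 + \alpha\right) \]

\[ \mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k \neq i} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \]

where \(a\), \(p\), and \(n\) are the anchor, positive, and negative samples, \(\alpha\) is a margin, \(z_i\) and \(z_j\) are the embeddings of a positive pair, \(\mathrm{sim}\) is cosine similarity, and \(\tau\) is a temperature.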

Masked Modeling

Masked modeling techniques, inspired by BERT in natural language processing, involve masking parts of the input data and training the encoder to predict the missing parts.

Masked Reconstruction

Masked Language Modeling (MLM): For text, tokens are masked, and the model predicts the original tokens.

Masked Image Modeling (MIM): Applied to images, patches are masked, and the model reconstructs the missing parts.

Objective: These techniques encourage the model to capture contextual information, leading to richer embeddings.

JEPA

JEPA introduces a novel approach by evaluating embeddings in the embedding space rather than in the original data space. Using the embedding space is superior because it tests the understanding of semantic ideas rather than low-level specifics that may be less relevant. Another aspect of JEPA is the use of a secondary predictor model to predict the embeddings for missing pieces of data. The predictor helps improve the embedding model by signaling what features improve prediction.

JEPA

JEPA's predictor model receives the embeddings for the context and predicts the embeddings for the targets.

Masking Input Data

In order to self-improve the shared world model, embeddings must be assessed objectively. This poses the question: what makes a high-quality representation?

Encoders and Masking

In machine learning, masking input data has proven surprisingly successful in producing high-performing models. The original Transformer paper trains its decoder with a causal mask that hides the tokens later in the sequence, prompting the model to learn how to predict the next token.

Masked Language Modeling

The seminal BERT paper showed improved performance when masked pretraining was applied not just to the end of the sequence, but to random parts of it. Models learned not only to predict the future, but to predict any missing data using whatever sequential relationships they could grasp.

The insights from masked language modeling made their way to the visual domain via the Vision Transformer (ViT). By subdividing an image into patches and tasking the model with predicting masked regions of a sequence of patches, the network learns both semantic and spatial relationships between different areas of the image. ViT replaced convolutional neural networks as the state-of-the-art architecture for computer vision.

Masked Image Modeling

More recently, generative video models, such as OpenAI's SORA, are trained by masking spatiotemporal "tubelets" from video clips. Interestingly, as the data modality becomes more information-dense, the optimal masking ratio required to train competitive models increases (e.g. ~15% masking for text sequences, ~85% for images, >90% for videos).

Masked Video Modeling

Enshrined Network Masking

The network makes the following assumption: high-quality representations of data capture the semantics of its subcomponents. The performance metric in the following section builds on this assumption through masking.

Prior to embedding, every encoder in a shard masks the input data with a common set of distinct random masks.

Modality-Specific Tokenization and Masking Ratios

For every modality of data, a different mechanism is used to break the data into units of a sequence. The network shares a common and upgradable tokenizing mechanism for each modality:

  • For text, the SentencePiece tokenizer is used. A fixed 15% of the elements in the tokenized sequence are masked.
  • For images, the images are broken into 16px by 16px patches and flattened. A fixed 85% of the elements in the tokenized sequence are masked.
  • For videos, clips are broken into 16px by 16px patches extended across 2 subsequent frames and flattened. A fixed 90% of the elements in the tokenized sequence are masked.

As the network upgrades to support new modalities, each modality will have a corresponding tokenizer.

Deterministic Randomness

After input data is tokenized, a set of deterministic seeds is passed into a pseudorandom number generator (PRNG) to determine which elements of the sequence an individual encoder must mask. The seed is a combination of:

  • The hash of the last committed block from the previous epoch
  • The hash of the input data
  • The index of the mask in the array of masks

By using a deterministic mechanism, masking can be verified as correct across the encoders in the shard.
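A minimal sketch of how such masks could be derived deterministically (the seed components follow the list above; the exact serialization and choice of PRNG are assumptions for illustration):

import hashlib
import random

def mask_indices(block_hash: bytes, data_hash: bytes, mask_index: int,
                 sequence_length: int, mask_ratio: float) -> list[int]:
    # The seed combines the last committed block hash, the data hash, and the mask index.
    seed = hashlib.sha256(block_hash + data_hash + mask_index.to_bytes(4, "big")).digest()
    rng = random.Random(seed)  # deterministically seeded PRNG
    num_masked = int(sequence_length * mask_ratio)
    # Every encoder derives the same masked positions and can verify its peers.
    return sorted(rng.sample(range(sequence_length), num_masked))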

Differential Loss

Differential loss leverages the unique property of embeddings to be composable, allowing the decomposition and reconstruction of semantic components within data.

This concept is exemplified by the well-known relationship in Word2Vec:

Embedding("king") - Embedding("man") + Embedding("woman") ≈ Embedding("queen")

Embeddings are powerful because they can capture semantic relationships within data. Differential loss evaluates the embeddings' ability to maintain these relationships when parts of the data are masked. This composability is crucial for understanding subcomponent semantics within the data.

To evaluate embedding performance, we examine the loss by assessing how well certain parts of the data add or remove semantic meaning. Specifically, we measure:

Embedding(full_data) - Embedding(mask) ≈ Embedding(inverse_mask)
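A minimal sketch of the corresponding per-mask score (assuming the encoder has already produced embeddings for the full data, the mask, and its inverse):

import numpy as np

def differential_loss(e_full: np.ndarray, e_mask: np.ndarray, e_inverse_mask: np.ndarray) -> float:
    # How far Embedding(full_data) - Embedding(mask) lands from Embedding(inverse_mask); lower is better.
    return float(np.linalg.norm(e_full - e_mask - e_inverse_mask))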

Differential Masks

Reducing Cheating with Differential Loss

As the complexity and overlap of masks increase, cheating at differential loss becomes more challenging. The non-linearity of generating embeddings from data or masks further complicates attempts to exploit the system. This complexity ensures a robust evaluation of the model's compositional understanding. The goal of differential loss is to effectively make it such that gradient-based estimation is the easiest way to perform well.

Lower Communication Overhead

Evaluating embeddings in the embedding space requires substantially less bandwidth compared to traditional methods that evaluate performance in the data space. This reduction in communication overhead is particularly advantageous in distributed systems where bandwidth efficiency is critical. A 1024-dimensional float32 embedding is only 4 KB, whereas raw images and videos can reach gigabytes.

Objective Performance Measurement

Differential loss allows for an objective, unsupervised measurement of model performance. By evaluating how well a model can decompose and recompose embeddings, we can apply a consistent scoring system without requiring labeled data. Additionally we do not rely on trusted oracles, trusted model-evaluation nodes, or purely averaging subjective responses that ultimately lead to stagnation in performance.

Performance Weighted Averaging

Using the differential loss score as a weighting factor helps boost important features in the final embedding. Using an average is beneficial because we can capture the benefits of ensembling responses to improve capacity and reduce variance. Another benefit of averaging is transfer learning, where the embedding models teach each other slightly. Using a performance score allows the network to learn from the best encoders and to combine features of embeddings that might otherwise have been discarded as outliers.
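A minimal sketch of this aggregation step (the exact weighting function is an assumption; here weights are the inverse of each encoder's differential loss):

import numpy as np

def weighted_average_embedding(embeddings: list[np.ndarray], losses: list[float]) -> np.ndarray:
    # Lower differential loss -> higher weight in the ensembled embedding.
    weights = np.array([1.0 / (loss + 1e-8) for loss in losses])
    weights /= weights.sum()
    return np.sum([w * e for w, e in zip(weights, embeddings)], axis=0)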

Outlier Filtration

Having an objective performance score allows us to relax how we treat outliers. While outlier filtering is important for maintaining consensus over the embedding space, it is important to enable more intelligent encoders to shift the embedding space if it results in better performance. Using a naive approach of a simple average would require more aggressive outlier filtering because malicious encoders could shift the final embedding significantly. With the differential loss score, the network can objectively figure out which encoders should shift the embeddings and by how much. Therefore the use of outlier filtering is just to stop obviously wrong embeddings that are far from consensus on where the embedding should be in space.

A local outlier factor (LOF) is calculated for every embedding relative to its peers in the shard. The LOF impacts the overall score.
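For reference, the local outlier factor can be computed with an off-the-shelf implementation such as scikit-learn's (a sketch; the network's own implementation and parameters may differ):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

shard_embeddings = np.random.rand(12, 1024)  # one embedding per encoder in the shard

lof = LocalOutlierFactor(n_neighbors=5)
lof.fit(shard_embeddings)
# negative_outlier_factor_ is roughly the negated LOF; scores well above 1 indicate outliers.
lof_scores = -lof.negative_outlier_factor_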

Conclusion

Differential loss provides a robust framework for evaluating the composability and semantic understanding of embeddings. By measuring the ability to decompose and recompose data accurately, differential loss offers a unique and effective approach to model evaluation, reducing communication overhead and enabling unsupervised performance measurement. This method's adaptability and efficiency make it a valuable tool in the development and assessment of embedding models in a permissionless environment.

Shard Selection

To enhance network throughput, minimize communication overhead, and reduce redundant computation, the network selects smaller shards of encoders to generate embeddings for each data piece.

Deterministic Shard Selection

Using a network-defined random seed, validators can perform a weighted sample from all nodes based on their stake (own stake + delegated stake). Each epoch, the network reaches consensus on the participating encoders and updates the cumulative stake. With the updated stake list, the random seed, and a hash of the data, the shard for a data piece is determined deterministically and asynchronously.
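A minimal sketch of this deterministic, stake-weighted selection (how the seed and data hash are combined is an assumption):

import hashlib
import random

def select_shard(encoders: list[str], stakes: list[float],
                 epoch_seed: bytes, data_hash: bytes, shard_size: int) -> list[str]:
    # Deterministically sample a shard, weighting encoders by their total stake.
    rng = random.Random(hashlib.sha256(epoch_seed + data_hash).digest())
    pool, pool_stakes, selected = list(encoders), list(stakes), []
    for _ in range(min(shard_size, len(pool))):
        # Weighted draw without replacement so an encoder is not picked twice.
        pick = rng.choices(range(len(pool)), weights=pool_stakes, k=1)[0]
        selected.append(pool.pop(pick))
        pool_stakes.pop(pick)
    return selected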

Consensus on Final Embedding

A quorum of 90% of the shard is required to return an embedding for the data. This consensus method ensures that a substantial majority within the shard must agree, providing robustness against dishonest nodes.

Monte Carlo Simulations

Monte Carlo simulations showed that using this method results in satisfactorily low probabilities of dishonest majorities across all tests.

Calculating Shard Size

The hypergeometric distribution can be used to calculate the shard sizes from a given security budget (probability of dishonest majority) and total number of encoders. While the hypergeometric distribution does not capture the stake distribution, it works for an initial estimation. In the future, shard size calculation may also factor in stake distribution.
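A minimal sketch of that estimation using SciPy's hypergeometric distribution (parameter names are illustrative; stake weighting is ignored, as noted above):

from scipy.stats import hypergeom

def dishonest_majority_probability(total_encoders: int, dishonest: int, shard_size: int) -> float:
    # Probability that a uniformly sampled shard contains a dishonest majority.
    rv = hypergeom(M=total_encoders, n=dishonest, N=shard_size)
    return float(rv.sf(shard_size // 2))  # P(more than half the shard is dishonest)

def minimum_shard_size(total_encoders: int, dishonest: int, security_budget: float) -> int:
    # Smallest shard size whose dishonest-majority probability stays under the budget.
    for size in range(1, total_encoders + 1):
        if dishonest_majority_probability(total_encoders, dishonest, size) <= security_budget:
            return size
    return total_encoders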

Security Budget Estimation

Utilizing a Poisson distribution, it is possible to estimate the time between events at various security budgets based on the theoretical maximum embeddings per second. Since encoder shards do not control balances in a significant way, a mean and median time between events of a couple of years is enough of a deterrent, given that poor-performing encoders are slashed.

Key Takeaway

Shard Sizes

The network's sharding approach allows it to scale out horizontally. Because shard sizes statistically scale at a much slower rate relative to the total number of nodes, the network can efficiently handle an increasing number of nodes while maintaining security and performance.

Market of Experts

Differential loss allows us to objectively measure the performance of specific models. Using shards of models we are able to introduce redundancy, align embeddings between models, leverage the advantages of ensembling, and transfer learn between models.

However, there are physical constraints to the individual models that can be run. GPT4 is rumored to be 1.76 trillion parameters. Running a 1.76 trillion parameter model is expensive and we cannot assume that every participant will run a large parameter model.

Quadrillion Parameter Question?

How do we increase the parameter capacity of the network as a whole without being physically limited to the max model size that one encoder can run?

The answer is market forces.

Mixture of Experts (MOE)

Most modern large language models leverage a technique called Mixture of Experts (MoE) to scale their model capacity. While the practical implementation can vary, the core idea is to route the data to a smaller part of the network that is knowledgeable about that area, the "expert."

MOE

The MOE architecture relies on a learned gating function. The gating function acts as a router, passing the input signal to specific experts. To tune the gating function, the model is working to minimize its overall loss, but additional methods can be employed to encourage specialization.

Market Forces

Given the permissionless nature of the network, we optimize through incentives.

Remember, when embeddings are generated by a shard, there is only one model that wins the reward for performing the work, and the models inside of a given shard are selected randomly.

The key is to introduce the ability to ask another model for help via proxy. If the proxy model wins the reward, the person who proxied to it splits the fees. If a model has received a piece of data that they know they will perform poorly at, they are economically incentivized to route that data to a model that they believe will win.

A simple way to think of this is calling a friend on a game show.

By introducing this simple concept of a proxy, we get two new dynamics to improve network performance:

  1. Specialization
  2. Competition

Encoder models are now economically encouraged to become good at specific areas in the embedding space that offer a high return with lower competition. Furthermore, if a certain area of the embedding space has high demand (a lot of data passes through that area), there are now incentives to compete to improve performance and win rewards.

Routing

How does a model selected in a shard know whether it will perform well on a given piece of data? Quite simply, it tries to encode the data into an embedding. Given the generated embedding, it can look up which models have historically won in that area of the embedding space and their average differential loss. If it beats the historical models, it should submit its embedding. If it would lose, it should proxy to one of those winning models.

Latency

A proxied model can only participate on behalf of one person in the shard. There is then a race to contact the best proxy model and claim them for work. Because the slowest models (or proxies) in a shard get slashed, there is a strong incentive to either do the work or proxy as quickly as possible such that the chosen model can deliver an embedding before the slowest models.

Calling Encoders

Initial Transaction

The individual submitting data to the network for encoding submits a special transaction that contains the entire fee for performing encoding. This fee is dependent on the modality and size of the dataset. To link to a dataset without risking DDoS attacks from external parties attempting to censor the data submitter, a hash of (dataset hash, submitting address) is included in the transaction. This allows called encoders to quickly verify that they should participate in encoding when provided with the data hash, while keeping it hidden from the rest of the network.

Data Transfer

The individual submitting the data then contacts all the nodes in the shard (a random but deterministic subset of the network) based on the data hash plus some other network randomness. The shard selection is deterministic based on the round number and the stake for the epoch; every round, the shard changes for a given data hash. The submitter provides the transaction hash and data hash, which act as a pre-image proving to the encoders that they should perform the work. This message is signed.

After a predetermined number of rounds, this signed pre-image can be submitted to the network to claim the data submitter's funds and split them between the encoders.

The data is then transferred to the encoder and hashed to ensure that it matches the data hash. If it does not match, the encoder ceases work and notifies the other encoders in their shard.

By notifying the other encoders prior to commit selection, they reduce the risk of being slashed for latency. If they do end up getting selected and it comes to light that they committed after saying they would not, they are heavily slashed.

Commit

If the data checks out, the encoder then starts executing the encoder model on the data. The resultant embeddings are hashed to create a “commit” hash. The commit hash is returned to the individual submitting the transaction.

Selection

To progress to the reveal stage, a quorum of encoders must be met, which is between 80% and 90% of the shard. The data submitter must sign off on the selected quorum of commits and the corresponding encoders. The typical flow is to select the first nodes that respond, which incentivizes low latency, but selection is left entirely to the person paying the network. We want to incentivize the lowest-latency responses because the quality of the embeddings is evaluated later using a different method.

It is important to note that, due to the market-of-experts routing mechanism, it is possible that other encoders are participating in the game, unbeknownst to the data submitter.

At this point, the encoders that were not selected are slashed unless they proactively notified the rest of the encoders that they opted out.

If the leader goes offline before making a selection, then after a predetermined number of blocks the encoders can split the encoding fee evenly across the shard. While this might create an incentive for all encoders to not do the work, they still run the risk of being slashed for latency if they do not.

Reveal

The data submitter provides the selection proof (signature of selection) to the selected encoders. The encoders then verify the proof and reveal their embeddings.

The encoders respond with a signed message that includes their embeddings (e.g. committed via a Merkle root or a similar structure).

A mismatch between the commit and reveal values, or a lack of response from an encoder, results in slashing. The mismatch is proven to the other encoders in the network.

The leader responds with all the received embeddings and associated signatures, along with a sign-off on that bundle. There is an incentive to do this because, without the final proof, all of their funds remain at risk.

Final Proof

The encoders then evaluate the quality of the embeddings and calculate the final embeddings based on performance. They submit a final proof to the network that records the encoders' consensus on who won and supplied the highest-quality embeddings, who was slashed, and so on. Failure to comply results in major slashing for encoders. The encoders use a BLS signature and provide the final embeddings to the greater network.

Tokenomics

The network's token serves as the foundation for its economic model, ensuring sustainability and participation. This design aims for:

Stable Functionality: Guarantee continuous emissions to support essential operations like staking, data storage, and embedding space optimization.

Meritocratic Ownership: Gradually redistribute ownership to contributors over approximately 20-year cycles.

Predictable Supply: Provide a predictable monetary policy and safeguard against inflation, ensuring the token is a reliable store of value.

Fair Distribution: Implement a small but highly frequent reward system for equitable distribution.

Economic Forces

There are three counteracting economic forces that control the monetary supply:

Minting. Token minting is controlled solely by a predictable mathematical formula that emits a certain amount of tokens over a given period of time. The minting schedule has an exponential decay rate, such that more tokens are emitted earlier in the lifecycle of the network, similar to Bitcoin's halvings. The decay rate of emissions is calculated from a target: 90% of the total supply emitted by the time the network is 10 years old.
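Assuming a continuous exponential-decay model (an illustrative simplification of the schedule described above), the cumulative emission curve and the decay rate implied by the 90%-by-year-10 target are:

\[ E(t) = S_{\max}\left(1 - e^{-\lambda t}\right), \qquad E(10) = 0.9\,S_{\max} \;\Rightarrow\; \lambda = \frac{\ln 10}{10} \approx 0.23 \text{ per year} \]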

Burning. When tokens are used in the network for staking, transactions, transfers, etc., a fixed network burn fee is attached to the transaction. This fee is separate from the cost of generating embeddings, which is controlled by the market, similar to how gas fluctuates in other networks. The fixed transaction fee is similar to a tax that can be reused to service the public goods of the network. The burn fee uses a simple algorithm to keep the annual sum of all fees close to 5% of the circulating supply. This means that as network usage increases, the fee per transaction decreases.

Emissions. The network emissions distribute minted tokens, or redistribute tokens that have been burnt, to incentivize the healthy functionality of the network. These rewards are distributed to people that: run validators, delegate stake to validators, submit valuable data, and generate embeddings. The emissions system can be tuned such that the split of rewards is dynamically changed to improve network functionality. By recycling the 5% burned tokens, we can ensure a steady network reward.

Timeline

The objective of the economic systems is to ensure the healthy functioning of the network. The result is a system that becomes predictable in a short timeframe for all parties involved.

First 30 Days

Looking at the first 30 days, we can see a noticeable difference between the network emissions and burn rate.

First Year

Within the first year, the difference between the emissions and the network burn rate is just barely noticeable.

10 Years

At the 10 year mark, 90% of the supply is in circulation. The difference between emissions and burn is imperceptible.

30 Years

At 30 years, the hard cap of 1e6 (1 million) tokens is evident.

Recap

The tokenomics are built to fairly distribute tokens, maintain healthy network functionality, and ensure a dependable monetary supply. We use a recycling method of burning tokens and emitting to ensure that the participants of the network are not subjected to large changes in the system but rather slowly and gradually approach a maintainable end state. As the network transactions scale, per transaction fees decrease. Part of governance is adjusting the emissions splits between the various subsystems to optimize stability.

Fees

Base Transaction Fees

The network targets a 5% annual burn of circulating supply. Every network transaction has a base fee that continually adapts such that the amount burned per transaction meets the projected 5% target. The base fee is adjusted every epoch using a control algorithm.
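The exact control algorithm is not specified here; as an illustration, a simple damped proportional controller over hypothetical per-epoch quantities could look like:

def adjust_base_fee(current_fee: float, burned_this_epoch: float,
                    circulating_supply: float, epochs_per_year: int) -> float:
    # Nudge the base fee so that annualized burn tracks the 5% target.
    target_per_epoch = 0.05 * circulating_supply / epochs_per_year
    if burned_this_epoch == 0:
        return current_fee
    adjustment = target_per_epoch / burned_this_epoch
    # Dampen the adjustment to avoid large swings between epochs.
    adjustment = max(0.9, min(1.1, adjustment))
    return current_fee * adjustment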

Cost of Generating Embeddings and Submitting Data

The cost of generating an embedding is set by a free market. When encoders join the network, they submit their cost per byte for each modality in which they participate. The costs are aggregated across all encoders, ordered, and the two-thirds percentile value by stake is selected as the stable rate for the epoch. At any point, an encoder can submit a transaction to the network that changes their price per byte for any modality. At the epoch boundary, the two-thirds percentile values are recalculated to account for the new values and applied for that epoch. Any additional fee paid on top of the base cost of generation acts as a tip.
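A minimal sketch of the stake-weighted percentile calculation (tie-breaking and the exact percentile convention are assumptions):

def stable_rate(prices: list[float], stakes: list[float], percentile: float = 2 / 3) -> float:
    # Pick the price at the given stake-weighted percentile as the epoch's stable rate.
    ordered = sorted(zip(prices, stakes))  # order encoder quotes by price
    threshold = percentile * sum(stakes)
    cumulative = 0.0
    for price, stake in ordered:
        cumulative += stake
        if cumulative >= threshold:
            return price
    return ordered[-1][0]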

The "best" encoding model from a shard receives the full fee amount.

The cost of submitting data is identical to the cost required to generate embeddings: the person submitting data must pay in order to receive an embedding. In the case of network rewards issued for that data, the data submitter takes a risk by paying the embedding fee and only receives a network reward if the data meets the criteria; this mitigates spam.

Cost of Proxying

The cost of Proxying to a model is decided as a free market by the proxied model. We impose no network-level control here, just the mechanism to have a market.

Rewards

Embedding Incentive

Every winner of an embedding competition receives a network reward. The reward amount per win is small and distributed evenly across all shards. The objective is to incentivize performance while distributing rewards fairly over time. The best data scientists should probabilistically receive the most rewards over time.

Data Submission Incentives

Every round, the top-k highest scores in the Proof of Curiosity game receive a small network reward. Again, the objective is to distribute small rewards frequently and across many participants. We want to incentivize performance while keeping the distribution of network rewards extremely fair over time. The best data submitters should probabilistically receive the most rewards over time.

K is defined in Governance Parameters.

Validator Staking Incentive

Validator staking is not tied to performance in any way. Running a validator offers a stable and predictable stream of network rewards proportional to the cumulative amount of stake. Delegating to validators offers the same return minus whatever commission the validator has set.

Validator Transaction Incentive

Validators receive a small reward for every operation they include inside of a block. This is to encourage validators to produce full blocks.

Slashing

Encoder Performance

Encoders are evaluated on multiple metrics. The lowest-k encoders, based on a composite score of the following three metrics, are slashed:

Differential Loss

The total loss from differential loss. Lower loss is better.

Outlier

How much do the encoders' embeddings deviate from the rest of the shard? Submitting an outlier is bad.

Latency

High latency negatively impacts the encoders' scores.

Malicious Validators

In order to be slashed, a validator must receive a vote from 2f+1 of the network stake. Validators report each other for not producing blocks or for high latency. When 2f+1 of the network stake has voted to slash a validator, that validator loses an amount of stake every epoch at a decay rate.

Proof of Curiosity

While any piece of data can be embedded by the network, not all data is equally valuable in improving the shared world model. It is crucial to incentivize the submission of data that fills gaps in the embedding space, as this data is the most valuable for improving the network's understanding of the world.

In every epoch, network tokens are rewarded for the piece of data that best fills a gap in knowledge in the known embedding space. This incentivizes the embedding of novel data that is not well-represented in the current embedding space, a mechanism called Proof of Curiosity.

Mechanism

Proof of Curiosity

Validators assess whether an embedding fills a gap in the space by finding its nearest neighbors among the historical embeddings, then calculating the average distance between those neighbors and their centroid.

The centroid is the center of the gap being filled, and the average distance between the centroid and the neighbors is the size of the gap.

Scoring

\[ d_{nn} = \frac{1}{N}\sum_{i=1}^{N} \lVert C - nn_i \rVert \]

  1. A person submits a piece of data that is transformed into the embedding space.
  2. The embedding is used to approximately calculate the \(k\) nearest neighbors \(nn\).
  3. The neighbors \(nn\) are used to calculate the centroid \(C\) or mean of the cluster.
  4. The average distance \(d_{nn}\) from each neighbor is calculated from the centroid.

To validate that the embedding is not too far from the centroid, we compare it against a cutoff value as follows:

  1. The distance \(d_{E}\) from the embedding to the centroid is calculated.
  2. \(d_{E}\) is checked to see whether it is less than or equal to ⅓ of \(d_{nn}\); otherwise the embedding is disqualified.

This scoring mechanism incentivizes both specialization in areas of the embedding space and exploring new boundaries of knowledge. Approximate nearest neighbors algorithms (such as NGT) can be used to efficiently determine the nearest neighbors for the embedding.
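A minimal sketch of the scoring and cutoff check (assuming the k nearest neighbors have already been retrieved, e.g. via an approximate nearest-neighbor index):

import numpy as np

def curiosity_score(embedding: np.ndarray, neighbors: np.ndarray) -> float | None:
    # Returns the gap size d_nn if the embedding qualifies, otherwise None (disqualified).
    centroid = neighbors.mean(axis=0)                                    # C: center of the gap
    d_nn = float(np.mean(np.linalg.norm(neighbors - centroid, axis=1)))  # average neighbor distance
    d_e = float(np.linalg.norm(embedding - centroid))                    # candidate's distance to the centroid
    if d_e > d_nn / 3:                                                   # cutoff: must be within 1/3 of d_nn
        return None
    return d_nn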

The reward mechanism can be adjusted to proportionally reward the top N scoring embeddings for an epoch to more frequently reward gap-filling data submissions.

Synthetic Data

Aside from randomly guessing, winning the Proof of Curiosity game requires knowing the space of historical embeddings and having the ability to produce a piece of data from a target gap-filling embedding. This may require generating synthetic data, using a diffusion model or other methods.

A perpetual feedback loop of better representations and higher quality gap-filling data should lead to richer embeddings for downstream applications.

Multimodality

The network supports the unification of embeddings across different modalities, including text, images, video, and audio. The idea is that embeddings produced by the network retain modality-specific semantic structure while being endowed with shared semantics across modalities. Every encoder participating in the shared embedding space is responsible for encoding any modality of data, outputting embeddings with the shared space's dimensionality.

Multimodal Encoders

CLIP

The dominant paradigm in multimodal encoders is to pretrain dual encoders on positive pairs (e.g. image-text, à la CLIP) and map their individual embeddings to a shared embedding space. The training loss pulls the individual embeddings of the two modalities in a pair closer together.

ImageBind

ImageBind presents a way of learning a joint embedding space across six different modalities using only pairs of data, rather than inputs from all six modalities. Unseen pairs of modalities can be inferred by the network, and emergent alignment between these modalities is achieved.

Pairing in Proof of Curiosity

Similarly, to guarantee that the network can be trained to understand similarities between modalities, the Proof of Curiosity mechanism rewards additional token incentives for the submission of pairs of different modality data.

The encoders align on two different embeddings, one for each element in the pair, using a differential loss-weighted average. Each embedding is averaged with the other with a modality-specific weighting, endowing cross-modality information to either embedding. To combat invalid pair submissions, if the pair of final embeddings are too far from each other, the pair is disqualified from being rewarded, and the weighted average is not applied. The final embeddings are produced together as output.

The system can be extended to support new modalities in future upgrades.

Delegated Stake

The network relies on stake and delegated stake for security and to deter Sybil attacks. Stake requires the various participants in the network to have "skin in the game".

Since the network is split in functionality between validators and encoders, the staking system is built to encourage a 50:50 ratio between validator and encoder stake.

Staking in Validators

Individuals that stake in their own validators or delegate to validators receive a proportional reward based on stake. As long as the validator does not act maliciously, there is minimal risk of losing stake. However, there are no performance-related bonuses, as every validator performs the same required work.

The operators of validators may charge a commission for the service of running a node. The node operator can charge any percentage from 0-100%. The node operators can also change their commission percentage at any point. It is the delegators' responsibility to perform due diligence while selecting validators in which to stake.

The staking system allows for anyone to stake tokens in any address, which creates a stake account for that address. If that address is operating a validator, rewards are distributed to the stake accounts for the given validator address. While it is possible to stake or unstake at any time, the changes do not go into effect until the next epoch change. During the interim period, the stake is ineligible for rewards.

A delegator can allocate their entire stake to validators without allocating to any encoders. This offers a low risk and predictable reward as long as the validator is well-behaved.

Staking in Encoders

Validators receive network rewards proportional to stake. However, encoder rewards and slashing are dependent on performance. The strong performers get richer and the low performers get slashed.

Staking in an encoder is therefore much higher risk, but it also offers higher rewards relative to simply staking in a validator. Additionally, staking in encoders has diminishing returns. This is because rewards are not received in proportion to total encoder stake, but rather in proportion to the encoder's performance. Staking in an encoder also increases its chance of being selected to produce embeddings. Encoders are slashed for poor performance or a lack of participation (liveness faults).

Virtual Stake

The rewards generated and dispersed by an encoder are split with the delegated stakers. Unlike validator stake, encoder rewards are distributed proportionally based on virtual stake, not the staked amount alone. To encourage staking in both validators and encoders, an account multiplier is applied based on the ratio of total stake in validators vs. total stake in encoders.

The account multiplier's max value is 2.0. Why 2? If the 50:50 ratio is maintained between validators and encoders, the delegator gets rewarded with a virtual stake worth the full amount of their locked tokens.

Multiplier Scoring Function

The scoring function is piecewise: the multiplier is linear up to a ratio of 1.0, then flat at y = 2.0.

def scoring_function(ratio):
    # Ratio of stake between validators and encoders; 1.0 corresponds to a 50:50 split.
    if ratio >= 1.0:
        return 2.0          # capped at the maximum multiplier
    else:
        return 2.0 * ratio  # linear ramp up to the balanced point

Multiplier Gradient

It is in the delegators' best interest to rebalance their positions to maintain a 50:50 ratio between validators and encoders. An added benefit is that slashing and account delegation can happen independently across both the validator and encoder systems with little impact on each other.

Minimum Stake

To further protect the system from Sybil attacks, the network requires a minimum amount of stake in order to join. A stake requirement of 100 tokens per validator and 10 tokens per encoder seems like a reasonable starting point. The minimum stake can be changed via governance if necessary.

Consensus

Network validators participate in consensus to agree on the embedding history, account balances, stake distribution, and the network participants.

Byzantine Fault Tolerance

Byzantine Fault Tolerant (BFT) consensus is a fundamental building block for distributed systems that allows a set of computers to agree on common values, even if some of the computers are faulty. In this context, we desire to reach consensus on the state of the shared embedding space, as well as the ordering of transactions to embed data and allocate rewards. We'd like to do this in a way that minimizes latency while maximizing network throughput for embedding vast amounts of data.

DAG-based Consensus

DAG-based consensus algorithms are a class of BFT consensus algorithms that are particularly well-suited for this task. Validators send messages containing blocks of transactions to each other, with each block containing references to previous blocks that it has seen. This creates a directed acyclic graph (DAG) of blocks, where each block is a vertex and each reference is a directed edge that acts as a vote towards proposed blocks. Each validator can look at its local view of the DAG, containing all the messages it has received, and decide which blocks are valid using the DAG structure alone. This allows the network to reach consensus on the state of the shared embedding space and the ordering of transactions without any additional communication overhead.

Our consensus mechanism is based on the Mysticeti protocol, a state-of-the-art DAG-based consensus algorithm designed for high throughput, fast transaction finality, and low CPU / network overhead. It can support up to 100k transactions per second with sub-second finality, even in the presence of faulty nodes.

Blocks

The consensus protocol operates in rounds. Within every round, a validator proposes a block of fresh transactions and sends it to the network. The other validators receive the block, validate its correctness, and vote on it by referencing it in their own block for the next round. Once a block contains references to at least 2𝑓 + 1 blocks from the previous round (where 𝑓 is the number of faulty nodes), the validator signs it and sends it to other validators. In this case, 2𝑓 + 1 validators represent at least 2/3 of the network.

DAG Patterns

Once blocks are received, the DAG structure must be interpreted to reach consensus on the ordering of blocks and their transactions.

We say a block B in round r is certified if at least 2𝑓 + 1 blocks in the following round r + 1 have B as a reference. A block in any subsequent round (i.e. r + 2) is a certificate for block B if it contains the blocks that certify block B in its history. Therefore, the soonest a block may be certified is 2 rounds after it is proposed.

We say a block B in round r is skipped if at least 2𝑓 + 1 blocks in the following round r + 1 do not have B as a reference. In all subsequent rounds, there is no way for a block to be a certificate for block B, therefore it may be safely skipped.

DAG Patterns

From these two patterns, we can derive conditions on which to commit a sequence of blocks from a node's local view of the DAG.

Block Ordering

In every round, validators are assigned to slots. If the DAG's structure allows validators to commit a block, blocks are ordered by slots within rounds. There are two ways in which a proposed block may be committed to the ordered sequence:

  1. Direct Decision Rule
  • A block B is committed if 2𝑓 + 1 certificates are observed for that block. In other words, there are 2𝑓 + 1 different blocks that are linked to 2𝑓 + 1 blocks that have block B as a reference.
  • A block B is skipped if it simply observes the skip pattern. In other words, 2𝑓 + 1 blocks in the following round do not have B as a reference.
  • A block B is undecided if it is neither committed nor skipped. It is ultimately decided using the indirect decision rule.


  2. Indirect Decision Rule
  • To decide an undecided block, we first search for an anchor. This is the first committed block in a round r' > r + 2.
  • If the anchor references a certificate to B, i.e. it references a block that references 2𝑓 + 1 blocks that reference B, we can commit the undecided block.
  • If the anchor does not reference a certificate to B, we must skip it.


A validator may incrementally process its local view of the DAG, committing blocks to the ordered sequence until it first encounters an undecided block. The transactions in the committed blocks are considered finalized, in the order in which they appear in the committed sequence. This allows the network to reach consensus on the ordering of transactions.
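As a toy illustration of the direct decision rule only, the sketch below classifies a block from two pre-computed counts: how many certificates it has, and how many round r + 1 blocks omit it. The function and names are illustrative rather than the protocol implementation; any block left undecided would then be resolved by the indirect rule.

from enum import Enum

class Decision(Enum):
    COMMIT = "commit"
    SKIP = "skip"
    UNDECIDED = "undecided"  # resolved later by the indirect decision rule

def direct_decision(certificates: int, non_referencing: int, n: int) -> Decision:
    # certificates: observed certificates for block B
    # non_referencing: round r + 1 blocks that do NOT reference B
    f = (n - 1) // 3
    threshold = 2 * f + 1
    if certificates >= threshold:
        return Decision.COMMIT
    if non_referencing >= threshold:
        return Decision.SKIP
    return Decision.UNDECIDED

# With 4 validators (f = 1): 3 certificates commit, 3 missing references skip.
assert direct_decision(certificates=3, non_referencing=0, n=4) is Decision.COMMIT
assert direct_decision(certificates=0, non_referencing=3, n=4) is Decision.SKIP
assert direct_decision(certificates=2, non_referencing=2, n=4) is Decision.UNDECIDED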

Epochs

Epochs are used to manage the adding and removing of network validators, as well as to manage the network's stake distribution and issue network incentive rewards.

Epoch Changes

Epoch changes start at a predefined commit: a fixed round number measured from the last committed round of the previous epoch. Once a validator has committed that round, it stops committing further blocks and tells the other validators that it is ready to change epochs. Other validators stop committing blocks upon receiving this message.

Once a validator has received 2f + 1 such messages from other validators, it will perform the following steps (a sketch follows the list):

  1. Increment its epoch number.
  2. Process reconfiguration of the validator set.
  3. Calculate slashing amounts for validators and encoders.
  4. Calculate and distribute rewards to validators, encoders, delegators, and Proof of Curiosity winners.
  5. Start committing blocks again in the new epoch.
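A minimal sketch of this epoch-change sequence, assuming a simple in-memory validator state; the helper methods stand in for the numbered steps above and are not actual network APIs.

# Sketch of the epoch change triggered by 2f + 1 "ready" messages.
class ValidatorState:
    def __init__(self, n_validators: int):
        self.n = n_validators
        self.epoch = 0
        self.committing = True
        self.ready_peers = set()

    def quorum(self) -> int:
        f = (self.n - 1) // 3
        return 2 * f + 1

    def on_ready_message(self, peer_id: int) -> None:
        self.ready_peers.add(peer_id)
        if len(self.ready_peers) >= self.quorum():
            self.change_epoch()

    def change_epoch(self) -> None:
        self.committing = False
        self.epoch += 1                  # 1. increment epoch number
        self.process_reconfiguration()   # 2. apply pending validator changes
        self.apply_slashing()            # 3. slash misbehaving participants
        self.distribute_rewards()        # 4. pay validators, encoders, delegators
        self.ready_peers.clear()
        self.committing = True           # 5. resume committing blocks

    # Placeholder steps; the real logic lives in the protocol implementation.
    def process_reconfiguration(self): pass
    def apply_slashing(self): pass
    def distribute_rewards(self): pass

state = ValidatorState(n_validators=4)
for peer in (1, 2, 3):
    state.on_ready_message(peer)
assert state.epoch == 1 and state.committing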

Reconfiguration

Adding and removing validators is done through a reconfiguration process at the end of each epoch. During an epoch, candidates may submit a transaction to add or remove themselves from the validator set. In the case of an addition request, if the candidate meets the minimum stake requirement, they are added to a list of pending additions.

Note that changing an epoch at a predefined commit guarantees that all validators will be synced to the same state before the epoch change. Since pending validator additions and removals are transactions within blocks, this ensures that validators will have the same view of the new epoch's committee.

Once the epoch change is complete, the new validator set is finalized and the new epoch begins.

State Sync

New or recovering validators need a way to synchronize their state with the rest of the network.

Snapshot Sync

By default, validators sync up with the network using snapshots. Snapshots represent the network's current state, attested by nodes holding 2f+1 of the network's stake. To synchronize a node, all that must be done is to download the state snapshot, verify the integrity of the data, and load it into local storage. Afterwards, the node can catch up on any new transactions by acquiring the intermediate state changes since the snapshot's creation.

Generating Snapshots

Chunking State

The network state is composed of user balances and embedding related data. Both can be chunked intelligently to minimize redundant work.

Given that Soma does not support arbitrary smart contracts, user state related to balances remains small, with very controlled bloat. State simply scales linearly with the number of accounts.

To further improve chunking performance, accounts can be grouped so that state that changes frequently ends up in the same chunks. By intelligently grouping state, validators minimize the hashing and chunk processing required for each incremental update to the state snapshot.

The embedding data can grow to be extremely large in size given the high dimensional space of embeddings and frequency of generation. To deal with this, the network may apply a pruning algorithm that prunes based on time, position in embedding space, or a hybrid of both.

Again, to minimize the work required between state snapshots, the validators can intelligently group state in such a way that chunks of embeddings that are far from being pruned do not need to be reprocessed every time that new embeddings are added or pruned.

Chunking state allows a node that is attempting to synchronize with the network to download and process the chunks of state in parallel. Additionally, given the intelligent chunking algorithm, a node that is recovering only needs to download the chunks that have changed.
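The following sketch illustrates the idea under simplifying assumptions: accounts are grouped into chunks by update frequency, each chunk is hashed independently, and a balance change only invalidates the hash of the chunk it lives in. The chunk size, grouping key, and function names are illustrative.

import hashlib

def chunk_accounts(balances: dict[str, int],
                   update_counts: dict[str, int],
                   chunk_size: int = 2) -> list[list[tuple[str, int]]]:
    # Hot accounts (frequently updated) end up together, cold accounts together.
    ordered = sorted(balances, key=lambda a: update_counts.get(a, 0), reverse=True)
    return [[(a, balances[a]) for a in ordered[i:i + chunk_size]]
            for i in range(0, len(ordered), chunk_size)]

def chunk_hash(chunk: list[tuple[str, int]]) -> str:
    payload = b"".join(f"{account}:{value}".encode() for account, value in sorted(chunk))
    return hashlib.sha256(payload).hexdigest()

balances = {"alice": 50, "bob": 10, "carol": 7, "dave": 3}
updates = {"alice": 120, "bob": 95, "carol": 2, "dave": 1}
hashes = [chunk_hash(c) for c in chunk_accounts(balances, updates)]

# A balance change in a "hot" chunk leaves the "cold" chunk's hash untouched.
balances["alice"] += 1
new_hashes = [chunk_hash(c) for c in chunk_accounts(balances, updates)]
assert hashes[0] != new_hashes[0] and hashes[1] == new_hashes[1]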

Cuckoo Filters

More traditional snapshot approaches (e.g., Merkle proofs) introduce an attack vector: malicious peers can significantly degrade performance by supplying invalid chunks to syncing nodes. Moreover, the traditional process of creating Merkle-proof snapshots causes performance hiccups, since the node must search the tree, serialize all nodes, build chunks, and save them to local storage.

To efficiently verify snapshots, the network takes inspiration from A Bloom Filter Based Scalable Data Integrity Check Tool for Large-Scale Dataset, using a filter to verify the checksums of chunks in parallel with minimal memory usage. Bloom filters are probabilistic data structures used to quickly and efficiently check whether an element is in a set.

Cuckoo filters are similar to Bloom filters, with the additional ability to delete items from the set. By leveraging a Cuckoo filter instead of a Bloom filter, it is possible to update snapshots incrementally with high efficiency.

After passing all the hashes of state chunks through a Cuckoo filter, the bitmap of the filter can be hashed, giving us a verifiable Cuckoo filter. A validator that is syncing to a snapshot would only need to:

  1. Download the corresponding Cuckoo filter and verify the hash
  2. Process every incoming chunk using a parallel work stealing pattern

This is far more efficient than waiting for all chunks to be downloaded before verifying and loading them into local storage.

To update the Cuckoo filter, nodes simply delete the hashes of old state chunks from the filter and add the new state chunk hashes. This allows for extremely efficient incremental updates to state snapshots.
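The sketch below is a from-scratch toy Cuckoo filter (partial-key cuckoo hashing with 16-bit fingerprints) showing how state-chunk hashes can be inserted, checked, deleted for incremental updates, and summarized into a verifiable digest. The filter parameters, hashing scheme, and digest format are illustrative and not the network's actual filter; the digest here hashes the bucket contents directly rather than a raw bitmap.

import hashlib
import random

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class CuckooFilter:
    def __init__(self, num_buckets: int = 1 << 12, bucket_size: int = 4,
                 max_kicks: int = 500):
        assert num_buckets & (num_buckets - 1) == 0, "num_buckets must be a power of two"
        self.mask = num_buckets - 1
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item: bytes) -> int:
        return (_h(b"fp" + item) & 0xFFFF) or 1  # 16-bit, never zero

    def _indices(self, item: bytes, fp: int):
        i1 = _h(item) & self.mask
        i2 = (i1 ^ _h(fp.to_bytes(2, "big"))) & self.mask
        return i1, i2

    def insert(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):  # evict a resident fingerprint and relocate it
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ _h(fp.to_bytes(2, "big"))) & self.mask
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full

    def contains(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

    def digest(self) -> str:
        # Hash the filter contents, yielding a verifiable filter digest.
        h = hashlib.sha256()
        for index, bucket in enumerate(self.buckets):
            for fp in sorted(bucket):
                h.update(index.to_bytes(4, "big") + fp.to_bytes(2, "big"))
        return h.hexdigest()

# Incremental update: drop the hash of a stale chunk, add the replacement.
cf = CuckooFilter()
old_chunk_hash = hashlib.sha256(b"chunk-epoch-9").digest()
new_chunk_hash = hashlib.sha256(b"chunk-epoch-10").digest()
cf.insert(old_chunk_hash)
before = cf.digest()
cf.delete(old_chunk_hash)
cf.insert(new_chunk_hash)
assert cf.contains(new_chunk_hash)
assert not cf.contains(old_chunk_hash)  # true except for a tiny false-positive chance
assert cf.digest() != before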

False Positive Rate

Cuckoo and Bloom filters have false positive rates. The probability of a false positive is controllable, but the risk is non-zero.

To mitigate false positives, the network employs a more traditional cryptographic proof that is computed once the entire snapshot has been downloaded by a syncing node. This allows the chunks to still be processed and loaded into the local database in parallel with an extremely high probability of being valid. However, to reach 100% certainty on the integrity of the data, the final checksum can be computed after all chunks have been downloaded.

The final hash is as simple as concatenating the individual chunk hashes together in a deterministic way, e.g., hash(sort(chunk_1_hash, chunk_2_hash, chunk_3_hash, ...)). There is no need for redundant hashing, given that all the hashes were already computed in earlier steps.
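For concreteness, a stdlib-only version of that final checksum over the precomputed chunk hashes (the helper name is hypothetical):

import hashlib

def final_checksum(chunk_hashes: list[str]) -> str:
    # Deterministic: sort the precomputed hex chunk hashes, concatenate, hash once.
    return hashlib.sha256("".join(sorted(chunk_hashes)).encode()).hexdigest()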

If malicious chunks become extremely problematic, this could be converted to a Merkle tree to efficiently determine which chunk is incorrect. However, it is likely sufficient to simply increase the size of the Cuckoo filter, which makes it substantially more difficult for a malicious chunk to pass as a false positive.

The Cuckoo filter gives us a high probability (99.999%) of filtering out malicious or corrupted state chunks. The final hash allows us to assure 100% integrity of the synced data.

Syncing Using Snapshots

To sync a new validator:

  1. The operator finds the latest snapshot using peer-to-peer gossip, initiated with a set of trusted entrypoint nodes.
  2. The validator gathers the required proof that a snapshot has been attested by 2f+1 of the stake in the network.
  3. After verification, the validator downloads the snapshot's corresponding Cuckoo filter and checks the hash of the filter's bitmap against the hash in the snapshot.
  4. Using the Cuckoo filter to check incoming chunks, the validator efficiently downloads and processes chunks in parallel until all chunks have been downloaded.
  5. The hashes from each chunk are then concatenated together in a deterministic order and hashed to produce the final checksum.

After successfully loading the snapshot, the validator then downloads the additional state diffs from subsequent rounds that happened after the snapshot was taken.

Future Research

It may be possible to snapshot state efficiently enough that a snapshot can be computed on a round-by-round basis rather than only at epoch changes. If so, a syncing node could download only the state that has changed since its last snapshot and apply it as a delta, with deltas computed cheaply across arbitrary round ranges.

e.g.

(New round 100 state chunk) - (old round 90 state chunk) = (10 round state delta)

(10 round state delta) + (old round 90 state chunk) = (New round 100 state chunk)

Archival Sync

Archival sync is a specialized sync that downloads the entire history of transactions and embeddings. In addition to storing this information, archival nodes organize and serve the archive in an efficient manner for others to download. Archival nodes attempt to sync and verify records in a P2P fashion, but may also run in a faster mode that downloads the entire archive from a storage bucket (e.g., S3 or GCS) and only samples P2P nodes to spot-check the historical record.

Data Sync

Supported Signatures

The network supports the following signing schemes. Abstracting the signing method from transaction types allows new cryptographic methods, such as quantum-secure signing, to be added in the future.

Ed25519

Ed25519 is an elliptic curve signing algorithm using EdDSA and Curve25519. It is known for its speed and efficiency. This scheme provides strong security guarantees and is widely used due to its robust performance.
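For reference, signing and verifying with Ed25519 using the third-party pyca/cryptography package; this only illustrates the scheme itself and is not the network's client API.

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

message = b"transfer 10 tokens"
signature = private_key.sign(message)

try:
    public_key.verify(signature, message)  # raises InvalidSignature on failure
    print("signature valid")
except InvalidSignature:
    print("signature invalid")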

Multi-sig

Transfers that verify the supplied signatures against the corresponding multi-sig account. N-of-M signatures are required, with a maximum of 32 signatures. The signature scheme is the same as for simple transfers. Multi-sig accounts enhance security by requiring multiple approvals for transactions, making unauthorized transfers more difficult.

BLS Threshold-signature

Transfers that verify a BLS aggregate signature for a corresponding account. BLS signatures allow multiple parties to produce a single signature, which can be verified against the aggregate public key. This scheme is particularly useful for threshold cryptography and decentralized applications.

Transaction Types

Soma does not support generalized smart contracts. Instead, the network opts for efficient transaction types built directly into the network functionality. These transaction types can be extended through decentralized governance.

Transfers

Debit Soma tokens from one address and credit them to another address. Transfers are the fundamental transaction type for moving tokens within the network.

Simple Transfer

A simple peer-to-peer transfer that verifies the transaction signature against the public key of the sender. This is the most straightforward way to transfer tokens between two parties securely.

Timed Conditional Transfers

Standard Soma accounts are controlled by a single signing key. Although that signing key can require multiple signatures (e.g., multi-sig or aggregate BLS), the underlying account is controlled by one key. Certain special accounts can be controlled by two parties. Once either the time condition is met or the other condition is fulfilled, the account self-destructs and credits the original owner's account with any remaining tokens.

Hashed Timelock Contract

The initiating party supplies:

  1. Time to live (TTL) for the conditional account
  2. Amount of tokens to supply
  3. A hash pre-image that unlocks the total amount of tokens
  4. The recipient of the tokens

During the TTL period, anyone can submit the hash pre-image to initiate the conditional transfer to the recipient. If the TTL expires, the account self-destructs, and the tokens are transferred back to the account creator. This mechanism ensures secure and conditional fund transfers based on predetermined criteria.

Any signing scheme can be used. Any transfers made to the account while it is alive are added to the amount transferred to the recipient. Since the network directly controls these accounts and reserves a custom namespace, they cannot be transferred to after self-destruction.
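A toy model of the flow described above, with the account represented as a simple in-memory object; rounds stand in for wall-clock time, and the class and method names are illustrative rather than the network's transaction format.

import hashlib

class HashedTimelockAccount:
    # Funds release to the recipient when the correct pre-image arrives before
    # the TTL; otherwise they return to the account creator on expiry.
    def __init__(self, creator: str, recipient: str, amount: int,
                 hash_lock: bytes, ttl_round: int):
        self.creator = creator
        self.recipient = recipient
        self.amount = amount          # grows if more funds arrive while alive
        self.hash_lock = hash_lock
        self.ttl_round = ttl_round
        self.closed = False

    def deposit(self, amount: int) -> None:
        assert not self.closed
        self.amount += amount

    def claim(self, pre_image: bytes, current_round: int) -> str:
        assert not self.closed and current_round <= self.ttl_round
        if hashlib.sha256(pre_image).digest() == self.hash_lock:
            self.closed = True
            return self.recipient      # full balance goes to the recipient
        raise ValueError("wrong pre-image")

    def expire(self, current_round: int) -> str:
        assert not self.closed and current_round > self.ttl_round
        self.closed = True
        return self.creator            # refund on expiry

secret = b"soma-secret"
htlc = HashedTimelockAccount("alice", "bob", 100,
                             hashlib.sha256(secret).digest(), ttl_round=50)
assert htlc.claim(secret, current_round=10) == "bob"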

Unidirectional Payment Channel

The initiating party supplies:

  1. Time to live (TTL) for the conditional account
  2. Amount of tokens to supply
  3. The recipient of the tokens

During the TTL period, anyone can submit an amount to send along with the corresponding signature(s). The signature(s) must be valid, and the amount must be monotonically increasing. Ordering does not matter because the outcome is monotonically increasing: if a duplicate is submitted, no additional funds are sent, and if a higher amount arrives before a lower amount, the lower amount is simply invalid. The payment channel allows incremental streaming of funds off-chain, and any signing scheme can be used. After the TTL expires, all remaining tokens are returned to the account creator and the account self-destructs. The account can be transferred to while alive, which increments the total amount; after the TTL expires, it can no longer be transferred to.
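A sketch of the monotonic-amount rule only, with signature verification stubbed out as a boolean; the names and types are illustrative.

class PaymentChannel:
    def __init__(self, deposit: int, ttl_round: int):
        self.deposit = deposit
        self.ttl_round = ttl_round
        self.best_claim = 0           # highest validly signed amount seen so far

    def submit(self, amount: int, signature_valid: bool, current_round: int) -> bool:
        if current_round > self.ttl_round or not signature_valid:
            return False
        if amount <= self.best_claim or amount > self.deposit:
            return False              # duplicates and lower amounts are ignored
        self.best_claim = amount
        return True

    def settle(self, current_round: int) -> tuple[int, int]:
        # After the TTL: (paid to recipient, refunded to sender).
        assert current_round > self.ttl_round
        return self.best_claim, self.deposit - self.best_claim

ch = PaymentChannel(deposit=100, ttl_round=50)
ch.submit(30, True, current_round=5)
ch.submit(10, True, current_round=6)   # lower amount arriving late is ignored
ch.submit(60, True, current_round=7)
assert ch.settle(current_round=51) == (60, 40)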

Data and Intelligence

The core use case of the network. Data and intelligence transactions facilitate the creation, management, and utilization of data within the network.

Create Data Orders

Unlike generating embeddings, data orders are NOT real-time but function more like an order book for data. Users can place bounties for data that meet specific constraints related to the embedding space, modality, and quantity. This allows for efficient matching of data suppliers and requesters based on predefined criteria.

Generate Embeddings

This transaction takes the sender's supplied data, along with the corresponding cost of generating embeddings, and returns the finalized embeddings from the network. Embeddings represent data in a way that is optimized for machine learning and other computational tasks.

Stake Management

Stake management transactions enable control over staking in both validators and encoders. Staking is essential for securing the network and incentivizing participants.

Delegate Stake to Validator or Encoder

Delegate stake to a validator or encoder. Additional delegations will update the existing stake account and increase the quantity. This process supports the stability and performance of the network by empowering reliable validators and encoders.

Remove Delegation from Validator or Encoder

Remove delegation allows for removing a specified amount from the validator or encoder. Additional calls will continue to decrease the amount delegated. This provides flexibility for stakeholders to adjust their investments according to network performance and personal strategies.

Network Management

Controls joining or leaving for validators and encoders. Effective network management ensures the robustness and decentralization of the network.

Register Validator

Adds a validator to the network. The validator must meet minimum stake requirements. This process ensures that only qualified and committed entities participate in validating transactions.

Unregister Validator

Removes a validator voluntarily. Going below the minimum stake or acting maliciously results in hard and soft bans, respectively. This mechanism maintains the integrity and security of the network.

Register Encoder

Adds an encoder to the network. The encoder must meet minimum stake requirements. Encoders contribute to the processing and management of data, enhancing the network's functionality.

Unregister Encoder

Removes an encoder voluntarily. Going below the minimum stake automatically removes the encoder. This ensures that only active and compliant encoders are part of the network.

Governance Parameters

Security Budget

0.00000000001%

The security budget is the probability used by a hypergeometric distribution to estimate encoder shard sizes. Read more on shard selection.
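One illustrative reading of this parameter (not necessarily the exact protocol formula): choose the smallest shard size k for which the hypergeometric probability of sampling a dishonest-majority shard from the encoder set falls below the security budget. A stdlib-only sketch, with the population size and adversarial fraction chosen arbitrarily for the example:

from math import comb

def p_dishonest_majority(N: int, D: int, k: int) -> float:
    # P[X > k/2] where X ~ Hypergeometric(N, D, k):
    # N encoders total, D dishonest, k sampled without replacement.
    total = comb(N, k)
    bad = sum(comb(D, x) * comb(N - D, k - x)
              for x in range(k // 2 + 1, min(D, k) + 1))
    return bad / total

def min_shard_size(N: int, D: int, budget: float) -> int:
    for k in range(1, N + 1):
        if p_dishonest_majority(N, D, k) < budget:
            return k
    raise ValueError("no shard size meets the budget")

# Example: 1,000 encoders, 100 adversarial, budget 0.00000000001% = 1e-13.
print(min_shard_size(N=1000, D=100, budget=1e-13))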

Minimum Stake

Validators: 100 Tokens

Encoders: 10 Tokens

The minimum stake required to participate in the network; it helps mitigate Sybil attacks. Read more about how these values were derived in: delegated proof of stake.

ML Unknowns

Experiment Design

In designing ML experiments with our centralized game server, we have a number of ML-specific unknowns to consider. To address them, we will need to design experiments that vary the following factors:

  • Modality(ies) of input data
  • Model architectures
  • Loss functions
  • Masking strategies
  • Online training dataset
  • Competitor count
  • Pretraining methods / datasets (if any)
  • Benchmark assessment

For the benchmark assessment in particular, it is important for us to consider different ways of evaluating the quality of embeddings produced by our network in comparison to centralized models. It may also require us to develop our own benchmark to account for the unique properties of our network.

Questions

In addition to these experiment parameters, we have several open questions:

  1. How manipulable is differential loss as a performance metric, and how does this depend on the complexity of the masking strategy?
    1. Does outlier filtering help catch cheaters?
  2. How does tokenization for input data work in practice?
    1. How does it differ across modalities?
    2. Is it sufficient to use a client-side method or does something need to be enforced at the protocol level?
  3. Does our method of multimodal embedding production work in practice?
    1. Will people need to run encoders for multiple modalities simultaneously, or is there a way to allow single-modality encoders to contribute to multimodal embeddings?
  4. Does differential loss help or hurt the quality of embeddings after pretraining on respective benchmarks?
  5. How will people play the Proof of Curiosity game in practice?
    1. What techniques will they use?
    2. Do these techniques result in "good" data submissions?
  6. How does a model / set of models trained or participating in our system perform on other ML benchmarks?
  7. How does data of different formats / sizes get processed by the network?
    1. How does batching data work to utilize the network efficiently?

Implementation Unknowns

  1. Tokenization
    1. Whether we need it at the protocol level for masking
    2. If we do, what should we use for text, images, video, audio?
  2. Masking Ratios for Modalities
    1. What are the optimal masking ratios for different modalities?
  3. Sharding
    1. How many nodes in a shard?
  4. Market of Experts
    1. Will people actually use this if it's not required?
  5. Tokenomics
    1. What control mechanism will we use for controlling the burn fee?
    2. Rewards
      1. Specific rewards amounts for embedders, validators, and data submissions
    3. Slashing
      1. Validator slashing mechanics
      2. Encoder slashing formula
  6. Multimodality
    1. What is the embedding space's dimensionality?
    2. Does our mechanism for multimodality work?
  7. Consensus
    1. Do we need leaders for Mysticeti?
    2. What state needs to be synced and how do we implement it?
  8. How normal people will use the network
    1. Obtaining the embeddings with requested dimensions (Matryoshka)
  9. How will nodes validate the Proof of Curiosity game?
    1. Vector storage for nearest neighbor search
  10. k in Proof of Curiosity
    1. What is the optimal value for k?

  1. How does negotiating the price of embeddings work?
  2. Who calls the encoders from the validator nodes?
  3. How do batches of data work?
  4. How do we handle different data formats and resizing of data?
  5. How can we fast-sync efficiently by checkpointing state rather than using SPVs and Merkle trees?
  6. Practically, how do we store balances?
  7. How do we update state?