Risks of Network Partitioning When a Split Brain Creates a Security Flaw

I'm looking to create a high-availability, scalable networking solution by using a distributed system of data. A node here, describes a network that has control over one copy of the data. These nodes might contain more than one machine but has one copy of the data.

The nodes will contain data records which can be in a spent state or an unspent state. A client can request a transition for a record to go from an unspent state to a spent state (a request to spend). There is a security risk if they can successfully do this more than once.

A single node, if it has a connection to all other nodes, can tell the nodes that a spend has been requested and can ensure no other nodes want to access the data and that the spend has not occurred already. The node can change the state of the data to spent and other nodes will not do this since they know one of the nodes is updating it and processing the spend. All nodes will change the data, so the record is in the spent state.

If a node cannot reach another node, it can assume the other node is down and will continue operating with the other nodes until the other node comes back up. In this case the node will send all updates to the node that came back up. If this failed node was in the middle of a spend operation that was incomplete at the time, it can complete it then. This would cause minor downtime for some operations. This would be in the case where a node tells the other nodes it will spend and then fails before it can complete the spend process. In this case the other nodes are blocked from updating it so the failed node needs to come back online before it can be completed.

The problem is, the processing for the spend can only happen once. If the network was partitioned, an attacker knowing this could request the spend on one partition and also on the other. Each partition of the network would assume the other to be down and so would operate independently. This could cause the spend to be processed more than once.

This would not be a problem if the request to the two sides of the network was not being made during the time they were partitioned. The network would become eventually consistent when the connections are re-established. If an attack was successful, the nodes would learn about the attack when they re-establish connections because two sides of the network would announce the same change.

So it is detectable attack but is it practically possible?

An attacker would need to be deliberately trying to do this. The software is not designed to make several spend requests at once. There is a time cost to the attack. If the attacker fails, it will take time before they can recreate an unspent record. Creating unspent records requires money. And more money will need to be used in a single attack to get a higher benefit. The reason there is a time cost, is that it would take time to receive the money back to try again. They could afford many smaller attacks and then the benefit to them would be less and the damaged caused, less too.

Surely partitions are so rare naturally, that an attacker would have to be ridiculously lucky to win, if attempting attacks at any time?

If a connection is naturally lost, a node can halt all operations and try a reconnection. Using a low timeout for the connection to the node means it doesn't have to cause any downtime (Perhaps only rare increased latency). If the reconnection fails then it will continue trying but then restart operations (assuming the node is down). Would something like that protect against occasional connection errors?

So would an attacker be able to detect/cause a partition in the network? How likely is it that partitions will occur and for how long? What ways can issues be resolved if possible?

Thank you.


Solution 1:

Having dealt with similar issues in Clustering scenarios, I'm familiar with the situation you describe. Such systems frequently have the concept of a quorum, which is why such systems require an odd number of member nodes. The quorum is used to determine the majority and minority partitions.

The quorum is the number, greater than half, that defines what is the minimum number of available nodes that needs to be present to provide services. If a network partition happens only one partition will have quorum, the other stop services until the partition goes away. If a multiple partition event happens it can lead to no services being provided at all. However, it does guarantee only-one node is serving, and that's how consistency is provided.

As for the likelihood of a partition, that depends on your infrastructure and how your nodes are communicating availability state to each other.

As for their ability to detect a partition event, that depends on your code. The main thing that would make such an attack possible is if both partitions are independently addressable during a partition, which may not be the case. In my experience, network partitions frequently exclude end-users from one partition as well as the other nodes. If the partitions are not addressable, then this attack is a lot less likely to succeed.

Solution 2:

Distributed storage works best with single origin of data being replicated every n seconds, using e.g. SQL index and using replication rules, were to push it. Also central memory "SQL" to control the states.

So simply, when you change object state, this is being communicated to the origin node, and the transaction is performed in SQL with the lock on the record.

If node cannot reach the origin at the time, the operation must fail, as the origin state is only on origin server.

This is like origin-edge workflow, were origin has "memory" - states, and edge "content" - objects.

Is it theoretically impossible to bypass the above model of the edge and central memory while preserving the security and do it in a simple manner. The above model is the most efficient and most correct, and fuzzing it just makes life difficult.