Understanding and selecting databases in NoSQL DBMS [Part2] - CAP theorem
In Part 1, I analyzed the types of NoSQL DBMS. In this Part 2, I will move on to the CAP theorem, the theorem people often refer to when talking about NoSQL DBMS and an important criterion for selecting a database.
But first, I will reveal a fact: the CAP theorem is associated with distributed systems and distributed data stores.
You may be thinking: I often hear that NoSQL DBMS is associated with this theorem, so what does it have to do with distributed systems?
The reason is that NoSQL DBMS is built to scale horizontally, which means that the database contains replicas and the data itself is also split into many pieces (horizontal partitioning), just like a distributed system. I will write separately about the structure of NoSQL DBMS so you can understand it better. For this article, you should first know a little about replication and distributed systems (I will write about them later) to easily follow what I'm about to say!
The 3 guarantees
Consistency
It can also be considered "linearizability".
This is the state in which the data on all nodes is identical. Each "node" here can be understood as a "slave". For example, as shown below:
Suppose the football match between Germany and Argentina has just ended with a score of 1-0 in favor of Germany. The referee executes a statement that inserts the result of this match into the database.
After the insert succeeds on the Master (Leader in the picture), the slaves change as well. However, the slaves are not all updated at the same moment; each one lags by a different amount. This depends on network latency, network bandwidth, the amount of data transferred, the number of messages in the queue, the number of threads, and other configuration of the slaves.
As a result, at a given moment Slave 1 (Follower 1 in the picture) has the updated data, but Slave 2 (Follower 2 in the picture) does not. If Alice and Bob happen to query these 2 slaves, they get different results. The consequences depend on the system and the business. In this case it is not very serious, and a small delay like this is acceptable. It is like watching live football in your room: Ronaldo has only just dribbled the ball to the middle of the field, but you already hear the cheers and howls from the pub, because your stream is delayed by 30 seconds or more.
But be careful with systems such as government systems handling emergency legislation, healthcare systems that rely on remote medical imaging, education systems, and live streaming. They require very short delays.
So: at the same moment, two nodes give two different, inconsistent results, and the system does not satisfy consistency (C) in the CAP theorem.
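To make the Germany vs Argentina example concrete, here is a minimal sketch in Python of a leader that queues changes for its followers instead of applying them instantly. The class names, the `GER-ARG` key, and the `replicate_to` method are my own assumptions for illustration; no real DBMS exposes replication this way.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One database node holding a simple in-memory key-value store."""
    name: str
    data: dict = field(default_factory=dict)

    def read(self, key):
        return self.data.get(key)


@dataclass
class Leader(Node):
    """Leader that queues each change per follower (hypothetical model of replication lag)."""
    pending: dict = field(default_factory=dict)  # follower name -> list of (key, value)

    def write(self, key, value, followers):
        self.data[key] = value
        for follower in followers:
            self.pending.setdefault(follower.name, []).append((key, value))

    def replicate_to(self, follower):
        """Deliver every queued change to one follower (the change 'arrives')."""
        for key, value in self.pending.pop(follower.name, []):
            follower.data[key] = value


leader = Leader("Leader")
follower1, follower2 = Node("Follower 1"), Node("Follower 2")

leader.write("GER-ARG", "1-0", [follower1, follower2])
leader.replicate_to(follower1)                # Follower 2 is still lagging

alice_sees = follower1.read("GER-ARG")        # Alice reads the new score
bob_sees = follower2.read("GER-ARG")          # Bob reads stale data (nothing yet)

leader.replicate_to(follower2)                # the change finally arrives
bob_sees_later = follower2.read("GER-ARG")
```

Between the two `replicate_to` calls, Alice and Bob get different answers for the same key, which is exactly the moment when C is not satisfied.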
Availability
Every client gets a response, regardless of whether some nodes in the system have problems.
For example, my system has 3 replicas and 1 replica has a problem. When a client sends a request to the server, it still gets a response, because the other replicas are still there to handle it. This is also one of the biggest goals and benefits of replication.
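The 3-replica example can be sketched as a simple failover loop. This is only an illustration under my own assumptions: the `Replica` class, the `healthy` flag, and `read_with_failover` are hypothetical names, and real drivers usually do this routing for you.

```python
class ReplicaDown(Exception):
    """Raised when a replica cannot serve the request."""


class Replica:
    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = data
        self.healthy = healthy  # hypothetical health flag

    def read(self, key):
        if not self.healthy:
            raise ReplicaDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try replicas in order; return (replica name, value) from the first that answers."""
    for replica in replicas:
        try:
            return replica.name, replica.read(key)
        except ReplicaDown:
            continue  # this replica has a problem, try the next one
    raise ReplicaDown("no replica available")


score = {"GER-ARG": "1-0"}
replicas = [
    Replica("replica-1", dict(score), healthy=False),  # the replica with a problem
    Replica("replica-2", dict(score)),
    Replica("replica-3", dict(score)),
]

served_by, value = read_with_failover(replicas, "GER-ARG")
```

The client still gets its answer from `replica-2`, even though `replica-1` is down.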
Partition tolerance
The system continues to work (read/write) even if some nodes in the system have problems.
For example, my system has 3 replicas.
One replica has a network failure and cannot communicate with the other replicas, but it can still read/write normally, and so can the others.
Another possible scenario is that a node encounters an error such as the database server crashing or being reset, a hardware problem, an application failure, a virus infection, or running out of disk space. This node stops working, but the other nodes still work, and so does the system.
You can only have 2 out of 3 guarantees
Surely you all know that under the CAP theorem you can get only 2 out of 3 properties. But is there really no way to have all 3? Let's see!
- First, suppose we choose P: our system has many nodes, and when one node can't communicate, the other nodes can still be used, so the system keeps working.
- Second, let's choose A! Fine, when a node has a problem, our system can still respond to the client, right? So now we have AP.
- Then, we want C as well! When a node cannot communicate with the other nodes but still reads/writes normally, the data across the nodes is no longer consistent. How can the nodes be consistent when they can't communicate, can't send and receive each other's new data? So we don't get C.
- Going back, suppose we start with C, then choose A or P:
  - Choose P: when a node cannot communicate with another node but we still want the data on all nodes to be consistent to satisfy C, the only way is to suspend reads/writes across the whole system, because any new write accepted now could not be made consistent right away. So we get CP, but while this lasts the system is not available (A). So we only have CP.
  - Choose A: if we still want C and also A, our system can have only 1 node. With a single node there is nothing to keep in sync, so availability (A) comes along with consistency. But a system with only 1 node does not satisfy P. So we only have CA.
Therefore, we cannot simultaneously satisfy all three properties of the CAP theorem on one system, unless the CAP theorem changes or we understand it in a different way.
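The reasoning above can be sketched with a toy two-node store. This is a deliberately simplified model under my own assumptions: the `mode` setting, the `partitioned` flag, and all names are hypothetical, and real CP systems usually keep the majority side running instead of stopping everything.

```python
class Unavailable(Exception):
    """Raised by a CP node that refuses to serve during a partition."""


class Node:
    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (hypothetical setting)
        self.value = None
        self.peer = None
        self.partitioned = False    # True when the link to the peer is down

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                raise Unavailable("cannot reach peer, refusing write")
            self.value = value      # AP: accept locally, the peer is now stale
            return
        self.value = value
        self.peer.value = value     # healthy link: replicate synchronously

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot reach peer, refusing read")
        return self.value


def make_pair(mode):
    a, b = Node(mode), Node(mode)
    a.peer, b.peer = b, a
    return a, b


def partition(a, b):
    a.partitioned = b.partitioned = True


# AP: stays available during the partition, but the two nodes disagree.
a, b = make_pair("AP")
a.write("0-0")
partition(a, b)
a.write("1-0")
ap_reads = (a.read(), b.read())

# CP: stays consistent during the partition, but rejects the request.
c, d = make_pair("CP")
c.write("0-0")
partition(c, d)
try:
    c.write("1-0")
    cp_write_accepted = True
except Unavailable:
    cp_write_accepted = False
```

In AP mode the two reads differ, so C is lost. In CP mode both nodes still hold the old value, but the write is refused, so A is lost.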
Discussing the CAP theorem
a) Comparison with Consistency in the ACID properties
Is the consistency in the CAP theorem the same as the consistency in the ACID properties?
No, absolutely not.
In the CAP theorem, consistency means that every node returns the same, most recent data. In the ACID properties, consistency means that when a transaction completes, the data must still satisfy all the rules (constraints) that held before the transaction executed, so the data stays valid.
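To see what ACID consistency looks like on a single node, here is a small sketch using Python's built-in `sqlite3` module. The `account` table, the `balance >= 0` rule, and the `transfer` helper are my own illustrative assumptions. Note that there is no replica anywhere, so this has nothing to do with CAP consistency.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"  # the rule that must always hold
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction; return True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK rule was violated, nothing was changed


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)        # valid: alice 70, bob 80
failed = transfer(conn, "alice", "bob", 500)   # would make alice negative
final = balances(conn)
```

The second transfer credits bob first and then fails on alice, but the rollback undoes the credit as well, so the data moves only from one valid state to another.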
b) Does Consistency in the CAP theorem mean being consistent 100% of the time? Are all nodes consistent all the time?
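In the strict definition, yes: C in the theorem means linearizability. In practice, however, systems that give up C usually settle for eventual consistency instead: nodes may differ for a while, but they converge once the updates have propagated.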
You can check again: the CAP theorem does not say a word about latency.
Also, a CAP-available system can be arbitrarily slow to respond and still be called available. That is the theory, though. When we go to a website and it takes 1-2 minutes to load the page, do we call it "available"?
2. CAP theorem and ACID properties
Referring to these two concepts, many people will immediately say:
"NoSQL DBMS is associated with CAP theorem, and RDBMS is associated with ACID properties."
You can see many such statements online.
But you need to know that the CAP theorem is actually associated with distributed systems, and the ACID properties are associated with transactions.
NoSQL DBMS has a structure like a distributed system, so the CAP theorem is attached to it. RDBMS, meanwhile, is very strong on transactions, and every RDBMS has transaction support, which is why RDBMS is attached to the ACID properties. Remember, the ACID properties are for database transactions, not for RDBMS!
So can a NoSQL DBMS have ACID properties? It may sound like a silly question, but the answer is yes: any NoSQL DBMS that supports transactions can, e.g. Neo4j and MongoDB.
Please take a look at the CAP image above! As we know, we can only choose 2 out of 3 guarantees in the CAP theorem. Partition tolerance is one we have to choose: as in our previous analysis, if we don't choose Partition tolerance, our system is no longer a distributed system, and we are almost choosing an RDBMS already. So we have to consider whether our system needs Availability over Consistency or Consistency over Availability. Along with that, we need a way to handle risky situations when they occur, to limit the harm.
Even Google, Facebook, or Amazon can only choose 2/3.
Google Spanner chooses the CP direction. Its downtime is 5.26 minutes per year.
Facebook's HBase chooses CP. Its downtime is slightly larger than Google's.
Google and Facebook choose Consistency over Availability, following the opinion: "Even if you follow the Availability school, it will not be available 100% of the time, so follow Consistency!"
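A figure like 5.26 minutes of downtime per year is easier to judge as an availability percentage. The short calculation below shows that it corresponds to 99.999% availability ("five nines"). The function names are my own, and I assume a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Downtime per year implied by an availability ratio such as 0.99999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes):
    """Availability ratio implied by a yearly downtime in minutes."""
    return 1.0 - minutes / MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(0.99999)   # about 5.26 minutes
four_nines = downtime_minutes_per_year(0.9999)    # about 52.56 minutes
spanner = availability_from_downtime(5.26)        # about 0.99999
```

Each extra "nine" cuts the allowed downtime by a factor of 10, which is why even a CP system can look highly available in practice.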
Amazon's DynamoDB chooses AP. Amazon wants Availability, so it accepts weaker Consistency, yet Amazon still says DynamoDB offers strong consistency, because they mitigate the problem with techniques such as vector clocks and conflict resolution in the calling application.
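For readers who have not met vector clocks before, here is a minimal sketch of the general technique in Python. It is not Amazon's implementation; the function names and the `node-a` / `node-b` identifiers are my own assumptions. Each version of a value carries a counter per node, and comparing two clocks tells us whether one version replaces the other or whether they conflict.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a, b):
    """Element-wise maximum of two clocks, used after a conflict is resolved."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


v1 = increment({}, "node-a")   # first write, handled by node-a
v2 = increment(v1, "node-a")   # node-a updates its own version
v3 = increment(v1, "node-b")   # node-b updates v1 without having seen v2

order = compare(v1, v2)        # v1 is an ancestor of v2
conflict = compare(v2, v3)     # neither version descends from the other
resolved = merge(v2, v3)       # clock for the value the application reconciles
```

When `compare` returns `concurrent`, the database cannot pick a winner safely, so both versions are handed to the calling application to reconcile.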
4. Facts about CAP
Have you ever looked at the definition of the CAP theorem and been confused between Availability and Partition tolerance? In fact, Availability and Partition tolerance are not 100% separate but overlap with each other. There has been a lot of debate about the CAP theorem, with some arguing that it is not correct. Some even think that it should be replaced by the PIE theorem.
After all, it is not advisable to focus too much on the CAP theorem. It's still true, but a theorem is just a theorem. If we keep focusing on it and end up ignoring other important problems, we will fail quickly. We should look at the problem broadly, weigh the trade-offs, and choose appropriate solutions.
"You shouldn’t really care about the CAP theorem." - This sentence is from a Google engineer.
I don't mean it quite like that. We shouldn't ignore the theorem completely, but if we focus on it too much, we will make the mistakes I analyzed above.
Summary of the series
In Part 1, I talked about how the types of NoSQL DBMS work. In this Part 2, I talked about how NoSQL DBMS behave according to the CAP theorem (you can review the CAP theorem picture at the beginning of the article: find the DBMS you are interested in, and its type tells you how that DBMS handles these cases). I hope you now have an overview of NoSQL DBMS, as well as an insight into the database system you are working with.
Further, when you choose a database, you have to consider many issues:
- What will the business of the whole system look like? Will there be any "difficult" cases? Each business has its own essential requirements: sometimes consistency, sometimes availability, sometimes performance, and so on. Based on that, we choose the appropriate model and type, CAP-A or CAP-C. Another question is how the system will change in the future and whether our DBMS will still be suitable. Will our DBMS scale well if there are more customers and more data?
- What will your system actually look like in production? What configuration and budget will our servers require? Where will our system be deployed, worldwide or locally? This helps us imagine the architecture of the system. Next, does the budget allow us to own powerful servers? This in turn affects the DBMS we choose.
- Is the DBMS open-source or proprietary? This affects the license cost as well as the reliability and level of support of the DBMS.
- What server OS does the DBMS require? For example, some DBMSs can only be used on Windows but not on Linux, and some require installing middleware or a separate platform, so we have to check whether there is a conflict or other difficulty.
- What programming languages does the DBMS support? For example, if you do a project in C# but the DBMS you choose only supports C++ and Go, it's difficult. There are libraries that bind APIs between C# and C++, but they need to be considered carefully because they are often open-source and nobody knows what will happen to them tomorrow. For example, when the DBMS changes or adds APIs, will the library support them in time?
- Which website should you refer to? The DB-Engines page shows the supported server OS and the license of each DBMS. Very intuitive and quite complete!
There are many more, but for the time being, I will only list the things that I find most worthy of attention here. Above all, we must determine what our system really needs and what conditions allow it, in order to be able to decide, to make the right trade-offs.
I will stop this series here. Hopefully, this series will help you understand more deeply and have a multi-dimensional view of NoSQL DBMS! If there are any shortcomings, please comment to help me improve them! Best regards, and see you soon!