Hello there,
How are you doing? Welcome to scaleyourapp.com
This write-up is a comprehensive guide to UUID & GUID. It answers pretty much all our questions on the topic.
What is a UUID? What is a GUID? What is the difference between them?
Are they really unique? How do big online platforms like GitHub, Facebook & Aadhaar generate unique ids to identify information?
Has there ever been a UUID collision? What happens when there is a collision?
So, without any further ado, let’s dive deep into it.
1. What is a UUID?
UUID stands for universally unique identifier. It’s a 128-bit number used to uniquely identify information in online computing platforms that hold a massive amount of data.
UUIDs are most useful in systems that register a huge amount of data, for instance, a database logging billions of events round the clock or a code hosting platform like GitHub receiving a very large number of code commits every minute.
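To make the 128-bit part concrete, here is a minimal sketch using Python’s standard uuid module. The choice of a random (version 4) UUID is just for illustration, any version has the same size & text layout.

```python
import uuid

# Generate a random (version 4) UUID with the standard library.
u = uuid.uuid4()

print(u)                 # text form: 32 hex digits grouped 8-4-4-4-12
print(u.int)             # the same value as one big integer
print(len(u.bytes) * 8)  # 128, since a UUID is 16 bytes long
```

The text form is what we usually see in logs & URLs, while the 16 raw bytes are what a database can store compactly.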
Online biometric systems deployed by countries also rely on unique identifiers for identifying citizens & their respective data. India’s Aadhaar, one of the biggest biometric identity systems in the world, generates a VID (virtual identifier) for its citizens to be used for official purposes, though the VID itself is a 16-digit number rather than a 128-bit UUID.
Ohk… now I kind of understand the basic stuff but I’ve also heard about GUID. What is it?
2. What is GUID?
GUID stands for Globally Unique Identifier & is used synonymously with UUID. It’s essentially the same thing, with the word universally replaced by globally. GUID is the name Microsoft uses for its implementation of UUID.
This is the only difference. So, we can say UUID is the generic term whereas GUID is what Microsoft likes to call a UUID.
3. Are UUIDs Really Unique?
Well, yes for practical purposes & no if you are thinking there will never be a UUID collision. A collision is possible, but the number of possible keys is so large that the probability of a collision is almost zero.
As per Wikipedia, the number of random version 4 UUIDs that need to be generated to have a 50% probability of at least 1 collision is 2.71 quintillion.
This is equivalent to generating around 1 billion UUIDs per second for about 86 years. Oh Boy!! that’s a really long time. If my software system runs for that long I would already be a zillionaire.
And at 16 bytes per UUID, the file containing this many UUIDs would be over 40 exabytes in size, far larger than a typical production database. Of course.
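The numbers above can be reproduced with the usual birthday-problem approximation, n ≈ sqrt(2 · N · ln(1 / (1 - p))), where N is the number of possible ids. This is a rough sketch, assuming version 4 UUIDs with 122 random bits & a steady rate of 1 billion ids per second. The function name is mine, not a library API.

```python
import math

RANDOM_BITS = 122  # a version 4 UUID has 122 random bits


def ids_for_collision_probability(p, bits=RANDOM_BITS):
    """Approximate count of random ids needed for probability p of any collision."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))


n = ids_for_collision_probability(0.5)
years = n / 1e9 / (365.25 * 24 * 3600)  # at 1 billion ids per second
size_exabytes = n * 16 / 1e18           # 16 bytes per UUID

print(f"{n:.3g} ids, {years:.0f} years, {size_exabytes:.0f} EB")
```

Running it gives roughly 2.71 quintillion ids, which is where the decades-long time span & the tens of exabytes of storage come from.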
So, now we kinda have an idea of whether we should use UUIDs for identifying information in our online systems.
Still, it’s not a foolproof solution. We need to have collision handling checks in place. More discussion on it further in the article.
UUIDs were originally standardized by the Open Software Foundation (OSF), a non-profit organization founded to create an open standard for the implementation of the UNIX operating system. Today they are specified by the IETF in RFC 4122 & its successor RFC 9562.
UUIDs really shine when we merge separate systems tracking information into one system. If each system uses UUIDs to identify its data, the identifiers stay unique across the merged system as a whole.
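Here is a tiny sketch of that merge scenario. The two “shops” & their records are made up for illustration, & plain dictionaries stand in for the databases.

```python
import uuid

# Two independent systems that number their records 1, 2, 3, ...
shop_a = {1: "alice", 2: "bob"}
shop_b = {1: "carol", 2: "dave"}
merged_int = {**shop_a, **shop_b}  # keys 1 and 2 clash, shop_a's rows are overwritten

# The same records keyed by UUIDs generated independently in each system
shop_a_uuid = {uuid.uuid4(): name for name in ("alice", "bob")}
shop_b_uuid = {uuid.uuid4(): name for name in ("carol", "dave")}
merged_uuid = {**shop_a_uuid, **shop_b_uuid}

print(len(merged_int))   # 2, two records were silently lost
print(len(merged_uuid))  # 4, every record kept its own id
```

With integer keys we would have to renumber one system before merging. With UUIDs the ids can be carried over as they are.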
4. Has There Ever Been A UUID Collision?
As I’ve stated earlier, UUID collisions are possible. There have been collisions in the real world.
Refer to this GitHub post for instance.
It’s a discussion amongst developers, finding ways to tackle the issue of collisions. Good read I would say.
Though those collisions occurred due to buggy code, not due to pure random chance.
5. What Happens When There is A UUID Collision?
The repercussions of a UUID collision depend on how critical the affected data is to our system. How badly would it impact the functionality of the system, if it ever happens?
We always need to have exception handling code in place to smoothly tackle collisions if they ever occur.
Without the checks, we can never be sure how badly it can blow up on us. For instance, if there is an identifier collision amongst online wire transaction ids, we can imagine the confusion it’s gonna create.
UUID collisions in systems handling deadly weapons can wreak havoc.
On the other hand, collisions in user event log ids in a database might not be that critical.
So, you see, it entirely depends on how critical a collision is to your system.
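One common way to put such a check in place is to let the database enforce uniqueness & retry with a fresh id when an insert is rejected. Below is a minimal sketch using Python’s built-in sqlite3 module. The events table, the insert_with_retry helper & the retry limit are all assumptions made for this example.

```python
import sqlite3
import uuid


def insert_with_retry(conn, payload, new_id=uuid.uuid4, max_attempts=3):
    """Insert payload under a fresh id, retrying if the id is already taken."""
    for _ in range(max_attempts):
        candidate = str(new_id())
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO events (id, payload) VALUES (?, ?)",
                    (candidate, payload),
                )
            return candidate
        except sqlite3.IntegrityError:
            continue  # primary key clash, try again with another id
    raise RuntimeError("could not find a free id")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT PRIMARY KEY, payload TEXT)")
first = insert_with_retry(conn, "login")
print(first)
```

The PRIMARY KEY constraint is what actually detects the collision. The application code only decides what to do about it, which for a critical system may be raising an alert rather than quietly retrying.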
6. Using UUIDs in Large Scale Systems
Using UUID as a Primary Key in the database
1. In case the database is sharded, there is no problem in identifying resources split across the shards.
No matter how many times the database is sharded, the data identifying mechanism remains unaffected.
2. Random (version 4) UUIDs don’t give out much information even if an id is leaked. For instance, if a sequential customer id 361 is leaked, perpetrators could guess the ids of other customers, but if the id were a UUID it would be really hard to guess the others. Though ideally neither should be exposed.
3. One downside of using a UUID is that it’s a 128-bit number, so naturally it occupies four times the space of a 32-bit integer value, & even more when stored in its 36-character text form. When the data is huge, the memory & storage consumed by UUIDs becomes quite significant.
So, a good design would be to use a mix of integers & UUIDs, each wherever it fits best.
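To get a feel for the space trade-off in point 3, here is a back-of-the-envelope sketch. It only counts the raw key bytes, & the 1 billion row count is an arbitrary figure picked for illustration. Real databases add index & row overhead on top.

```python
import struct
import uuid

u = uuid.uuid4()

int_key_bytes = len(struct.pack("<i", 361))    # 4 bytes for a 32-bit integer
uuid_binary_bytes = len(u.bytes)               # 16 bytes in binary form
uuid_text_bytes = len(str(u).encode("ascii"))  # 36 bytes in text form

rows = 1_000_000_000
extra_gb = rows * (uuid_binary_bytes - int_key_bytes) / 1e9
print(int_key_bytes, uuid_binary_bytes, uuid_text_bytes)  # 4 16 36
print(extra_gb)  # 12.0 extra gigabytes of key data for a billion rows
```

Storing the UUID in a binary column instead of as text already cuts its size by more than half.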
Twitter built a network service called Snowflake for generating unique ids at high scale. Note that Snowflake ids are 64-bit integers rather than 128-bit UUIDs.
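Snowflake-style ids, as publicly described, pack a millisecond timestamp, a worker id & a per-millisecond sequence number into one 64-bit integer. The sketch below follows the commonly cited 41/10/12 bit split. The class name, the custom epoch & the way clock hiccups are handled are my own assumptions, not Twitter’s actual code.

```python
import threading
import time


class SnowflakeLikeGenerator:
    EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch, in milliseconds
    WORKER_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            # Never go below the last timestamp used, even if the clock steps back.
            now = max(time.time_ns() // 1_000_000, self.last_ms)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) % (1 << self.SEQUENCE_BITS)
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond, wait for the next one.
                    while now <= self.last_ms:
                        now = time.time_ns() // 1_000_000
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - self.EPOCH_MS
            return (
                (timestamp << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.worker_id << self.SEQUENCE_BITS)
                | self.sequence
            )


generator = SnowflakeLikeGenerator(worker_id=7)
print(generator.next_id())
```

Because the timestamp sits in the high bits, ids from one worker sort roughly by creation time, & each worker can generate ids without talking to the others as long as worker ids are unique.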
7. How Does Git Create Unique IDs For Code Commits?
Git creates a 40-character hexadecimal SHA-1 hash (160 bits) to uniquely identify each commit. The SHA-1 hash also ensures data integrity of the commits. Any change in a commit also changes its hash value.
Is there a possibility of an SHA-1 hash collision?
Here is what Git has to say about it.
If each & every one of the 6.5 billion humans on earth were programming, & every second each one of us was producing code equivalent to the entire Linux kernel history, which is around 6.5 million Git objects, & pushing it into one enormous Git repository, it would take roughly 2 years for that repository to have a 50% probability of a single SHA-1 hash collision.
8. How are UUIDs or GUIDs Generated?
Speaking of UUID or GUID generation, the first thing we should be clear about is what kind of UUID we wish to generate.
There are several versions of UUID available: version 1, 3, 4 & 5. Also, there are standard ways to generate UUIDs, so we might not always need to write our own implementation.
Version 1
Generating a version 1 UUID involves the MAC address of the computer & a timestamp. The downside of this version is that it reveals which computer created the UUID & when it was generated.
There is a variation to this: instead of the real MAC address of the computer, a randomly generated node id with the multicast bit set is used.
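To see how much a version 1 UUID gives away, we can generate one & read the fields back. This sketch uses Python’s uuid.uuid1 with a fixed, made-up node value in place of a real MAC address so the result is predictable.

```python
import datetime
import uuid

NODE = 0x001A2B3C4D5E  # made-up 48-bit node id standing in for a MAC address
u = uuid.uuid1(node=NODE, clock_seq=0)

# u.time counts 100-nanosecond intervals since 1582-10-15 (the Gregorian epoch).
GREGORIAN_EPOCH = datetime.datetime(1582, 10, 15, tzinfo=datetime.timezone.utc)
created_at = GREGORIAN_EPOCH + datetime.timedelta(microseconds=u.time // 10)

print(u.version)         # 1
print(f"{u.node:012x}")  # 001a2b3c4d5e, the node id is readable from the UUID
print(created_at)        # the moment the UUID was generated, in UTC
```

Anyone holding the UUID can do the same decoding, which is exactly the privacy downside described above.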
Version 3
Creating a version 3 UUID involves deriving the UUID from a namespace & a specified input name.
The namespace & the input name are MD5 hashed, so the input is not readable from the generated UUID. An identifier generated this way is not random or dependent on any kind of environment. Thus, it is reproducible: the same namespace & name always give the same UUID.
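A quick sketch of that reproducibility with Python’s uuid.uuid3. The name “python.org” & the DNS namespace are taken from the example in the Python documentation, any other name would work the same way.

```python
import uuid

a = uuid.uuid3(uuid.NAMESPACE_DNS, "python.org")
b = uuid.uuid3(uuid.NAMESPACE_DNS, "python.org")
c = uuid.uuid3(uuid.NAMESPACE_URL, "python.org")

print(a)       # 6fa459ea-ee8a-3ca4-894e-db77e160355e
print(a == b)  # True, same namespace and name give the same UUID
print(a == c)  # False, a different namespace gives a different UUID
```

This is handy when two systems need to arrive at the same id for the same name without coordinating with each other.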
Version 4
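A version 4 UUID is derived entirely from randomly generated numbers: 122 of its 128 bits are random, while the remaining 6 bits mark the version & the variant. Since it is not reproducible & reveals nothing about the machine or any input, it’s safer in comparison to version 1 or 3.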
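In practice uuid.uuid4() does all the work, but building one by hand shows where the fixed version & variant bits go. This is a sketch for illustration only, assuming the secrets module as the source of randomness.

```python
import secrets
import uuid

raw = bytearray(secrets.token_bytes(16))  # 128 random bits
raw[6] = (raw[6] & 0x0F) | 0x40           # top 4 bits of byte 6 = version 4
raw[8] = (raw[8] & 0x3F) | 0x80           # top 2 bits of byte 8 = RFC 4122 variant
manual = uuid.UUID(bytes=bytes(raw))

print(manual)
print(manual.version)                      # 4
print(manual.variant == uuid.RFC_4122)     # True
```

In the text form, the version shows up as the first digit of the third group, which is always 4 for this kind of UUID.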
Version 5
A version 5 UUID is similar to version 3. The primary difference is that the SHA-1 hashing algorithm is used instead of MD5.
SHA-1 is considered a safer algorithm than MD5.
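To show what “hashing the name” means here, the sketch below rebuilds a version 5 UUID by hand & compares it with Python’s uuid.uuid5. The manual_uuid5 helper is hypothetical, written only for this comparison.

```python
import hashlib
import uuid


def manual_uuid5(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Build a version 5 UUID from the SHA-1 of namespace bytes + name."""
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    raw = bytearray(digest[:16])     # keep the first 128 of the 160 bits
    raw[6] = (raw[6] & 0x0F) | 0x50  # set the version to 5
    raw[8] = (raw[8] & 0x3F) | 0x80  # set the RFC 4122 variant
    return uuid.UUID(bytes=bytes(raw))


by_hand = manual_uuid5(uuid.NAMESPACE_DNS, "python.org")
built_in = uuid.uuid5(uuid.NAMESPACE_DNS, "python.org")

print(built_in)             # 886313e1-3b8a-5372-9b90-0c9aee199e5d
print(by_hand == built_in)  # True
```

Swapping hashlib.sha1 for hashlib.md5 & the version nibble for 3 would give the version 3 variant of the same idea.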
There are also several unique identifier implementations provided by well-known operating systems.
If you want to generate one in Java, the standard library provides the java.util.UUID class.
Guys!!
This was pretty much it about generating unique identifiers. I believe this much info will suffice to have a fundamental understanding of what universally unique ids are & how they are generated.
If you liked the article, do let me know in the comments. I would love to know your views on this. Any kind of feedback would be really meaningful to me. Also, do share it with your friends.
See you in the next article.
Until then…
Cheers!!
Shivang