Every single day, 2.5 quintillion bytes of data are generated on the internet. This post focuses on the considerations for managing unstructured data and on the cloud storage offerings that meet those needs.
By some estimates, 90% of the world's data was generated in the last two years. To give you an idea of how much data is created every minute on the internet:
- 2.78 million video views on YouTube
- 300 hours of video uploaded to YouTube
- 13,300+ hours of music streamed from Spotify
- 204 million emails sent
- 123,060 uploads to Instagram
The lion's share of this data is unstructured. It is in the form of videos and images uploaded by applications on multiple devices. Internet applications dealing with unstructured data need to be designed with certain storage considerations that are typically not an issue for traditional applications.
Storage considerations for internet-scale applications
- Storage and management of large volumes of data
- Serving the same data, such as a video, to many users at the same time without I/O becoming the bottleneck. (Who likes to watch that choppy cat video? Not me :-)
- Readiness to cater to unpredictable growth in application usage (i.e., storage usage)
- Data needs to be available at all times, i.e., no maintenance downtime
- Maximum durability, i.e., no data loss or corruption
Traditional Storage technology challenges
Traditionally, enterprises have invested in on-premises data storage systems such as Network Attached Storage (NAS) and Storage Area Network (SAN). These storage technologies worked out great for the applications of yesteryear, which dealt with mostly textual data, lightweight media such as images, and a predictable number of application users. These technologies are not suitable for cloud-scale applications for several reasons:
- Costly to manage
- High failure rates (storage hardware failure rates)
- Challenge of scalability
- Downtime during maintenance
The good news is that storage technology has kept pace with the needs of internet-scale applications. The better news is that the cost of storage is constantly going down. And the best news is that storage technology now meets the needs of internet-scale applications.
The major cloud vendors today offer two types of elastic storage: block storage and object storage.
Block storage, used for distributed databases (RDBMS as well as NoSQL), provides a scalable way of managing transactional data. Such data is mostly textual (a few kilobytes) and is expected to undergo CRUD operations over its lifetime. High availability for such data is typically achieved through replication. To use block storage on a cloud platform, you also need to use compute resources on that same platform. In other words, you cannot purchase block storage on Amazon and then use it on Microsoft Azure.
There are various scenarios in which you have no choice but to use block storage:
- You need high IOPS (input/output operations per second)
- System dependency, e.g., database systems need raw file systems
- Storage for compute resources (VMs)
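To make the "CRUD over its lifetime" point concrete, here is a minimal Python sketch of an in-place update, the access pattern that databases rely on and that block storage supports. It assumes a file of fixed-size records sitting on a block-backed volume; the record size and the helper names (update_record, read_record, demo) are made up for illustration and are not part of any cloud SDK.

```python
import os
import tempfile

RECORD_SIZE = 16  # hypothetical fixed record size, in bytes


def update_record(path, index, payload):
    """Overwrite one record in place without rewriting the rest of the file."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload does not fit in one record")
    with open(path, "r+b") as f:
        f.seek(index * RECORD_SIZE)  # jump straight to the record
        f.write(payload.ljust(RECORD_SIZE, b"\x00"))


def read_record(path, index):
    """Read one record back, stripping the zero padding."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)
        return f.read(RECORD_SIZE).rstrip(b"\x00")


def demo():
    """Create a four-record file, update record 2, return it and the file size."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "table.dat")
        with open(path, "wb") as f:
            f.write(b"\x00" * RECORD_SIZE * 4)
        update_record(path, 2, b"order=42")
        return read_record(path, 2), os.path.getsize(path)


if __name__ == "__main__":
    print(demo())
```

Only the 16 bytes of the targeted record are rewritten; the file keeps its size and the other records are never touched. That kind of small random write is what drives the IOPS requirement.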
Object (a.k.a. BLOB) storage is suitable for static data that has more reads than writes and is not modified in place (yes, a video file may get replaced by a new version). Object storage is offered as a self-contained, standalone managed service. In other words, you do not need compute resources on the platform to use it. This kind of storage can also be used from outside the cloud platform. For example, you can use Amazon S3 storage from an application deployed in a virtual machine on Microsoft Azure.
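The access model can be sketched with a toy, in-memory stand-in. This is not how any real service is implemented; the class and method names (ToyObjectStore, put, get) and the use of a content hash as a version tag are assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    """An immutable blob plus the metadata a store would typically keep."""
    data: bytes
    content_type: str
    version_tag: str  # content hash; differs whenever the content differs


class ToyObjectStore:
    """In-memory stand-in for an object storage service (illustrative only)."""

    def __init__(self):
        self._objects = {}  # maps (bucket, key) to StoredObject

    def put(self, bucket, key, data, content_type):
        # A put always writes the whole object. An existing key is replaced
        # by the new version; there is no partial, in-place edit.
        tag = hashlib.sha256(data).hexdigest()
        self._objects[(bucket, key)] = StoredObject(data, content_type, tag)
        return tag

    def get(self, bucket, key):
        # Reads address the object by bucket and key, not by file path
        # on a mounted volume.
        return self._objects[(bucket, key)]


if __name__ == "__main__":
    store = ToyObjectStore()
    v1 = store.put("videos", "cats/funny.mp4", b"first cut", "video/mp4")
    v2 = store.put("videos", "cats/funny.mp4", b"second cut", "video/mp4")
    print(v1 != v2, store.get("videos", "cats/funny.mp4").data)
```

The caller only ever needs a bucket, a key, and the bytes. No volume is mounted and no virtual machine is attached, which is why object storage can be consumed from any platform.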
Object storage is the new kid on the block that promises to address the storage needs of internet-scale applications. Characteristics of object storage services out of the box:
- High redundancy with near-real-time replication
This is truly transparent to the consumer of the service.
- Horizontal scaling leading to practically unlimited storage capacity
Additional storage units may be added to the object storage pool without requiring any downtime or changes in applications.
- Automatic recovery from failures
Hardware failure of a storage unit does not impact the end user, as the data gets served from the units that are available.
- Great performance using a distributed architecture
The data is typically stored across multiple data centers and served from the data center closest to the end user, leading to better performance.
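The first three characteristics can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not a description of any vendor's design: replicas are written synchronously, placement simply prefers the least-loaded online units, and geographic proximity is not modeled. All names (StorageUnit, ReplicatedStore) are hypothetical.

```python
class StorageUnit:
    """One storage node; the `online` flag simulates a hardware failure."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.blobs = {}


class ReplicatedStore:
    """Toy pool that keeps `replicas` copies of every object."""

    def __init__(self, units, replicas=3):
        self.units = list(units)
        self.replicas = replicas
        self.index = {}  # key -> list of units holding a copy

    def add_unit(self, unit):
        # Horizontal scaling: new capacity joins the pool while it is in use.
        self.units.append(unit)

    def put(self, key, data):
        online = [u for u in self.units if u.online]
        if len(online) < self.replicas:
            raise RuntimeError("not enough online units for full redundancy")
        # Drop copies of a previous version before writing the new one.
        for unit in self.index.pop(key, []):
            unit.blobs.pop(key, None)
        targets = sorted(online, key=lambda u: len(u.blobs))[: self.replicas]
        for unit in targets:
            unit.blobs[key] = data
        self.index[key] = targets

    def get(self, key):
        # Automatic recovery: skip failed units and serve from a healthy copy.
        for unit in self.index[key]:
            if unit.online:
                return unit.blobs[key]
        raise RuntimeError("all replicas are offline")


if __name__ == "__main__":
    units = [StorageUnit(f"u{i}") for i in range(4)]
    store = ReplicatedStore(units, replicas=3)
    store.put("cat.mp4", b"meow")
    units[0].online = False  # two of the three replicas fail
    units[1].online = False
    print(store.get("cat.mp4"))  # still served from the surviving copy
```

The reader of cat.mp4 never learns that two units failed, and a unit added through add_unit becomes eligible for new writes without any change on the application side.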
Object (a.k.a. BLOB) storage versus Alternatives
You may argue that the problem of storing data has already been solved, so why is this such a big deal? Let's look at the alternatives:
- File system
You may create a web server farm and place your static data on the file system. This was the first-generation implementation, and it was marred by many shortcomings such as I/O issues under heavy load, downtime during maintenance, and frequent hardware failures. In short, NOT a good idea for internet applications.
- NoSQL Database
For small data, NoSQL databases are still a good choice. If your data can be managed in the form of JSON objects, you may be better off storing it in a replicated NoSQL database. If your data is binary in nature, weigh the pros and cons of your specific NoSQL database before using it, and consider it only for small data.
- Content Management System (CMS)
These systems are (mostly) built as web applications using traditional underlying technologies. The big plus of these systems is the ease with which content may be managed. Let's say you have a website on which some of the content, such as product images and descriptions, is managed by the marketing department. You don't want the IT department to be engaged for such changes, and a CMS gives the end user the capability to make them. You may need to use a CMS for some of your application's data, as the alternative may be to write your own CMS on top of block storage (not recommended, for obvious reasons).
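For the NoSQL alternative in particular, the size-based rule of thumb can be written down as a tiny helper. Treat it as a sketch: the 256 KB cut-off and the function name suggest_store are assumptions of mine, and the right limit depends on the document-size limits and binary support of your specific NoSQL database.

```python
import json

# Hypothetical cut-off; tune it to your specific NoSQL database.
SMALL_DATA_LIMIT_BYTES = 256 * 1024


def suggest_store(payload):
    """Suggest a store for `payload` using a simple size-based rule of thumb."""
    if isinstance(payload, (bytes, bytearray)):
        # Binary data: a NoSQL database is only an option when it is small
        # and the database handles binary values well.
        size = len(payload)
        small_choice = "nosql (check binary support)"
    else:
        # Anything else is assumed to be JSON-serializable.
        size = len(json.dumps(payload).encode("utf-8"))
        small_choice = "nosql"
    return small_choice if size <= SMALL_DATA_LIMIT_BYTES else "object storage"


if __name__ == "__main__":
    print(suggest_store({"sku": "A1", "price": 9.99}))
    print(suggest_store(bytes(5 * 1024 * 1024)))  # e.g. a 5 MB video chunk
```

A product description lands in the NoSQL database, while a multi-megabyte video lands in object storage, which matches the split described above.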
Object Storage Offerings
Many cloud providers offer storage today. In fact, some niche providers offer nothing but storage on the cloud. Here I am covering the commonly used services, but there are many more.
These services more or less share the characteristics of object storage discussed above. The difference is in how your application interacts with the storage service; e.g., Amazon S3 provides an SDK, while the Bluemix Object Storage service is accessed over RESTful services.
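To give a feel for the RESTful style of access, here is a minimal sketch that builds (but does not send) an upload request using only the Python standard library. The endpoint URL, the container/object path layout, and the bearer-token header are placeholders I made up; each provider documents its own URL scheme and authentication. An SDK typically wraps calls of this kind behind methods.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint; substitute the one your provider gives you.
ENDPOINT = "https://objectstorage.example.com/v1"


def build_upload_request(container, name, data, token,
                         content_type="application/octet-stream"):
    """Build an HTTP PUT that would upload `data` as container/name."""
    url = "{}/{}/{}".format(
        ENDPOINT,
        urllib.parse.quote(container),
        urllib.parse.quote(name),  # keeps "/" so keys can look like paths
    )
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "Content-Type": content_type,
            "Authorization": "Bearer " + token,  # placeholder auth scheme
        },
    )


if __name__ == "__main__":
    req = build_upload_request("videos", "cats/funny cat.mp4",
                               b"...video bytes...", "my-token", "video/mp4")
    print(req.get_method(), req.full_url)
    # Sending it would be: urllib.request.urlopen(req)
```

The whole interaction is a single HTTP verb against a URL that names the object, so any language with an HTTP client can talk to the service.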
As an architect or designer of cloud applications, you must treat storage considerations as a first-class citizen and decide, based on the use case, how your application will leverage both block and object storage services. Especially when it comes to managing unstructured data, which may be much larger in volume, you now have the option of leveraging object/BLOB storage from your applications.