
The Characterization and Types of Big Data

(U.S. Navy Blue Angels, U.S. Marine Corps Air Station Miramar, Jeff M. Wang)


The Characterization of Big Data


A precise specification of ‘big’ is elusive. What is considered big for one organization may be small for another. What is large-scale today will likely seem small-scale in the near future; petabyte is the new terabyte. Thus, size alone cannot specify big data. The complexity of the data is an important factor that must also be considered. 

Big data is commonly characterized along six dimensions, the six V’s, which also embody its challenges: huge amounts of data, in different formats and of varying quality, that must be processed quickly.


  • Volume: This refers to the vast amounts of data that are generated every second, minute, hour, and day in our digitized world.
  • Velocity: This refers to the speed at which data is being generated and the pace at which data moves from one point to the next.
  • Variety: This refers to the ever-increasing different forms that data can come in, e.g., text, images, voice, geospatial.
  • Veracity: This refers to the quality of the data, which can vary greatly.
  • Valence: This refers to the connectedness of big data, that is, how data items can bond with one another, forming connections between otherwise disparate datasets.
  • Value: Processing big data must bring about value from insights gained.


It is important to note that the goal of processing big data is to gain insight to support decision-making. It is not sufficient to just be able to capture and store the data. The point of collecting and processing volumes of complex data is to understand trends, uncover hidden patterns, detect anomalies, etc. so that you have a better understanding of the problem being analyzed and can make more informed, data-driven decisions. 

To address the challenges of big data, innovative technologies are needed. Parallel and distributed computing paradigms, scalable machine learning algorithms, and real-time querying are key to the analysis of big data. Distributed file systems, computing clusters, cloud computing, and data stores that support data variety and agility are also necessary to provide the infrastructure for processing big data. Workflows provide an intuitive, reusable, scalable, and reproducible way to process big data, to gain verifiable value from it, and to apply the same methods to different datasets.
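
As a rough illustration of the parallel processing idea, the sketch below counts words with a map step (count each chunk independently) and a reduce step (merge the partial counts). The function names and sample chunks are hypothetical, and a thread pool on one machine only stands in for a real computing cluster, where a distributed framework would spread the chunks across many nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(chunk: str) -> Counter:
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.lower().split())


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(chunks: list[str], workers: int = 4) -> Counter:
    """Count words across all chunks, handling the chunks concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Start from an empty Counter so an empty input still returns a Counter.
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    # Hypothetical chunks standing in for partitions of a much larger dataset.
    chunks = ["big data is big", "data moves fast", "big value"]
    print(word_count(chunks).most_common(2))
```

Because each chunk is counted independently, the map step can be spread over as many workers as there are partitions; only the final merge needs to see all partial results.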


The Data Types of Big Data


Big data comes in three types: structured, unstructured, and semi-structured.


Structured Data


Structured data has a long history and is the type commonly used in organizational databases. Its high level of organization makes it predictable, easy to organize, and easy to search with basic algorithms. The information is rigidly arranged: data is entered in specific fields containing textual or numeric values, and these fields often have a defined maximum or expected size. In addition to this firm structure, structured data has strict rules governing how it is accessed.

Examples of structured data include relational databases and other transactional data like sales records, as well as Excel files that contain customer address lists. This type of data is generally stored in tables. 
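
A minimal sketch of structured data, using Python's built-in sqlite3 module with an in-memory database. The table name, columns, and rows are made up for illustration; the point is that every record fits the same predefined fields, so a simple query can search it.

```python
import sqlite3


def build_sales_db() -> sqlite3.Connection:
    """Create an in-memory table with a fixed schema and a few sample rows."""
    conn = sqlite3.connect(":memory:")
    # Each column has a declared type; SQLite accepts the VARCHAR(50) size
    # but does not enforce it, unlike many other relational databases.
    conn.execute(
        """
        CREATE TABLE sales (
            id INTEGER PRIMARY KEY,
            customer VARCHAR(50) NOT NULL,
            amount REAL NOT NULL
        )
        """
    )
    # Hypothetical sales records.
    conn.executemany(
        "INSERT INTO sales (customer, amount) VALUES (?, ?)",
        [("Alice", 120.0), ("Bob", 75.5), ("Alice", 30.0)],
    )
    conn.commit()
    return conn


def total_by_customer(conn: sqlite3.Connection, customer: str) -> float:
    """Sum the sales for one customer; returns 0.0 if there are none."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE customer = ?",
        (customer,),
    ).fetchone()
    return float(row[0])


if __name__ == "__main__":
    conn = build_sales_db()
    print(total_by_customer(conn, "Alice"))
```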


Unstructured Data


Unstructured data is not organized in any discernable manner and has no associated data model. Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, presentations, webpages, videos, photos, audio files, satellite images, some health records, and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly in a database.
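
To see why such content does not fit neatly in a database, consider the sketch below. The note text and the two patterns are hypothetical; nothing in the text marks where a date or an amount lives, so the fields have to be guessed with pattern matching, and a differently worded note would need different patterns.

```python
import re

# Hypothetical free-text note with no fields or tags.
NOTE = "Met the client on 2021-03-15. They agreed to pay $1,250.50 by April."

# Assumed formats: ISO-style dates and dollar amounts with optional cents.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT_PATTERN = re.compile(r"\$(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")


def extract_fields(text: str) -> dict:
    """Pull dates and dollar amounts out of free text, if any are present."""
    dates = DATE_PATTERN.findall(text)
    amounts = [float(m.replace(",", "")) for m in AMOUNT_PATTERN.findall(text)]
    return {"dates": dates, "amounts": amounts}


if __name__ == "__main__":
    print(extract_fields(NOTE))
```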


Semi-Structured Data

Semi-structured data contains semantic tags or markers that identify its elements, but it does not conform to the rigid structure of a typical relational database. Entities that belong to the same class may have different attributes. Examples include XML and other markup languages, as well as email, whose headers are tagged fields while the message body is free text.
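
The sketch below illustrates this with a small, made-up XML document parsed by Python's standard xml.etree.ElementTree module. Both records belong to the same class (person), yet each carries a different set of tagged attributes, which a fixed table schema would not allow without empty columns.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: same class, different attributes.
XML_DOC = """
<people>
  <person>
    <name>Ana</name>
    <email>ana@example.com</email>
  </person>
  <person>
    <name>Ben</name>
    <phone>555-0100</phone>
    <city>San Diego</city>
  </person>
</people>
"""


def parse_people(xml_text: str) -> list[dict]:
    """Turn each <person> element into a dict keyed by its child tag names."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in person}
        for person in root.findall("person")
    ]


if __name__ == "__main__":
    for person in parse_people(XML_DOC):
        # The tags tell us what each value means, even though the sets differ.
        print(sorted(person))
```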

The Byte Scale


This is an intuitive look at large data sizes:


  • Byte (8 bits) 
  • Kilobyte (1000 Bytes) 
  • Megabyte (1 000 000 Bytes) 
  • Gigabyte (1 000 000 000 Bytes) 
  • Terabyte (1 000 000 000 000 Bytes) 
  • Petabyte (1 000 000 000 000 000 Bytes) 
  • Exabyte (1 000 000 000 000 000 000 Bytes) 
  • Zettabyte (1 000 000 000 000 000 000 000 Bytes) 
  • Yottabyte (1 000 000 000 000 000 000 000 000 Bytes) 
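  • Ronnabyte (1 000 000 000 000 000 000 000 000 000 Bytes; informally called a xenottabyte) 
  • Quettabyte (1 000 000 000 000 000 000 000 000 000 000 Bytes; informally called a shilentnobyte) 
  • Domegemegrottebyte (1 000 000 000 000 000 000 000 000 000 000 000 Bytes; an informal name, since no official SI prefix exists at this scale)  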
Note: The kilobyte is a multiple of the unit byte for digital information. The International System of Units (SI) defines the prefix kilo as 1000 (10**3); per this definition, one kilobyte is 1000 bytes. In historical usage in some areas of information technology, particularly in reference to digital memory capacity, kilobyte denotes 1024 (2**10) bytes. Under that binary convention, the scale reads:

8 bits = 1 byte 
1024 bytes = 1 kilobyte 
1024 kilobytes = 1 megabyte 
1024 megabytes = 1 gigabyte
1024 gigabytes = 1 terabyte 
1024 terabytes = 1 petabyte 
1024 petabytes = 1 exabyte 
1024 exabytes = 1 zettabyte
1024 zettabytes = 1 yottabyte
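1024 yottabytes = 1 ronnabyte (informally, xenottabyte)
1024 ronnabytes = 1 quettabyte (informally, shilentnobyte)
1024 quettabytes = 1 domegemegrottebyte (an informal name; no official prefix exists at this scale)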
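
The two conventions above (powers of 1000 and powers of 1024) give noticeably different figures for the same byte count. The following sketch, with a hypothetical helper named format_bytes, converts a raw size under either convention. As an assumption beyond the text above, it labels the 1024-based units with the IEC binary names (KiB, MiB, and so on) so the two scales cannot be confused.

```python
# Decimal (SI) units step by 1000; binary (IEC) units step by 1024.
DECIMAL_UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
BINARY_UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]


def format_bytes(size: int, binary: bool = False) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    base = 1024 if binary else 1000
    units = BINARY_UNITS if binary else DECIMAL_UNITS
    value = float(size)
    index = 0
    # Divide by the base until the value fits the current unit,
    # stopping at the largest unit in the list.
    while value >= base and index < len(units) - 1:
        value /= base
        index += 1
    return f"{value:.2f} {units[index]}"


if __name__ == "__main__":
    one_million = 1_000_000
    print(format_bytes(one_million))               # decimal convention
    print(format_bytes(one_million, binary=True))  # binary convention
```

The gap between the two conventions widens with each step up the scale, which is why a storage device advertised in decimal units appears smaller when an operating system reports it in binary units.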

