Now that we know why we should get ourselves familiarized with Big Data (there is an article that predicts a 50 – 60 percent shortage of deep data analysis talent by 2018, relative to projected demand), let me tell you how to go about getting familiarized.
Just the core of why Big Data matters (for all those who might not remember what you saw in the previous post).
The key enablers for the growth of Big Data are:
- Increase of storage capacity
- Increase of processing power
- Availability of data
So to implement Big Data, or even just to understand its implementation, we have to know why it is implemented and the ways in which it can be implemented.
We prefer Big Data because
- Almost 94% of our data is digitized.
- The computational capacity has increased to over 10^12 million instructions per second!
- Companies have at least a few terabytes of data stored, and some even up to a petabyte.
- The types of data produced can be broadly classified into video, image, audio, and text/numbers.
- Due to the mobility of smartphones, the number of people accessing this information has increased.
- And let’s not forget about the data/information available in the “Internet of Things”!!!
When we talk about Big Data there are a few basic terms that we gotta know before actually stepping into the zone and applying the known tools!
So, there is this Big Data Activity/Value chain
- Generate Data
- Aggregate Data
- Analyze Data
- Consume Data and Derive Data
(I’ll come to this later…)
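Before we get there, here is a toy sketch of that value chain in plain Python. The stage names and data are hypothetical illustrations of the four steps, not a real Big Data tool:

```python
# A toy sketch of the Big Data activity/value chain:
# generate -> aggregate -> analyze -> consume/derive.
# The stage names and sample data are made up for illustration.

def generate():
    # Generate: raw events, e.g. sensor or log readings.
    return [{"sensor": "A", "temp": 21}, {"sensor": "A", "temp": 23},
            {"sensor": "B", "temp": 19}]

def aggregate(events):
    # Aggregate: collect readings per sensor.
    groups = {}
    for event in events:
        groups.setdefault(event["sensor"], []).append(event["temp"])
    return groups

def analyze(groups):
    # Analyze: derive a simple statistic (the average) per sensor.
    return {sensor: sum(temps) / len(temps) for sensor, temps in groups.items()}

def consume(insights):
    # Consume/derive: act on the insight; here, just format a report.
    return [f"sensor {s}: avg temp {v:.1f}" for s, v in sorted(insights.items())]

print(consume(analyze(aggregate(generate()))))
# → ['sensor A: avg temp 22.0', 'sensor B: avg temp 19.0']
```

Real systems replace each stage with distributed tooling, but the shape of the chain stays the same.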
The tools typically used in Big Data scenarios are covered below.
Why so many tools? Because Big Data is a really hard problem! When?
- When we have to perform complex operations on data (analyzing, categorizing, modelling, and reasoning)
- And even if we have techniques for the above-mentioned operations, we have to consider the scale of the data involved
The Big Data technologies
Among the Big Data technologies, the fundamental ones (those that encapsulate the rest) are:
- The Apache Software Foundation’s Java-based Hadoop programming framework (pretty big name, eh!), which can run applications on systems with thousands of nodes;
- The famous MapReduce software framework, which consists of a Map function that distributes work to different nodes and a Reduce function that gathers results and resolves them into a single value.
- Also gaining more attention is the Apache Hive data warehousing component, which offers a query language called HiveQL that translates SQL-like queries into MapReduce jobs automatically.
- Finally, Microsoft is trying to get in on the Hadoop action with its own SQL Server-Hadoop connectors.
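To make the MapReduce idea concrete, here is a minimal single-process sketch of the classic word-count job. In real Hadoop the framework distributes map tasks across nodes and shuffles intermediate pairs to reducers; this toy version only simulates those phases locally:

```python
from collections import defaultdict

def map_words(document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_counts(word, counts):
    # Reduce: resolve all counts for one word into a single value.
    return (word, sum(counts))

def mapreduce(documents):
    # Shuffle: group intermediate values by key, as the framework would
    # do between the map and reduce phases.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_words(doc):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())

print(mapreduce(["big data is big", "data is everywhere"]))
# → {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The same map and reduce functions could, in principle, be run under Hadoop Streaming; the point here is only the programming model, not a production setup.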