Navigating Analytics
Expertise in Building and Managing Teams, Structures, and Processes for Data Management, Analytics, Modeling, Optimization, Business Intelligence Reporting Dashboards, and Real-Time Streaming from Cloud Technologies
Big Data + Open Source + Inexpensive High Powered Parallel Processing + Grid Technologies = Driving the Best in Analytics to the Next Levels for the Best in Business Performance
Aggregation of Big Data From All Sources
The Nutch project, started in 2003 by Doug Cutting to index the web through multi-machine processing, included a distributed file system and a MapReduce facility. Google published its paper on the Google File System (GFS) in 2003 and its paper on MapReduce in 2004. GFS provides fault-tolerant storage and processing of very large amounts of data; MapReduce, a programming model, supports parallel processing of huge amounts of data.
Today, the Apache Hadoop project offers open-source software that is not only reliable, scalable, and available but also enables distributed computing on inexpensive commodity hardware networked together. A cluster can sit in one location or scale out, and can read 2 TB of data in 3.5 minutes.
Huge Forms of Information Are Becoming Accessible - Awesome!!!
Several forces are combining to make analytics attractive to business leaders making the critical decisions that drive their organizations to the next level: the availability of high-powered parallel processing, Hadoop clusters, and grid technologies to capture, process, analyze, and visualize massive amounts of data that grow daily at an exponential rate; access to scalable data storage solutions such as cloud platforms; increased processing of data streams on the go; and the advent of faster computing technologies with the capacity to crunch the data, extract robust information, and present it visually so that it makes sense.
Value Proposition - Big Data
Hadoop Ecosystem
Different Forms of Structured and Unstructured Data
Types of Big and Small Data - Unstructured and Structured Data
Nature of Big Data: What Comes to Mind When We Talk About Big Data
Beyond Relational Database Management Systems (RDBMS): Welcome to Hadoop Ecosystem Where Rapid Development of Additional Products is Exciting to Data and Analytics Enthusiasts
Hadoop Distributed File System (HDFS)
Initially, the best known products within the Hadoop Ecosystem were the Hadoop Distributed File System (HDFS) and MapReduce.
The Hadoop Distributed File System (HDFS), based on the Google File System (GFS), can run on top of a typical native filesystem. It splits files into blocks stored on datanodes and managed by a namenode, and it supports streaming data access patterns built around reads and appends.
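The split-and-track idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the HDFS API: the block size, node names, and placement policy below are all simplified assumptions.

```python
# Conceptual sketch (not the HDFS API): split a byte stream into
# fixed-size blocks and record which "datanode" holds each block
# in a namenode-style index.
BLOCK_SIZE = 4  # real HDFS defaults to 128 MB; tiny here for illustration

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes):
    # Round-robin placement; the real namenode uses rack-aware policies.
    return {i: datanodes[i % len(datanodes)] for i in range(len(blocks))}

blocks = split_into_blocks(b"hello hadoop!")
index = place_blocks(blocks, ["dn1", "dn2", "dn3"])
```

The key point the sketch captures is the separation of concerns: datanodes hold raw blocks, while the namenode holds only the metadata that maps blocks to nodes.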
MapReduce
MapReduce, a Google-originated technology with map and reduce functionality, remains the workhorse that data and analytics enthusiasts really like because of its capacity to process large amounts of data in parallel. Tasks can be scheduled to run at a specified time and placed close to the data they process.
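The map and reduce phases can be illustrated with the classic word count, here as a minimal single-process sketch; real Hadoop distributes these phases across a cluster, and the function names below are illustrative, not Hadoop's.

```python
# A minimal, single-process sketch of the map/shuffle/reduce phases
# (illustrative only; real Hadoop distributes these across a cluster).
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine the grouped values for each key.
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big analytics"])))
# counts == {"big": 2, "data": 1, "analytics": 1}
```

Because each mapper works on its own slice of input and each reducer on its own key group, both phases parallelize naturally, which is exactly what makes the model attractive at scale.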
YARN
YARN came in as an improvement to MapReduce, splitting the JobTracker into two daemons and raising the efficiency of memory management.
Hadoop Column Database or Hbase
Today there is HBase, the Hadoop column database: an open-source NoSQL database, written in Java and distributed to handle large tables. It focuses on batch and random reads even with a limited query model, scales easily, and runs on a cluster of "cheap" hardware.
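HBase's column-oriented data model can be pictured as a nested, sorted map. The sketch below is a plain-Python analogy, assuming hypothetical row keys and family names; the real client API is Java's Table/Put/Get.

```python
# Conceptual sketch of HBase's data model: a map of
# row key -> column family -> column qualifier -> (timestamp, value).
# (Illustrative only; not the HBase client API.)
import time

table = {}

def put(row, family, qualifier, value):
    # Each cell is versioned by timestamp, as in HBase.
    cells = table.setdefault(row, {}).setdefault(family, {})
    cells[qualifier] = (time.time(), value)

def get(row, family, qualifier):
    _ts, value = table[row][family][qualifier]
    return value

put("user#42", "info", "name", "Ada")
put("user#42", "metrics", "logins", 7)
```

The model explains why random reads by row key are cheap while ad-hoc queries are limited: everything is organized around lookups in that nested map, not around relational joins.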
Apache Pig
When scripting becomes an issue, Apache Pig can drive data-flow scripting for multi-step jobs that translate easily into MapReduce tasks.
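A Pig-style multi-step data flow (load, filter, group, aggregate) can be sketched in plain Python; the records and field names are made up for illustration, and each step corresponds to the kind of relational operator Pig compiles into MapReduce tasks.

```python
# Sketch of a Pig-style data flow: LOAD -> FILTER -> GROUP -> SUM.
# (Illustrative only; Pig Latin scripts compile to real MapReduce jobs.)
from collections import defaultdict

records = [("web", 3), ("app", 5), ("web", 2), ("app", 1)]  # LOAD

filtered = [(k, v) for k, v in records if v >= 2]            # FILTER

grouped = defaultdict(list)                                  # GROUP BY
for k, v in filtered:
    grouped[k].append(v)

totals = {k: sum(vs) for k, vs in grouped.items()}           # SUM per group
```

The appeal of Pig is exactly this shape: each named step is one transformation in a pipeline, which is much easier to write and maintain than hand-coded MapReduce jobs.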
Hive
Hive remains the data warehousing infrastructure, with the ability to tap into HDFS and HBase. Hive Query Language (HQL) statements are translated into MapReduce jobs, which Hive executes on the Hadoop cluster.
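How an HQL aggregate decomposes into map and reduce steps can be sketched as follows. The query, table, and column names are hypothetical, and the Python functions only mimic the shape of the jobs Hive would generate from its query plan.

```python
# Sketch of how an HQL aggregate such as
#   SELECT dept, AVG(salary) FROM emp GROUP BY dept;
# decomposes into map and reduce steps (illustrative only).
from collections import defaultdict

emp = [("eng", 100), ("eng", 80), ("ops", 60)]

def mapper(rows):
    # Map: key by the GROUP BY column, emit partial (sum, count) pairs.
    for dept, salary in rows:
        yield (dept, (salary, 1))

def reducer(pairs):
    # Reduce: combine partials per key and finish the average.
    sums = defaultdict(lambda: [0, 0])
    for dept, (salary, n) in pairs:
        sums[dept][0] += salary
        sums[dept][1] += n
    return {dept: total / n for dept, (total, n) in sums.items()}

avg_salary = reducer(mapper(emp))  # {"eng": 90.0, "ops": 60.0}
```

This is the essence of Hive's value: analysts write familiar SQL-like queries, and the translation into parallel map and reduce work happens for them.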
Sqoop
Sqoop helps with importing data from relational databases into Hadoop and exporting it back.
Avro
Avro is a data serialization system.
Oozie
Oozie handles workflow scheduling and management.
Zookeeper
ZooKeeper is the "high-performance coordination" service that lets distributed applications track servers. ZooKeeper is a very good addition to HBase, especially for region assignments.
Chukwa
Chukwa handles large-scale collection of log data while providing analysis and monitoring of the collected data.
Given the fault tolerance within the Hadoop ecosystem, there are capabilities to self-heal, self-start, and self-manage for both the data and the computation loads. The environment replicates data blocks to multiple nodes, restarts a task on another node if it takes too long, and does not lose data just because some nodes fail!
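Why replication means a node failure loses no data can be shown with a small sketch. The node names and round-robin placement are simplifying assumptions; HDFS defaults to a replication factor of 3 and uses rack-aware placement.

```python
# Sketch of block replication surviving a node failure: each block is
# stored on several nodes, so losing one node loses no data.
REPLICATION = 3
nodes = {"dn1": set(), "dn2": set(), "dn3": set(), "dn4": set()}

def store(block_id):
    # Place the block on REPLICATION distinct nodes (round-robin here;
    # the real namenode uses rack-aware placement policies).
    names = sorted(nodes)
    for i in range(REPLICATION):
        nodes[names[(block_id + i) % len(names)]].add(block_id)

def readable(block_id, failed):
    # A block is readable as long as any surviving node still holds it.
    return any(block_id in blocks
               for name, blocks in nodes.items() if name not in failed)

for b in range(4):
    store(b)
```

With three copies of every block, even taking a whole node offline leaves every block readable from at least one surviving replica, which is the basis of the self-healing behavior described above.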
The big users of Hadoop are easily identified as producers of big data; they include Yahoo, Facebook, Netflix, LinkedIn, CNET, and the NY Times.
Mahout
Within the analytics domain, there are machine learning libraries for Hadoop, including Mahout. Mahout supports genetic algorithms, clustering, pattern recognition, and collaborative filtering.
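Clustering, one of Mahout's core capabilities, can be illustrated with a tiny one-dimensional k-means. This is a toy sketch with made-up data points; Mahout's implementations are distributed across the cluster and far more capable.

```python
# A tiny 1-D k-means sketch of the kind of clustering Mahout runs at
# scale on Hadoop (illustrative only).
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: send each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
```

The assignment and update steps both parallelize over the data, which is why this family of algorithms fits MapReduce so naturally.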
Does this mean that with the Hadoop ecosystem there is no need for Relational Database Management Systems (RDBMS) anymore? The answer is a resounding no!