In this blog, we will discuss the terminology related to Big Data ecosystem. This will give you a complete understanding of Big Data and its terms.
Over time, Hadoop has become the nucleus of the Big Data ecosystem, where many new technologies have emerged and have got integrated with Hadoop. So it’s important that, first, we understand and appreciate the nucleus of modern Big Data architecture.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers, using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Components of the Hadoop Ecosystem
Let’s begin by looking at some of the components of the Hadoop ecosystem:
Hadoop Distributed File System (HDFS™):
This is a distributed file system that provides high-throughput access to application data. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this method, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability needed for Big Data processing.
MapReduce is a programming model specifically implemented for processing large data sets on Hadoop cluster. This is the core component of the Hadoop framework, and it is the only execution engine available for Hadoop 1.0.
The MapReduce framework consists of two parts:
1. A function called ‘Map’, which allows different points in the distributed cluster to distribute their work.
2. A function called ‘Reduce’, which is designed to reduce the final form of the clusters’ results into one output.
The main advantage of the MapReduce framework is its fault tolerance, where periodic reports from each node in the cluster are expected as soon as the work is completed.
The MapReduce framework is inspired by the ‘Map’ and ‘Reduce’ functions used in functional programming. The computational processing occurs on data stored in a file system or within a database, which takes a set of input key values and produces a set of output key values.
Each day, numerous MapReduce programs and MapReduce jobs are executed on Google’s clusters. Programs are automatically parallelized and executed on a large cluster of commodity machines.
Map Reduce is used in distributed grep, distributed sort, Web link-graph reversal, Web access log stats, document clustering, Machine Learning and statistical machine translation.
Pig is a data flow language that allows users to write complex MapReduce operations in simple scripting language. Then Pig then transforms those scripts into a MapReduce job.
For more information on pig refer our Beginner’s guide
Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism for querying the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
For more information on hive refer out Beginner’s guide
Enterprises that use Hadoop often find it necessary to transfer some of their data from traditional relational database management systems (RDBMSs) to the Hadoop ecosystem.
Sqoop, an integral part of Hadoop, can perform this transfer in an automated fashion. Moreover, the data imported into Hadoop can be transformed with MapReduce before exporting them back to the RDBMS. Sqoop can also generate Java classes for programmatically interacting with imported data.
Sqoop uses a connector-based architecture that allows it to use plugins to connect with external databases.
For more info about Sqoop and its usage click here.
Flume is a service for streaming logs into Hadoop. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).
Storm is a distributed, real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast and can process over a million records per second per node on a cluster of modest size. Enterprises harness this speed and combine it with other data-access applications in Hadoop to prevent undesirable events or to optimize positive outcomes.
Apache Kafka supports a wide range of use cases such as a general-purpose messaging system for scenarios where high throughput, reliable delivery, and horizontal scalability are important. Apache Storm and Apache HBase both work very well in combination with Kafka.
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. The Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions, whereas the Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and extensible system.
Apache Spark is a fast, in-memory data processing engine for distributed computing clusters like Hadoop. It runs on top of existing Hadoop clusters and accesses the Hadoop data store (HDFS).
Spark can be integrated with Hadoop’s 2 YARN architecture, but cannot be used with Hadoop 1.0.
For more information on spark refer our Beginner’s guide
Get an introduction to RDD’s in python RDD-part1
Apache Solr is a fast, open-source Java search server. Solr enables you to easily create search engines that search websites, databases, and files for Big Data.
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.
YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software Foundation’s open-source distributed processing framework. Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications.
Tez is an execution engine for Hadoop that allows jobs to meet the demands for fast response times and extreme throughput at petabyte scale. Tez represents computations as a dataflow graphs and can be used with Hadoop 2 YARN.
Apache Drill is an open-source, low-latency query engine for Hadoop that delivers secure, interactive SQL analytics at petabyte scale. With the ability to discover schemas on the go, Drill is a pioneer in delivering self-service data exploration capabilities on data stored in multiple formats in files or NoSQL databases. By adhering to ANSI SQL standards, Drill does not require a learning curve and integrates seamlessly with visualization tools.
Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and co-ordinates the running of those scans to produce a regular JDBC result set. Apache Phoenix enables OLTP and operational analytics in Hadoop for low-latency applications by combining the best of both worlds. Apache Phoenix is fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and Map Reduce.
Cloud Computing is a type of computing that relies on sharing computing resources rather than having local servers or personal devices to handle applications. Cloud Computing is comparable to grid computing, a type of computing where the unused processing cycles of all computers in a network are harnessed to solve problems that are too processor-intensive for any single machine.
In Cloud Computing, the word cloud (also phrased as “the cloud”) is used as a metaphor for the Internet, hence the phrase cloud computing means “a type of Internet-based computing” in which different services such as servers, storage and applications are delivered to an organization’s computers and devices via the Internet.
The NoSQL database, also called Not Only SQL, is an approach to data management and database design that’s useful for very large sets of distributed data. This database system is non-relational, distributed, open-source and horizontally scalable. NoSQL seeks to solve the scalability and big-data performance issues that relational databases weren’t designed to address.
Apache Cassandra is an open-source distributed database system designed for storing and managing large amounts of data across commodity servers. Cassandra can serve as both a real-time operational data store for online transactional applications and a read-intensive database for large-scale business intelligence (BI) systems.
Amazon Simple Database Service (SimpleDB), also known as a key value data store, is a highly available and flexible non-relational database that allows developers to request and store data, with minimal database management and administrative responsibility.
This service offers simplified access to a data store and query functions that let users instantly add data and effortlessly recover or edit that data.
Google’s BigTable is a distributed, column-oriented data store created by Google Inc. to handle very large amounts of structured data associated with the company’s Internet search and Web services operations.
BigTable was designed to support applications requiring massive scalability; from its first iteration, the technology was intended to be used with petabytes of data. The database was designed to be deployed on clustered systems and uses a simple data model that Google has described as “a sparse, distributed, persistent multidimensional sorted map.” Data is assembled in order by row key, and indexing of the map is arranged according to row, column keys, and timestamps. Here, compression algorithms help achieve high capacity.
MongoDB is a cross-platform, document-oriented database. Classified as a NoSQL database, MongoDB shuns the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.
MongoDB is developed by MongoDB Inc. and is published as free and open-source software under a combination of the GNU Affero General Public License and the Apache License. As of July 2015, MongoDB is the fourth most popular type of database management system, and the most popular for document stores.
Apache HBase (Hadoop DataBase) is an open-source NoSQL database that runs on the top of the database and provides real-time read/write access to those large data sets.
HBase scales linearly to handle huge data sets with billions of rows and millions of columns, and it easily combines data sources that use a wide variety of different structures and schema. HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
For more information on HBase refer to our Beginner’s guide
Neo4j is a graph database management system developed by Neo Technology, Inc. Neo4j is described by its developers as an ACID-compliant transactional database with native graph storage and processing. According to db-engines.com, Neo4j is the most popular graph database.
CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of CouchDB. You can distribute your data, or your apps, efficiently using CouchDB’s incremental replication. CouchDB supports master-master setups with automatic conflict detection.
Big Data Related Job Roles:
A Data Analyst is someone who analyzes large data sets, draws inferences from them, and projects this to the management using reporting tools. A Data Analyst usually has a degree in Computer Science or an M.B.A and additionally needs the following technical skills:
Knowledge of statistical programming languages like R and SAS to manipulate data.
Knowledge of programming languages like Python, or Ruby for web development. or familiarity with HTML and Java Scripting for front-end development to present data.
Knowledge of SQL querying.
Knowledge of Excel.
Knowledge of open source tools like Hadoop, Hive, Pig, Impala, and HBase to improve productivity for analysis tasks.
A Hadoop developer is responsible for the actual coding/programming of Hadoop applications. This role is synonymous to that of a software developer or application developer. One component of Hadoop is MapReduce, for which you need to write Java programs. So, if you have basic knowledge of Java, it’s sufficient. Even if you don’t have Java knowledge but know any other programming language, you can quickly catch up.
Ability to write MapReduce jobs.
Experience in writing Pig Latin scripts.
Hands on experience in HiveQL.
Familiarity with data loading tools like Flume, Sqoop.
Knowledge of workflow/schedulers like Oozie.
A Hadoop Administrator is responsible for the implementation and ongoing administration of the Hadoop infrastructure. His role is to align with the systems engineering team to propose and deploy new hardware and software environments required for Hadoop and to expand existing environments. He needs to work with the data-delivery teams to set up new Hadoop users. This job includes setting up Linux users, setting up Kerberos principals, and testing HDFS, Hive, Pig and MapReduce access for new users, cluster maintenance, and the creation and removal of nodes using tools like Ganglia, Nagios, Cloudera Manager Enterprise, Dell Open Manage and other tools.
A Data Scientist is a statistician and software engineer rolled into one. A Data Scientist should be an expert in math, statistics, technology, and business. But in reality, it’s not possible for a single person to be an expert in all these areas. As a result, Data Science teams have team members who have expertise in one area but are able to talk to other team members with expertise in another skill.
Now we will discuss Introduction to Machine learning and the terms related to it.
Machine Learning is a sub-field of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined Machine Learning as a “field of study that gives computers the ability to learn without being explicitly programmed.” Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than by following strictly static program instructions.
An algorithm is mathematical “logic”, or a set of rules used to make calculations. Starting with an initial input (which may be zero or null), the logic or rules are coded or written into software as a set of steps to be followed when conducting calculations, processing data or performing other functions, eventually leading to an output.
Predictive Analytics refers to the analysis of Big Data to make predictions and determine the likelihood of future outcomes, trends, or events. In business, it can be used to model various scenarios for how customers will react to new product offerings or promotions, or how the supply chain might be affected by extreme weather patterns or demand spikes. Predictive Analytics may involve various statistical techniques such as modeling, Machine Learning and Data Mining.
R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, surveys of data miners, and studies of scholarly literature databases show that R’s popularity has increased substantially in recent years.
Python is a powerful multi-paradigm computer programming language. Python is heavily used in numeric programming, a domain that would not traditionally have been considered to be in the scope of scripting languages. Nevertheless, it has grown to become one of Python’s most compelling use cases.
Cluster Analysis, or Clustering, involves grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task in exploratory data mining, and a common technique for statistical data analysis used in many fields, including Machine Learning, pattern recognition, image analysis, information retrieval, bioinformatics and data compression.
Classification is a data-mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.
A Classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case
Natural Language Processing (NLP) is a branch of artificial intelligence that deals with making human language (in both written and spoken forms) comprehensible to computers. As a scientific discipline, NLP involves tasks such as identifying sentence structure and boundaries in documents, detecting key words or phrases in audio recordings, extracting relationships between documents, and uncovering meaning in informal or slang speech patterns. NLP can make it possible to analyze and recognize patterns in verbal data that is currently unstructured.
Sentiment Analysis or Opinion Mining:
Sentiment Analysis involves the capture and tracking of opinions, emotions or feelings expressed by consumers in various types of interactions or documents, including social media, calls to customer service representatives, surveys and the like. Text Analytics and Natural Language Processing are typical activities within a process of Sentiment Analysis. The goal is to determine or assess the sentiments or attitudes expressed toward a company, product, service, person, or event.
We hope this blog has helped you understand some of the terminology associated with Big Data.