February 16, 2020

Big Data Technologies

*Descriptions are from each technology's site.

I did some work on Hadoop around 2008-2010.  It's 2020, and it seems like I might have to use it again.  Hadoop was painful to use back then.  It has improved a lot since, and there are now a lot of good tools around it.  At least the hype has died down and more people have experience with it, so there is a more realistic view of big data and Hadoop.

Here I'm listing big data and Hadoop-related technologies, since there are so many of them.

Hadoop - a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

HDFS - distributed file system


Hadoop MapReduce - A YARN-based system for parallel processing of large data sets.
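
To make the programming model concrete, here is a minimal sketch of the map, shuffle, and reduce steps as a word count in plain Python. It only illustrates the idea on a single machine and is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    # Run the three phases in sequence on an in-memory list of lines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls would run in parallel on different machines, and the shuffle would move data over the network; here everything runs in one process.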

BigTop - for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. Bigtop supports a wide range of components/projects, including, but not limited to, Hadoop, HBase and Spark.


YARN - framework for job scheduling and cluster resource management

Hadoop Ozone - An object store for Hadoop.

Hadoop Submarine - A machine learning engine for Hadoop.

Ambari - A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

Avro - A data serialization system.

Parquet - columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
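
As a rough illustration of what "columnar" means, the sketch below pivots row-oriented records into one list per column, so a query that needs a single column only touches that list. This is a toy in-memory layout, not the Parquet file format or any Parquet library; to_columns and the sample records are hypothetical, and the sketch assumes every record has the same fields.

```python
def to_columns(rows):
    # Pivot a list of row dicts into a dict of column lists.
    # Assumes every row has the same keys (illustration only).
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical sample data in row-oriented form.
rows = [
    {"id": 1, "price": 2.5},
    {"id": 2, "price": 4.0},
]

columns = to_columns(rows)
# An aggregate over one column reads only that column's values.
total_price = sum(columns["price"])
```

Storing values of the same column together is also what makes columnar formats compress well, since neighbouring values tend to be similar.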

ORC - smallest, fastest columnar storage for Hadoop. Includes support for ACID transactions and snapshot isolation.

Arrow - cross-language development platform for in-memory data, with a columnar memory format.

Impala - native analytic database for Apache Hadoop

Flink - Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams

Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer.

Cassandra - A scalable multi-master database with no single points of failure.

Chukwa - A data collection system for managing large distributed systems.

HBase - A scalable, distributed database that supports structured data storage for large tables.

Column-oriented, non-relational database that runs on top of Hadoop.

Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying. Data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Mahout - A scalable machine learning and data mining library.
A distributed linear algebra framework and mathematically expressive Scala DSL designed to let mathematicians, statisticians, and data scientists quickly implement their own algorithms. Apache Spark is the recommended out-of-the-box distributed back-end, or can be extended to other distributed backends.

Pig - A high-level data-flow language and execution framework for parallel computation.
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

Oozie - a workflow scheduler system to manage Apache Hadoop jobs

Spark - a unified analytics engine for large-scale data processing.
Can be used for: ETL, machine learning, stream processing, and graph computation.

Tez - an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.
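
To show what a directed acyclic graph of tasks looks like, here is a small sketch that uses Python's standard graphlib module (Python 3.9+) to work out a valid run order. It has nothing to do with the Tez API; the task names and the execution_order helper are invented for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
tasks = {
    "load_orders": set(),
    "load_users": set(),
    "join": {"load_orders", "load_users"},
    "aggregate": {"join"},
    "write_report": {"aggregate"},
}


def execution_order(graph):
    # Return one order in which every task runs after its dependencies.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(tasks)
```

The two load tasks have no dependency on each other, so a scheduler is free to run them in parallel. If the graph contained a cycle, static_order would raise graphlib.CycleError.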

ZooKeeper - a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Mesos - built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.

Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

This page lists all the open source Apache big data projects: https://dzone.com/articles/looking-at-all-the-open-source-apache-big-data-pro
