February 16, 2020

Big Data Technologies

*Descriptions are from each technology's site.

I did some work on Hadoop around 2008-2010. It's 2020, and it seems like I might have to use it again. Hadoop was painful to use back then. It has improved a lot, and there are now plenty of good tools around it. At least the hype has died down, and more people have experience with it, so views on big data and Hadoop are more realistic now.

Here, I'm listing big data and Hadoop-related technologies, since there are so many of them.



Hadoop - a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
http://hadoop.apache.org

HDFS - the Hadoop Distributed File System; stores large files across the cluster with block replication.

https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
https://cwiki.apache.org/confluence/display/HADOOP2/HDFS/
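
If I end up scripting against HDFS from Python, the third-party "hdfs" package (a WebHDFS client) looks like the low-friction route. A minimal sketch, assuming a NameNode with WebHDFS enabled; the host, port, user, and paths are placeholders:

    # pip install hdfs
    from hdfs import InsecureClient

    # WebHDFS endpoint of the NameNode; 9870 is the default port in Hadoop 3.x.
    client = InsecureClient('http://namenode-host:9870', user='hdfs')

    # Write a small text file into HDFS.
    with client.write('/tmp/example.txt', overwrite=True, encoding='utf-8') as writer:
        writer.write('hello from webhdfs\n')

    # List a directory and read the file back.
    print(client.list('/tmp'))
    with client.read('/tmp/example.txt', encoding='utf-8') as reader:
        print(reader.read())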

Hadoop MapReduce - A YARN-based system for parallel processing of large data sets.
https://en.wikipedia.org/wiki/MapReduce
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
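
To jog my memory: with Hadoop Streaming, the mapper and reducer can be written in any language that reads stdin and writes stdout. A minimal word-count sketch in Python; the jar path and input/output directories are placeholders:

    #!/usr/bin/env python3
    # wordcount.py - run the same script as both mapper and reducer:
    #
    #   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    #       -input /data/in -output /data/out \
    #       -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' \
    #       -file wordcount.py
    import sys
    from itertools import groupby

    def map_phase():
        # Emit "word<TAB>1" for every word read from stdin.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reduce_phase():
        # Streaming sorts mapper output by key, so identical words arrive
        # adjacently; sum the counts for each run of equal keys.
        pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        map_phase() if sys.argv[1] == "map" else reduce_phase()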

Bigtop - for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. Bigtop supports a wide range of components/projects, including, but not limited to, Hadoop, HBase and Spark.

http://bigtop.apache.org

YARN - framework for job scheduling and cluster resource management
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

Hadoop Ozone - An object store for Hadoop.
https://hadoop.apache.org/ozone/

Hadoop Submarine - A machine learning engine for Hadoop.
https://hadoop.apache.org/submarine/

Ambari - A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
https://ambari.apache.org

Avro - A data serialization system.
https://avro.apache.org
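
A minimal sketch of writing and reading Avro from Python with the official "avro" package; the schema and file name are made up for illustration (and depending on the package version, the parse function may be spelled avro.schema.Parse):

    # pip install avro
    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    schema = avro.schema.parse("""
    {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "age",  "type": "int"}
      ]
    }
    """)

    # Write two records; the schema travels inside the file.
    writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
    writer.append({"name": "alice", "age": 30})
    writer.append({"name": "bob", "age": 25})
    writer.close()

    # Read them back; no external schema needed.
    reader = DataFileReader(open("users.avro", "rb"), DatumReader())
    for user in reader:
        print(user)
    reader.close()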

Parquet - columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
https://parquet.apache.org/
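
A minimal sketch with pyarrow; the table contents and file name are made up:

    # pip install pyarrow
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "user_id": [1, 2, 3],
        "clicks":  [10, 0, 7],
    })

    # Columnar, compressed on-disk format; snappy is the usual default codec.
    pq.write_table(table, "clicks.parquet", compression="snappy")

    # Read back only the columns you need - the point of columnar storage.
    print(pq.read_table("clicks.parquet", columns=["clicks"]))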

ORC - the smallest, fastest columnar storage for Hadoop. Includes support for ACID transactions and snapshot isolation.
https://orc.apache.org

Arrow - a cross-language development platform for in-memory data, built around a standardized columnar memory format.
https://arrow.apache.org/
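
Arrow is what pyarrow builds everything on: an in-memory columnar table that can be shared across languages and processes without re-serializing. A minimal sketch with made-up data:

    import pyarrow as pa

    table = pa.table({
        "word":  ["hadoop", "spark", "flink"],
        "count": [3, 5, 2],
    })

    print(table.schema)
    print(table.column("count"))   # columnar access, no row iteration

    # Hand off to a pandas DataFrame (requires pandas installed).
    print(table.to_pandas())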

Impala - native analytic database for Apache Hadoop
http://impala.apache.org/index.html
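
Impala speaks SQL over HiveServer2-style connections; from Python, the impyla package works. A minimal sketch, with placeholder host and table:

    # pip install impyla
    from impala.dbapi import connect

    conn = connect(host="impalad-host", port=21050)  # 21050: Impala's HS2 port
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM web_logs")
    print(cur.fetchall())
    cur.close()
    conn.close()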

Flink - Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams
https://flink.apache.org/usecases.html
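
Flink ships a Python Table API (PyFlink). A minimal batch-mode sketch with made-up rows; the exact API has shifted between Flink versions, so treat this as the general shape rather than gospel:

    # pip install apache-flink
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

    table = t_env.from_elements(
        [("hadoop", 3), ("spark", 5), ("hadoop", 2)],
        ["word", "cnt"],
    )

    # Register the table and aggregate with SQL; print results to stdout.
    t_env.create_temporary_view("word_counts", table)
    t_env.sql_query(
        "SELECT word, SUM(cnt) AS total FROM word_counts GROUP BY word"
    ).execute().print()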

Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer.
https://kudu.apache.org/

Cassandra - A scalable multi-master database with no single points of failure.
https://cassandra.apache.org
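
A minimal sketch with the DataStax Python driver; the contact point, keyspace, and table are placeholders:

    # pip install cassandra-driver
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute(
        "CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)"
    )
    session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "alice"))

    for row in session.execute("SELECT id, name FROM demo.users"):
        print(row.id, row.name)

    cluster.shutdown()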

Chukwa - A data collection system for managing large distributed systems.
https://chukwa.apache.org

HBase - A scalable, distributed database that supports structured data storage for large tables.

A column-oriented, non-relational database that runs on top of Hadoop.
https://hbase.apache.org
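
From Python, happybase talks to HBase through its Thrift gateway. A minimal sketch; the host, table, and column family are placeholders, and the HBase Thrift server must be running:

    # pip install happybase
    import happybase

    connection = happybase.Connection("hbase-thrift-host")
    table = connection.table("users")

    # Everything is bytes: row key, column (family:qualifier), and value.
    table.put(b"row-1", {b"info:name": b"alice", b"info:age": b"30"})

    print(table.row(b"row-1"))
    for key, data in table.scan(row_prefix=b"row-"):
        print(key, data)

    connection.close()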

Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying. Data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
https://hive.apache.org
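
A minimal sketch of querying Hive over HiveServer2 with PyHive; the host, username, and table are placeholders:

    # pip install 'pyhive[hive]'
    from pyhive import hive

    conn = hive.connect(host="hiveserver2-host", port=10000, username="hadoop")
    cur = conn.cursor()
    cur.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
    for page, hits in cur.fetchall():
        print(page, hits)
    cur.close()
    conn.close()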

Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
https://flume.apache.org/download.html

Mahout - A scalable machine learning and data mining library.
A distributed linear algebra framework and mathematically expressive Scala DSL designed to let mathematicians, statisticians, and data scientists quickly implement their own algorithms. Apache Spark is the recommended out-of-the-box distributed back-end, but it can be extended to other distributed backends.
https://mahout.apache.org

Pig - A high-level data-flow language and execution framework for parallel computation.
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
https://pig.apache.org

Oozie - a workflow scheduler system to manage Apache Hadoop jobs
https://oozie.apache.org/

Spark - a unified analytics engine for large-scale data processing.
Can be used for: ETL, machine learning, stream processing, and graph computation.
https://spark.apache.org
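
For contrast with the Streaming word count above, the same job in PySpark is a few lines. A minimal sketch; the input path is a placeholder:

    # pip install pyspark
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    lines = spark.read.text("hdfs:///data/in")   # single column named "value"
    words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    counts = words.filter(F.col("word") != "").groupBy("word").count()

    counts.orderBy(F.desc("count")).show(10)
    spark.stop()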

Tez - an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.
https://tez.apache.org

ZooKeeper - a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
https://zookeeper.apache.org
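
A minimal sketch with kazoo, the usual Python ZooKeeper client; the connection string and znode paths are placeholders:

    # pip install kazoo
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # znodes form a small hierarchical namespace, like a file system.
    zk.ensure_path("/app/config")
    zk.create("/app/config/db_url", b"jdbc:mysql://db:3306/app")

    data, stat = zk.get("/app/config/db_url")
    print(data.decode(), "version:", stat.version)

    zk.stop()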

Mesos - built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.
http://mesos.apache.org/

Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
https://sqoop.apache.org
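
Sqoop is a command-line tool; in practice you run it from a shell or an Oozie action, but for completeness here is a sketch of driving an import from Python. The JDBC URL, credentials, and paths are placeholders:

    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host/shop",
        "--username", "etl",
        "--password-file", "/user/etl/.password",  # avoids password on the CLI
        "--table", "orders",
        "--target-dir", "/data/orders",            # destination dir in HDFS
        "--num-mappers", "4",                      # parallel import tasks
    ], check=True)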




UPDATE-20200305: 
This page has ALL Open Source Apache Big Data Projects - https://dzone.com/articles/looking-at-all-the-open-source-apache-big-data-pro
