*Descriptions are from each technology's site.
I did some work on Hadoop around 2008-2010. It's 2020, and it seems like I might have to use it again. Hadoop was painful to use back then. It has improved a lot, and there are now many good tools around it. At least the hype has died down, and more people have experience with it, so there's a more realistic view of big data and Hadoop.
Here, I'm listing big data and Hadoop-related technologies, as there are so many of them.
Hadoop - a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
http://hadoop.apache.org
HDFS - a distributed file system that provides high-throughput access to application data
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
https://cwiki.apache.org/confluence/display/HADOOP2/HDFS/
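For a quick taste, here's a minimal sketch using the third-party Python hdfs (WebHDFS) client; the NameNode URL, user, and paths are placeholders for your cluster:

    from hdfs import InsecureClient

    # Talk to the NameNode's WebHDFS endpoint (placeholder host/port).
    client = InsecureClient('http://localhost:9870', user='hadoop')

    # Write a small file into HDFS, then read it back.
    client.write('/tmp/hello.txt', data=b'hello hdfs', overwrite=True)
    with client.read('/tmp/hello.txt') as reader:
        print(reader.read())

    # List a directory, much like `hdfs dfs -ls /tmp`.
    print(client.list('/tmp'))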
Hadoop MapReduce - A YARN-based system for parallel processing of large data sets.
https://en.wikipedia.org/wiki/MapReduce
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
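The programming model is easiest to see with the classic word count. Here's a minimal sketch as two Hadoop Streaming scripts (Streaming runs any executables that read stdin and write stdout, so plain Python works):

    #!/usr/bin/env python3
    # mapper.py - emit "word<TAB>1" for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py - input arrives sorted by key, so counts for the
    # same word are adjacent and can be summed in one pass.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, _, value = line.rstrip("\n").partition("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

On a cluster these get wired together with the Hadoop Streaming jar, roughly: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (the jar's exact path varies by distribution).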
Bigtop - for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. Bigtop supports a wide range of components/projects, including, but not limited to, Hadoop, HBase and Spark.
http://bigtop.apache.org
YARN - framework for job scheduling and cluster resource management
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
Hadoop Ozone - An object store for Hadoop.
https://hadoop.apache.org/ozone/
Hadoop Submarine - A machine learning engine for Hadoop.
https://hadoop.apache.org/submarine/
Ambari - A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
https://ambari.apache.org
Avro - A data serialization system.
https://avro.apache.org
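A minimal sketch with the third-party fastavro package (the schema and file name are just examples): define a schema, write records to an Avro container file, and read them back:

    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "namespace": "example",
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
        ],
    })

    records = [{"name": "alice", "age": 30}, {"name": "bob", "age": 25}]

    # The schema travels inside the file, so readers don't need it up front.
    with open("users.avro", "wb") as out:
        writer(out, schema, records)

    with open("users.avro", "rb") as fo:
        for record in reader(fo):
            print(record)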
Parquet - columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
https://parquet.apache.org/
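A minimal sketch with pyarrow (file name and columns are just examples); the payoff of the columnar layout is that you can read back only the columns you need:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"user": ["alice", "bob"], "clicks": [10, 3]})
    pq.write_table(table, "clicks.parquet")

    # Column projection: only the "clicks" column is read from disk.
    print(pq.read_table("clicks.parquet", columns=["clicks"]))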
ORC - the smallest, fastest columnar storage for Hadoop workloads. Includes support for ACID transactions and snapshot isolation.
https://orc.apache.org
Arrow - a cross-language development platform for in-memory data; it specifies a standardized, language-independent columnar memory format
https://arrow.apache.org/
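A minimal pyarrow sketch of the in-memory format plus Arrow IPC, which is how processes (possibly in different languages) hand columnar data to each other without copying or re-parsing:

    import pyarrow as pa

    # Columnar arrays live in contiguous memory buffers.
    ids = pa.array([1, 2, 3])
    scores = pa.array([0.5, 0.9, 0.1])
    batch = pa.record_batch([ids, scores], names=["id", "score"])

    # Serialize with the Arrow IPC stream format...
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)

    # ...and read it back; another Arrow library would see the same layout.
    with pa.ipc.open_stream(sink.getvalue()) as stream:
        for b in stream:
            print(b.to_pydict())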
Impala - native analytic database for Apache Hadoop
http://impala.apache.org/index.html
Flink - Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams
https://flink.apache.org/usecases.html
Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer.
https://kudu.apache.org/
Cassandra - A scalable multi-master database with no single points of failure.
https://cassandra.apache.org
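A minimal sketch with the DataStax cassandra-driver package; the contact point and keyspace are placeholders:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute(
        "CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)"
    )

    # Any node can take writes - there is no single master to go through.
    session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "alice"))
    for row in session.execute("SELECT id, name FROM demo.users"):
        print(row.id, row.name)
    cluster.shutdown()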
Chukwa - A data collection system for managing large distributed systems.
https://chukwa.apache.org
HBase - A scalable, distributed database that supports structured data storage for large tables.
Column-oriented, non-relational database built on top of Hadoop/HDFS.
https://hbase.apache.org
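A minimal sketch with the third-party happybase client, which goes through HBase's Thrift gateway; the host, table, and "cf" column family are placeholders, and the table is assumed to already exist:

    import happybase

    connection = happybase.Connection("localhost")
    table = connection.table("users")

    # Rows are keyed byte strings; columns live inside column families.
    table.put(b"row-1", {b"cf:name": b"alice", b"cf:age": b"30"})
    print(table.row(b"row-1")[b"cf:name"])

    # Scans walk rows in key order.
    for key, data in table.scan(row_prefix=b"row-"):
        print(key, data)
    connection.close()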
Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying. Data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
https://hive.apache.org
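A minimal sketch with the third-party PyHive package, sending HiveQL to HiveServer2; the host, port, and web_logs table are placeholders:

    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)
    cursor = conn.cursor()

    # Plain SQL over data sitting in HDFS; Hive compiles it to batch jobs.
    cursor.execute("""
        SELECT ip, COUNT(*) AS visits
        FROM web_logs
        GROUP BY ip
        ORDER BY visits DESC
        LIMIT 10
    """)
    for row in cursor.fetchall():
        print(row)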
Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
https://flume.apache.org/download.html
Mahout - A scalable machine learning and data mining library.
A distributed linear algebra framework and mathematically expressive Scala DSL designed to let mathematicians, statisticians, and data scientists quickly implement their own algorithms. Apache Spark is the recommended out-of-the-box distributed back-end, or can be extended to other distributed backends.
https://mahout.apache.org
Pig - A high-level data-flow language and execution framework for parallel computation.
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
https://pig.apache.org
Oozie - a workflow scheduler system to manage Apache Hadoop jobs
https://oozie.apache.org/
Spark - a unified analytics engine for large-scale data processing.
Can be used for: ETL, machine learning, stream processing, and graph computation.
https://spark.apache.org
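A minimal PySpark sketch - the obligatory word count, via the DataFrame API (the input path is a placeholder):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    lines = spark.read.text("hdfs:///data/input.txt")  # placeholder path
    counts = (lines
              .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
              .groupBy("word")
              .count()
              .orderBy(F.col("count").desc()))
    counts.show(10)
    spark.stop()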
Tez - an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.
https://tez.apache.org
ZooKeeper - a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
https://zookeeper.apache.org
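A minimal sketch with the third-party kazoo client; the ensemble address and znode paths are placeholders:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # znodes form a small hierarchical namespace, like a filesystem.
    if zk.exists("/app/config/flag") is None:
        zk.create("/app/config/flag", b"on", makepath=True)

    # Watches fire when the znode changes - handy for coordination.
    @zk.DataWatch("/app/config/flag")
    def on_change(data, stat):
        print("flag is now", data)

    zk.set("/app/config/flag", b"off")
    zk.stop()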
Mesos - built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.
http://mesos.apache.org/
Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
https://sqoop.apache.org
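Sqoop is driven from the command line; here's a minimal sketch shelling out to it from Python (the JDBC URL, credentials, and paths are placeholders):

    import subprocess

    # Pull a relational table into HDFS with 4 parallel map tasks.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",
        "--username", "etl",
        "--password-file", "/user/etl/.password",
        "--table", "orders",
        "--target-dir", "/data/orders",
        "--num-mappers", "4",
    ], check=True)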
UPDATE-20200305:
This page has ALL Open Source Apache Big Data Projects - https://dzone.com/articles/looking-at-all-the-open-source-apache-big-data-pro