This is a note on installing and testing DBs, Big Data, ETL, analytics, and business intelligence software.
Environments, DBs
The list below shows where the DBs and software are installed:
- Win10 – MySQL DB, Orange, PySpark
- CentOS7 – Oracle 18c XE
- Ubuntu18, headless – MariaDB, PostgreSQL, Hadoop, HBase, Hive, Spark/PySpark, Mahout, PrestoDB, Pentaho, Luigi, Scriptella
- Ubuntu18 – DB2
- PostgreSQL, https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-on-ubuntu-18-04
- MariaDB, https://www.itzgeek.com/how-tos/linux/ubuntu-how-tos/install-mariadb-on-ubuntu-16-04.html
- Oracle, https://blog.keithkim.com/2019/10/installing-oracle-on-centos.html
- DB2, https://blog.keithkim.com/2020/02/note-install-db2-express-c-on-ubuntu-18.html
Orange - Analytics, data mining
https://orange.biolab.si/
If you are behind a firewall and/or proxy and conda install doesn't work, download and install:
If not, use the Anaconda install steps:
> conda config --add channels conda-forge
> conda install orange3
> activate <conda environment with Orange>
Orange is very promising, with great features and a GUI, but it is not mature enough yet. For example, the only DB it supports is PostgreSQL.
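After the conda steps above, a quick check from Python can confirm that the active environment actually sees the package. This is only a minimal sketch: `is_installed` is a helper name made up for this note, and it assumes the Orange3 package is imported under the module name `Orange`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate module_name."""
    # find_spec returns None for a top-level module that is not installed.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: Orange3 is importable as "Orange".
    print("Orange available:", is_installed("Orange"))
```

Run it inside the activated conda environment. If it reports that Orange is not available, the install went into a different environment or did not complete.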
Spark - Analytics
https://spark.apache.org/
PrestoDB
https://prestodb.io/ - Presto is a distributed SQL query engine that connects to multiple types of data sources, such as Hadoop, RDBMS, and NoSQL.
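The idea of one SQL query spanning several data sources can be tried locally without a Presto cluster. The sketch below is not Presto. It uses SQLite's `ATTACH DATABASE` from the Python standard library as a stand-in, with two separate in-memory databases joined in a single query. The table names, column names, and data are invented for illustration. In Presto itself, the sources would be configured as catalogs and tables addressed as `catalog.schema.table`.

```python
import sqlite3


def federated_query_demo():
    """Join tables that live in two separate databases with one SQL query."""
    # The main connection plays the role of the first data source.
    conn = sqlite3.connect(":memory:")
    # Attach a second, independent in-memory database under the name "sales".
    conn.execute("ATTACH DATABASE ':memory:' AS sales")

    conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)",
        [(1, "Ann"), (2, "Bob")],
    )
    conn.executemany(
        "INSERT INTO sales.orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )

    # One query, two databases: the prefix selects which source a table is in.
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount)
        FROM main.customers AS c
        JOIN sales.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(federated_query_demo())
```

The point is only the shape of the query. Presto does the same kind of join across engines of different types and distributes the work over several nodes.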
Pentaho - ETL, Analytics, Report
https://www.pentaho.com/ - Pentaho consists of multiple packages: ETL, Analytics, Business Intelligence.
Flowable
https://flowable.com - Java based. Seems pretty good.
Luigi - Python based ETL
https://github.com/spotify/luigi
Developed by Spotify. Looks pretty promising.
Python Based Tools
- Bubbles - http://bubbles.databrewery.org
Programming based ETL. I haven't tested it much, but is it better than Pandas (https://pandas.pydata.org)?
- Bonobo - https://www.bonobo-project.org
- Pygrametl - https://chrthomsen.github.io/pygrametl/
- PETL - https://pypi.org/project/petl/
Very different concept - it uses Python, but works like a shell with pipes. Feels like IFTTT for ETL, running in a shell on the local machine.
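The "shell with pipes" idea can be illustrated in plain Python. The sketch below is not the API of any tool listed above. `Step`, `keep_paid`, `to_amount`, and the sample data are all hypothetical names made up for this note. It only shows how small row-processing steps can be chained with `|`, the way commands are chained in a shell.

```python
import csv
import io


class Step:
    """Wrap a function over an iterable of rows so steps chain with |."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # step_a | step_b builds a new step that runs step_a, then step_b.
        return Step(lambda rows: other.func(self.func(rows)))

    def __call__(self, rows):
        return self.func(rows)


def read_csv(text):
    """Extract: yield one dict per CSV row."""
    return csv.DictReader(io.StringIO(text))


# Transform steps: filter rows, then convert the amount column to float.
keep_paid = Step(lambda rows: (r for r in rows if r["status"] == "paid"))
to_amount = Step(lambda rows: (float(r["amount"]) for r in rows))
# Final step: reduce the stream to a single number.
total = Step(sum)

pipeline = keep_paid | to_amount | total

SAMPLE = "status,amount\npaid,10.5\nopen,3.0\npaid,4.5\n"
result = pipeline(read_csv(SAMPLE))
print(result)
```

Each step is lazy (a generator), so rows stream through the chain one at a time, much like data flowing through a shell pipe.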
Scriptella - XML based ETL in Java
Written in Java, https://scriptella.org
If you're familiar with Spring Framework, Spring Batch is another option - https://mkyong.com/tutorials/spring-batch-tutorial/
- Apache Nifi - https://nifi.apache.org
I wasn't too impressed. It still feels like a very early version.
- Apache Airflow - https://airflow.apache.org
Python based, but it doesn't run on Windows. Tools should be cross-platform.
- SQL Workbench, http://www.sql-workbench.net/
- Silk - linked data integration framework, http://silkframework.org
- OpenSemantic - https://opensemanticsearch.org/etl
- Java based - https://blog.panoply.io/18-etl-tools-that-do-more-with-java
- Python based - https://www.xplenty.com/blog/python-etl-2019-a-list-and-comparison-of-the-top-python-etl-tools/