February 16, 2020

Big Data Technologies

*Descriptions are from each technology's site.

I did some work on Hadoop around 2008-2010.  It's 2020, and it seems like I might have to use it again.  Hadoop was painful to use back then.  It has improved a lot, and there are now many good tools around it.  At least the hype has died down, and more people have experience with it, and thus a more realistic view of big data and Hadoop.

Here, I'm listing big data and Hadoop related technologies as there are so many of them.

Hadoop - a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

HDFS - distributed file system


Hadoop MapReduce - A YARN-based system for parallel processing of large data sets.

BigTop - for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. Bigtop supports a wide range of components/projects, including, but not limited to, Hadoop, HBase and Spark.


YARN - framework for job scheduling and cluster resource management

Hadoop Ozone - An object store for Hadoop.

Hadoop Submarine - A machine learning engine for Hadoop.

Ambari - A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

Avro - A data serialization system.

Parquet - columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

ORC - the smallest, fastest columnar storage for Hadoop. Includes support for ACID transactions and snapshot isolation.

Arrow - a cross-language development platform for in-memory data; a columnar memory format.

Impala - native analytic database for Apache Hadoop

Flink - Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams

Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer.

Cassandra - A scalable multi-master database with no single points of failure.

Chukwa - A data collection system for managing large distributed systems.

HBase - A scalable, distributed database that supports structured data storage for large tables.

Column-oriented, non-RDBMS on top of Hadoop.

Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying. Data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Mahout - A Scalable machine learning and data mining library.
A distributed linear algebra framework and mathematically expressive Scala DSL designed to let mathematicians, statisticians, and data scientists quickly implement their own algorithms. Apache Spark is the recommended out-of-the-box distributed back-end, or can be extended to other distributed backends.

Pig - A high-level data-flow language and execution framework for parallel computation.
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

Oozie - a workflow scheduler system to manage Apache Hadoop jobs

Spark - a unified analytics engine for large-scale data processing.
Can be used for: ETL, machine learning, stream processing, and graph computation.

Tez - an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.

ZooKeeper - a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Mesos - built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.

Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

February 15, 2020

Notes on Parquet and ORC

ORC (Optimized Row Columnar)
  • flattened data
  • lightweight index + bloom filter
  • better compression
  • better with Hive
  • much less GC

Parquet
  • nested data
  • better with Spark

Note is in progress

Serverless: fnproject, OpenLambda

As far as I remember, Google was the first commercial service to provide a serverless computing environment.  It wasn't easy to use in the beginning, but it scaled very well.  Not sure if it was cost-effective, though.

For a startup, it was a very reasonable environment.  That was before containers like Docker, and before AWS was popular.  A friend's startup used it for its backend back then.  When the product started to get traffic spikes, scaling it was extremely simple (but wasn't cheap).

I've been waiting for a cheap, affordable serverless computing environment for personal use.  It doesn't have to be a distributed or cloud environment.  I just want a simple environment that I can run on one node or more, and deploy simple applications to it -- a concept somewhat similar to dropping a WAR into Tomcat, or like OSGi (e.g. Apache Felix - http://felix.apache.org/).  But those are Java-only, and not for my purpose (small, simple, light apps).

After numerous searches, I found two serverless implementations that can run in a small environment, or even on a single node: fnproject and OpenLambda.  Both of them are open source.


fn is a very mature and complete project, supporting many languages (Go, Java, Node, Python, Ruby).  It was very simple to use and deploy... but the performance wasn't great.  I believe the implementation uses Docker -- a container inside a container.

Requires docker, golang

PROS: simple and easy to install.  Support multiple languages
CONS: slow


OpenLambda isn't as mature/complete as fn, and I failed to install it on an Ubuntu 18 VM.  Don't know why.  Compilation finished fine, but 'make test-all' failed.  To compile, docker, gcc, make, and the latest golang are needed (Go 1.12 and 1.13 worked), but the tests failed.  It probably takes the same approach as fn, since it also requires docker: the container-in-container approach.

PROS: n/a (couldn't test)
CONS: n/a


IronFunctions, https://github.com/iron-io/functions - seems like development has stopped.  It looks like it uses the same Docker-based approach.  Didn't try this, and you shouldn't either.  Just listing it here in case someone is interested.


So the search continues.  I'll probably use AWS Lambda; the cost for small experiments/use shouldn't be too much.


[Note] old S/W - SparkleShare, Detachtty, QCL, Quipper

[This is just to leave a record.]

While cleaning up an old VM (CentOS 6), I found some software I played around with several years ago:

SparkleShare - using OwnCloud instead - https://owncloud.org/, blogged here - https://blog.keithkim.com/2018/10/public-cloud-storage-and-owncloud-for.html

Detachtty - no more LISP web app

Quantum computer - using https://qiskit.org/ instead, as blogged here - https://blog.keithkim.com/2019/11/quantum-computer-getting-started-qiskit.html

There were also BigTop, Jenkins, other Lisp (personal) projects on that VM.  I'm now installing another dev VM with CentOS 8.

Technologies come and go so fast.

February 14, 2020

[Note] Install Mahout

Environment: Ubuntu 18 64-bit Server
Prerequisite: Hadoop


$ wget http://archive.apache.org/dist/mahout/0.13.0/apache-mahout-distribution-0.13.0.tar.gz

Uncompress and install at /opt/mahout

$ vi .bashrc
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export MAHOUT_HOME=/opt/mahout
# use local
#export MAHOUT_LOCAL=true
# use with hadoop
export MAHOUT_LOCAL=""

$ . ~/.bashrc


$ mkdir test; cd test
$ wget http://www.grouplens.org/system/files/ml-100k.zip
$ unzip ml-100k.zip
$ cd ml-100k
$ hdfs dfs -mkdir -p /user/hduser
$ hadoop fs -put u.data u.data

$ mahout recommenditembased -s SIMILARITY_COOCCURRENCE --input u.data --output output.txt

More Info


February 13, 2020

[Note] Installing Hive, HBase

Environment: Ubuntu 18 Server 64-bit on VirtualBox host Windows 10
Prerequisite: Hadoop Install

Install Hive

Download, http://apache-mirror.8birdsvideo.com/hive/hive-2.3.6/apache-hive-2.3.6-bin.tar.gz

Uncompress and install in /opt/hive

Prep env vars

$ vi ~/.bashrc
# --- hive
export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$(hadoop classpath)

$ . ~/.bashrc

Copy and set environment script

$ cd /opt/hive/conf
$ cp hive-env.sh.template hive-env.sh
$ vi hive-env.sh

Copy default config

$ cd /opt/hive/conf
$ cp hive-default.xml.template hive-default.xml

Prep Hadoop FS

$ hdfs dfs -mkdir -p /user/hduser/warehouse
$ hdfs dfs -mkdir /tmp
$ hdfs dfs -chmod g+w /user/hduser/warehouse
$ hdfs dfs -chmod g+w /tmp

Prep metadata db

$ schematool -initSchema -dbType derby


$ hive
hive> create database test;
hive> show databases;

See here for more info:

Install HBase


Uncompress and install at /opt/hbase


$ vi /opt/hbase/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

$ vi ~/.bashrc
export HBASE_HOME=/opt/hbase
export PATH=$PATH:$HBASE_HOME/bin

$ mkdir -p /data/hbase/hfiles
$ mkdir -p /data/hbase/zookeeper

$ vi /opt/hbase/conf/hbase-site.xml
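The properties I used weren't captured in this note.  Since Hadoop is a prerequisite and the note later checks /hbase on HDFS, a minimal sketch might look like this (the NameNode address hdfs://bigdata:9000 is an assumption -- match it to your core-site.xml; for a purely local setup, file:///data/hbase/hfiles would work for hbase.rootdir instead):

```xml
<configuration>
  <!-- where HBase stores its data; assumes HDFS at bigdata:9000 -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://bigdata:9000/hbase</value>
  </property>
  <!-- local directory for the embedded ZooKeeper, created above -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/data/hbase/zookeeper</value>
  </property>
</configuration>
```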


/opt/hbase/bin$ ./start-hbase.sh


$ hadoop fs -ls /hbase

See here for more info, https://hbase.apache.org/book.html#quickstart

hbase(main):001:0> create 'test', 'cf'
hbase(main):002:0> list 'test'

February 10, 2020

[Note] Installing Hadoop 2 on Ubuntu 18

Environment: Ubuntu 18 64-bit, Java 8 installed, host name=bigdata
This is a note on installing Hadoop2.

# optional: create user:group for hadoop.  I use hduser:hadoop

# set up password-less ssh (e.g., ssh-keygen, then append the public key to ~/.ssh/authorized_keys)

# get the hadoop2 file, uncompress, and install in /opt/hadoop
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz

# vi ~/.bashrc and add:
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop

export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

# vi /opt/hadoop/etc/hadoop/hadoop-env.sh, and add:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# vi /opt/hadoop/etc/hadoop/hdfs-site.xml, and add:
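The properties themselves weren't captured in this note.  A minimal single-node sketch, using the /data/hadoop directories created below:

```xml
<configuration>
  <!-- single node, so no replication -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/name_node</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/data_node</value>
  </property>
</configuration>
```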



# create data folders, and give appropriate ownership, group and permission for Hadoop process
sudo mkdir -p /data/hadoop/name_node
sudo mkdir -p /data/hadoop/data_node
sudo mkdir -p /data/hadoop/tmp

# vi /opt/hadoop/etc/hadoop/core-site.xml
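Again, the content wasn't captured.  A minimal sketch -- the host name bigdata is from this note's environment, but port 9000 is an assumption:

```xml
<configuration>
  <!-- default file system URI; host name per this VM's setup -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://bigdata:9000</value>
  </property>
  <!-- temp dir created below -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```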

# vi /opt/hadoop/etc/hadoop/yarn-site.xml
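A minimal sketch for the YARN config (the usual single-node setting, not necessarily what I had):

```xml
<configuration>
  <!-- enable the shuffle service MapReduce jobs need -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```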

# vi /opt/hadoop/etc/hadoop/mapred-site.xml
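A minimal sketch for the MapReduce config, assuming jobs should run on YARN:

```xml
<configuration>
  <!-- run MapReduce jobs on YARN rather than locally -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```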

Time to run:

$ . ~/.bashrc
$ hdfs namenode -format 
$ start-dfs.sh
$ start-yarn.sh
$ jps


  • NameNode, http://bigdata:9870/
  • Data node, http://bigdata:9864/datanode.html
  • Resource Manager, http://bigdata:8088/


Hadoop Commands


February 4, 2020

[NOTE] Install DB2 (Express-C) on Ubuntu 18

This is for:
VirtualBox on Win10 + Ubuntu 18 + DB2 express-C 64-bit.

Headless setting takes too much effort, thus installed on Ubuntu 18 with GUI.

1. Download for Linux

Register IBM ID when you download the file.

(1) First go here,

(2) Search for "express-c": https://www.ibm.com/search?lang=en&cc=us&q=express-c
And click on DB2 Express-C database server

(Or try this link, https://www-01.ibm.com/marketing/iwm/iwm/web/pickUrxNew.do?source=swg-db2expressc&mhsrc=ibmsearch_a&mhq=express-c)

I downloaded 64 bit, v11.1.  File name is v11.1_linuxx64_expc.tar.gz

2. Install

(1) Uncompress that file

$ tar zxvf v11.1_linuxx64_expc.tar.gz

(2) set up

$ cd expc

Check if all required dependencies are installed.  If not, install them.
$ ./db2prereqcheck

Then run installer
$ ./db2_install

Takes a while, and will install at ~/sqllib

3. Set up, Start

Start DB2:

$ db2start

And need to set up a couple of things.  Run client:

$ db2

3.1. Create sample database:

db2 => create database sample
DB20000I  The CREATE DATABASE command completed successfully.

3.2. Set up port 50000.  Update the service name in the DBM configuration (e.g., db2 => update dbm cfg using SVCENAME db2c_db2inst1), which prints:
SQL1362W  One or more of the parameters submitted for immediate modification
were not changed dynamically. Client changes will not be effective until the
next time the application is started or the TERMINATE command has been issued.
Server changes will not be effective until the next DB2START command.

3.3. Then set up TCP/IP

$ db2 get dbm cfg | grep SVCENAME
TCP/IP Service name                          (SVCENAME) =
SSL service name                         (SSL_SVCENAME) =

$ db2set DB2COMM=tcpip
$ db2set -all

To stop, start again, and test:

$ db2stop
SQL1064N  DB2STOP processing was successful.

$ db2start

$ db2 connect to sample user kkim using <password>
Other helpful commands, for diagnostic output:

$ db2diag


Make sure the port is defined in the /etc/services file; if not, add the line:
# vi /etc/services

Append this,

db2c_db2inst1   50000/tcp               # db2

Get driver from one of these sites:
I got one from IBM.  Uncompress the downloaded file (v11.1.4fp4a_jdbc_sqlj.tar.gz), and use these JARs: db2jcc.jar, db2jcc4.jar

JDBC String,
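The string itself wasn't captured here.  For the setup in this note (port 50000, database "sample"), a typical DB2 JDBC URL looks like this -- the host name is an assumption:

```
jdbc:db2://localhost:50000/sample
```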


DB GUI clients


February 2, 2020

[Note] PySpark on Windows

Purpose: Not for production.  Learning/testing only.
Environment: Win10 64-bit
Prerequisite: Java 8 or later, Anaconda

  1. Create an Anaconda environment.  Mine is called "develop" for Pyspark with Python 3.7
  2. activate develop (or your environment name)
  3. conda install pyspark jupyter
  4. Get 'winutils.exe' from https://github.com/steveloughran/winutils/tree/master/hadoop-2.8.1
  5. Store winutils.exe in C:\opt\spark-hadoop\bin  (or anywhere you like)
  6. Set up an environment variable, set HADOOP_HOME=C:\opt\spark-hadoop
  7. Type "pyspark" to get the PySpark console.  For Jupyter, just type "jupyter notebook"

Install Win10 with local account

My son was installing Win10 on a VM using VirtualBox today -- which is for testing out some downloaded free software.  Unfortunately, the Win10 installer removed the option to use a local account and forces you to use a Microsoft account.

The workaround -- disable the network adapter, and just keep going.  It'll create a local account instead.  Or, create an MS account first, then later create a local account and switch.

New Chromium based Edge is great

Get it from here, https://www.microsoft.com/en-us/edge

Love the Read Aloud text-to-speech feature.  I can "listen" to news while doing something else, or "listen" to Wikipedia page.

Great job!