February 16, 2020

Big Data Technologies

*Descriptions are from each technology's site.

I did some work on Hadoop around 2008-2010.  It's 2020, and it seems like I might have to use it again.  Hadoop was painful to use back then.  It has improved a lot, and there are now a lot of good tools around it.  At least the hype has died down, and more people have experience with it, so there is a more realistic view of big data and Hadoop.

Here, I'm listing big data and Hadoop-related technologies, since there are so many of them.



Hadoop - a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
http://hadoop.apache.org

HDFS - distributed file system

https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
https://cwiki.apache.org/confluence/display/HADOOP2/HDFS/

Hadoop MapReduce - A YARN-based system for parallel processing of large data sets.
https://en.wikipedia.org/wiki/MapReduce
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

BigTop - for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. Bigtop supports a wide range of components/projects, including, but not limited to, Hadoop, HBase and Spark.

http://bigtop.apache.org

YARN - framework for job scheduling and cluster resource management
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

Hadoop Ozone - An object store for Hadoop.
https://hadoop.apache.org/ozone/

Hadoop Submarine - A machine learning engine for Hadoop.
https://hadoop.apache.org/submarine/

Ambari - A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
https://ambari.apache.org

Avro - A data serialization system.
https://avro.apache.org

Parquet - columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
https://parquet.apache.org/

ORC - smallest, fastest columnar storage for Hadoop. Includes support for ACID transactions and snapshot isolation.
https://orc.apache.org

Arrow - a cross-language development platform for in-memory data; columnar memory format.
https://arrow.apache.org/

Impala - native analytic database for Apache Hadoop
http://impala.apache.org/index.html

Flink - Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams
https://flink.apache.org/usecases.html

Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer.
https://kudu.apache.org/

Cassandra - A scalable multi-master database with no single points of failure.
https://cassandra.apache.org

Chukwa - A data collection system for managing large distributed systems.
https://chukwa.apache.org

HBase - A scalable, distributed database that supports structured data storage for large tables.

Column oriented, non-RDBMS on top of Hadoop.
https://hbase.apache.org

Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying. Data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
https://hive.apache.org

Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
https://flume.apache.org/download.html

Mahout - A Scalable machine learning and data mining library.
A distributed linear algebra framework and mathematically expressive Scala DSL designed to let mathematicians, statisticians, and data scientists quickly implement their own algorithms. Apache Spark is the recommended out-of-the-box distributed back-end, or can be extended to other distributed backends.
https://mahout.apache.org

Pig - A high-level data-flow language and execution framework for parallel computation.
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
https://pig.apache.org

Oozie - a workflow scheduler system to manage Apache Hadoop jobs
https://oozie.apache.org/

Spark - a unified analytics engine for large-scale data processing.
Can be used for: ETL, machine learning, stream processing, and graph computation.
https://spark.apache.org

Tez - an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.
https://tez.apache.org

ZooKeeper - a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
https://zookeeper.apache.org

Mesos - built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.
http://mesos.apache.org/

Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
https://sqoop.apache.org

February 15, 2020

Notes on Parquet and ORC

ORC (Optimized Row Columnar)
  • flattened data
  • lightweight index + bloom filter
  • better compression
  • Better with Hive
  • much less GC

Parquet
  • Nested data
  • Better with Spark
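
A quick way to poke at both formats is from a PySpark shell (a minimal sketch; the paths and the "bucket" column are arbitrary, and it assumes a working local Spark install):

$ pyspark
>>> df = spark.range(100000).selectExpr("id", "id % 10 AS bucket")
>>> df.write.mode("overwrite").parquet("/tmp/fmt_test/parquet")
>>> df.write.mode("overwrite").orc("/tmp/fmt_test/orc")
>>> spark.read.parquet("/tmp/fmt_test/parquet").count()
>>> spark.read.orc("/tmp/fmt_test/orc").count()

Comparing the on-disk sizes and scan times on your own data says more than any general rule of thumb.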

Note is in progress

Serverless: fnproject, OpenLambda

As far as I remember, Google was the first commercial service to provide a serverless computing environment.  It wasn't easy to use in the beginning, but it scaled very well.  Not sure if it was cost-effective, though.

For a startup, it was a very reasonable environment.  That was before containers like Docker, and before AWS was popular.  My friend's startup back then used it for its backend.  As the product started to get traffic spikes, scaling it was extremely simple (but it wasn't cheap).

I've been waiting for a cheap, affordable serverless computing environment for personal use.  It doesn't have to be a distributed or cloud environment.  I just want a simple environment that I can run on one node or more, and deploy simple applications to it -- somewhat similar to dropping a WAR into Tomcat, or like OSGi (e.g. Apache Felix - http://felix.apache.org/).  But those are Java-only, and not for my purpose (small, simple, light apps).

After much searching, I found two serverless implementations that can run in a small environment, or even on a single node: fnproject and OpenLambda.  Both of them are open source.

fnproject


fn is a fairly mature and complete project, supporting many languages (Go, Java, Node, Python, Ruby).  It was very simple to use and deploy... but the performance wasn't great.  I believe the implementation uses Docker containers -- a container inside a container.

Requires Docker and Go.

PROS: simple and easy to install.  Supports multiple languages.
CONS: slow
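
For reference, this is roughly the flow I used to try it out -- from memory, so treat the app/function names and exact flags below as approximate rather than authoritative:

$ fn start                         # runs the local Fn server (itself a Docker container)
$ fn init --runtime python hello   # scaffold a small function; other runtimes: go, java, node, ruby
$ cd hello
$ fn create app myapp
$ fn deploy --app myapp --local    # build and deploy locally, no registry push
$ fn invoke myapp hello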

OpenLambda


This isn't as mature or complete as fn, and I failed to install it on an Ubuntu 18 VM.  I don't know why.  Compilation went fine, but 'make test-all' failed.  To compile, Docker, gcc, make, and the latest Go are needed (v12, v13 worked), but the tests still failed.  It probably takes the same approach as fn, since it requires Docker: containers inside containers.

PROS: couldn't test
CONS: n/a

Misc


IronFunctions, https://github.com/iron-io/functions - it seems like development has stopped.  It looks like it uses the same Docker-based approach.  I didn't try it, and you probably shouldn't either; just listing it here in case someone is interested.

Conclusion


So the search continues.  I'll probably use AWS Lambda; the cost for small experiments/use shouldn't be too much.



[Note] Old S/W - SparkleShare, Detachtty, QCL, Quipper

[This is just to leave a record.]

While cleaning up an old VM (CentOS 6), I found some software I played around with several years ago:

SparkleShare - using OwnCloud instead - https://owncloud.org/, blogged here - https://blog.keithkim.com/2018/10/public-cloud-storage-and-owncloud-for.html

Detachtty - no longer running Lisp web apps

Quantum computer - using https://qiskit.org/ instead, as blogged here - https://blog.keithkim.com/2019/11/quantum-computer-getting-started-qiskit.html

There were also BigTop, Jenkins, and other personal Lisp projects on that VM.  I'm now setting up another dev VM with CentOS 8.



Technologies come and go so fast.

February 14, 2020

[Note] Install Mahout

Environment: Ubuntu 18 64-bit Server
Prerequisite: Hadoop


Download

$ wget http://archive.apache.org/dist/mahout/0.13.0/apache-mahout-distribution-0.13.0.tar.gz

Uncompress and install at /opt/mahout

$ vi .bashrc
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export MAHOUT_HOME=/opt/mahout
export PATH=$PATH:$MAHOUT_HOME/bin
# use local
#export MAHOUT_LOCAL=true
# use with hadoop
export MAHOUT_LOCAL=""

$ . ~/.bashrc


Test

$ mkdir test; cd test
$ wget http://www.grouplens.org/system/files/ml-100k.zip
$ unzip ml-100k.zip
$ cd ml-100k
$ hdfs dfs -mkdir -p /user/hduser
$ hadoop fs -put u.data u.data

$ mahout recommenditembased -s SIMILARITY_COOCCURRENCE --input u.data --output output.txt
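
Once the job finishes, the recommendations land in the output directory as ordinary MapReduce part files.  A quick way to peek at the result (the part file names below follow the usual MapReduce naming and may differ):

$ hdfs dfs -ls output.txt
$ hdfs dfs -cat 'output.txt/part-r-*' | head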


More Info

https://www.tutorialspoint.com/mahout/mahout_introduction.htm


February 13, 2020

[Note] Installing Hive, HBase

Environment: Ubuntu 18 Server 64-bit on VirtualBox (host: Windows 10)
Prerequisite: Hadoop Install

Install Hive

Download, http://apache-mirror.8birdsvideo.com/hive/hive-2.3.6/apache-hive-2.3.6-bin.tar.gz

Uncompress and install in /opt/hive

Prep env vars

$ vi ~/.bashrc
# --- hive
export HIVE_HOME=/opt/hive
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$(hadoop classpath)
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*


$ . ~/.bashrc

Copy and set environment script

$ cd /opt/hive/conf
$ cp hive-env.sh.template hive-env.sh
$ vi hive-env.sh
HADOOP_HOME=/opt/hadoop


Copy default config

$ cd /opt/hive/conf
$ cp hive-default.xml.template hive-default.xml


Prep Hadoop FS

$ hdfs dfs -mkdir -p /user/hduser/warehouse
$ hdfs dfs -mkdir /tmp
$ hdfs dfs -chmod g+w /user/hduser/warehouse
$ hdfs dfs -chmod g+w /tmp


Prep metadata db

$ schematool -initSchema -dbType derby


Test

$ hive
hive> create database test;
hive> show databases;
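
And a tiny end-to-end check, in the same session, that table creation and queries actually run (a minimal sketch; the table name and values are arbitrary):

hive> use test;
hive> create table words (word string);
hive> insert into table words values ('hadoop'), ('hive'), ('hadoop');
hive> select word, count(*) from words group by word;
hive> drop table words;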


See here for more info:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallationandConfiguration


Install HBase

Download,
https://hbase.apache.org/downloads.html
https://www-us.apache.org/dist/hbase/hbase-1.4.12/hbase-1.4.12-bin.tar.gz

Uncompress and install at /opt/hbase

Setup

$ vi /opt/hbase/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


$ vi ~/.bashrc
export HBASE_HOME=/opt/hbase
export PATH=$PATH:$HBASE_HOME/bin

$ mkdir -p /data/hbase/hfiles
$ mkdir -p /data/hbase/zookeeper

$ vi /opt/hbase/conf/hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/data/hbase/zookeeper</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
</configuration>


Start

/opt/hbase/bin$ ./start-hbase.sh

Test

$ hadoop fs -ls /hbase

See here for more info, https://hbase.apache.org/book.html#quickstart

hbase(main):001:0> create 'test', 'cf'
hbase(main):002:0> list 'test'
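
A few more shell commands to confirm writes and reads work, following the same quickstart (the row key and value are arbitrary):

hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
hbase(main):004:0> get 'test', 'row1'
hbase(main):005:0> scan 'test'
hbase(main):006:0> disable 'test'
hbase(main):007:0> drop 'test'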



February 10, 2020

[Note] Installing Hadoop 2 on Ubuntu 18


Environment: Ubuntu 18 64-bit, Java 8 installed, host name=bigdata
This is a note on installing Hadoop 2.


# optional: create user:group for hadoop.  I use hduser:hadoop

# set up password-less ssh, and do this:
ssh 0.0.0.0

# get the Hadoop 2 tarball, uncompress it, and install in /opt/hadoop
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz

# vi ~/.bashrc and add:
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}/bin
export HADOOP_COMMON_HOME=${HADOOP_HOME}/bin

export HADOOP_HDFS_HOME=${HADOOP_HOME}/bin
export YARN_HOME=${HADOOP_HOME}/bin
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"


# vi /opt/hadoop/etc/hadoop/hadoop-env.sh, and add:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# vi /opt/hadoop/etc/hadoop/hdfs-site.xml, and add:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/hadoop/name_node</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/hadoop/data_node</value>
        <final>true</final>
    </property>
</configuration>


# create data folders, and give appropriate ownership, group and permission for Hadoop process
sudo mkdir -p /data/hadoop/name_node
sudo mkdir -p /data/hadoop/data_node
sudo mkdir -p /data/hadoop/tmp

# vi /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop/tmp</value>
    </property>
</configuration>


# vi /opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>


# vi /opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>


Time to run:

$ . ~/.bashrc
$ hdfs namenode -format 
$ start-dfs.sh
$ start-yarn.sh
$ jps
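
For a quick smoke test, the bundled examples jar can run a word count end to end.  This is a sketch -- the jar path assumes the stock 2.10.0 tarball layout under /opt/hadoop, so adjust the version/path to your install:

$ hdfs dfs -mkdir -p /user/hduser/input
$ hdfs dfs -put /opt/hadoop/etc/hadoop/*.xml /user/hduser/input
$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar wordcount /user/hduser/input /user/hduser/output
$ hdfs dfs -cat '/user/hduser/output/part-r-*' | head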


UI

  • NameNode, http://bigdata:50070/ (9870 is the Hadoop 3 default)
  • DataNode, http://bigdata:50075/ (9864 on Hadoop 3)
  • Resource Manager, http://bigdata:8088/




February 4, 2020

[NOTE] Install DB2 (Express-C) on Ubuntu 18


This is for:
VirtualBox on Win10 + Ubuntu 18 + DB2 express-C 64-bit.


A headless setup takes too much effort, so I installed it on Ubuntu 18 with a GUI.


1. Download for Linux

Register IBM ID when you download the file.

(1) First go here,
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.7.0/com.ibm.db2.luw.qb.server.doc/doc/r0054460.html

(2) Search for "express-c": https://www.ibm.com/search?lang=en&cc=us&q=express-c
and click on "DB2 Express-C database server".

(Or try this link, https://www-01.ibm.com/marketing/iwm/iwm/web/pickUrxNew.do?source=swg-db2expressc&mhsrc=ibmsearch_a&mhq=express-c)

I downloaded 64 bit, v11.1.  File name is v11.1_linuxx64_expc.tar.gz

2. Install

(1) Uncompress that file

$ tar zxvf v11.1_linuxx64_expc.tar.gz

(2) set up

$ cd expc

Check if all required dependencies are installed.  If not, install them.
$ ./db2prereqcheck

Then run installer
$ ./db2_install

Takes a while, and will install at ~/sqllib

3. Set up, Start

Start DB2:

$ db2start

And we need to set up a couple of things.  Run the client:

$ db2

3.1. Create sample database:

db2 => create database sample
DB20000I  The CREATE DATABASE command completed successfully.

3.2. Set up port 50000:
 
db2 => UPDATE DBM CFG USING SVCENAME 50000
DB20000I  The UPDATE DATABASE MANAGER CONFIGURATION command completed
successfully.
SQL1362W  One or more of the parameters submitted for immediate modification
were not changed dynamically. Client changes will not be effective until the
next time the application is started or the TERMINATE command has been issued.
Server changes will not be effective until the next DB2START command.

3.3. Then set up TCP/IP

$ db2 get dbm cfg | grep SVCENAME
TCP/IP Service name                          (SVCENAME) =
SSL service name                         (SSL_SVCENAME) =

$ db2set DB2COMM=tcpip
$ db2set -all
[i] DB2COMM=TCPIP
[g] DB2_COMPATIBILITY_VECTOR=MYS


To stop, start again, and test:

$ db2stop
SQL1064N  DB2STOP processing was successful.

$ db2start

$ db2 connect to sample user kkim using <password>

Other helpful commands, for diagnostic output:

$ db2diag
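
After the restart, a quick check that the instance is actually listening on the TCP port (ss may be netstat -an on older boxes):

$ db2 get dbm cfg | grep SVCENAME
$ ss -ltn | grep 50000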

4. JDBC

Make sure the port is defined in the /etc/services file; if not, add the line:
# vi /etc/services

Append this,

db2c_db2inst1   50000/tcp               # db2


Get the JDBC driver; I got mine from IBM.  Uncompress the downloaded file (v11.1.4fp4a_jdbc_sqlj.tar.gz), and use these JARs: db2jcc.jar, db2jcc4.jar

JDBC String,
jdbc:db2://myhost:50000/sample
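
To test the same host/port from the DB2 command line on a client machine first, the database can be cataloged over TCP/IP (a sketch; the node and alias names are arbitrary):

$ db2 catalog tcpip node mynode remote myhost server 50000
$ db2 catalog database sample as sample2 at node mynode
$ db2 terminate
$ db2 connect to sample2 user kkim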



February 2, 2020

[Note] PySpark on Windows

Purpose: Not for production.  Learning/testing only.
Environment: Win10 64-bit
Prerequisite: Java 8 or later, Anaconda

  1. Create an Anaconda environment.  Mine is called "develop", for PySpark with Python 3.7
  2. activate develop (or your environment name)
  3. conda install pyspark jupyter
  4. Get 'winutils.exe' from https://github.com/steveloughran/winutils/tree/master/hadoop-2.8.1
  5. Store winutils.exe in C:\opt\spark-hadoop\bin  (or anywhere you like)
  6. Set up an environment variable, set HADOOP_HOME=C:\opt\spark-hadoop
  7. type "pyspark" to get the PySpark (Python) console -- spark-shell is the Scala one.  For Jupyter, just type "jupyter notebook".  (See the quick test below.)
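
A quick sanity check once the console is up -- a minimal sketch, nothing Windows-specific (the column name "bucket" is just an example):

pyspark
>>> spark.version
>>> df = spark.range(1000).selectExpr("id", "id % 10 AS bucket")
>>> df.groupBy("bucket").count().show()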

Install Win10 with local account

My son was installing Win10 on a VM using VirtualBox today -- for testing out some downloaded free software.  Unfortunately, the Win10 installer removed the option to use a local account and forces you to use a Microsoft account.

The workaround: disable the network adapter and just keep going; it'll create a local account instead.  Or create a Microsoft account first, then later create a local account and switch.


New Chromium based Edge is great

Get it from here, https://www.microsoft.com/en-us/edge

Love the Read Aloud text-to-speech feature.  I can "listen" to news while doing something else, or "listen" to a Wikipedia page.

Great job!