March 8, 2019

Notes on CUDA and TensorFlow

This is a note to myself. I just had to re-install TensorFlow and wanted to write down the steps for the record.

It covers installing CUDA, Anaconda, and TensorFlow.

Environment: Win10 Pro 64-bit

I have two old GPU cards, a Tesla C2050 and a Quadro 600 (see the deviceQuery output at the end of this note). I bought the Tesla for GPU programming many years ago, before TensorFlow came out, and paid good money for it, but now it goes for only $60-$70 on eBay.

I can still use these old cards with frameworks other than TensorFlow. To stay compatible with all my cards, I have to stick to CUDA 8. The latest CUDA is 10.1, but the newer CUDA releases no longer support these older (compute capability 2.x) cards.

Due to some TensorFlow work I had to do last year, I bought a somewhat newer graphics card for it, a Quadro P2000.
TensorFlow GPU requires a minimum compute capability of 3.0, and the Quadro P2000 (compute capability 6.1) performs decently for experimental work. I used it alongside AWS. Comparing the P2000 against a small AWS GPU instance, and taking into account the duration of the work, disk space, and so on, I found that buying this card was a good decision. For larger projects with a good budget, I would explore the AWS option.

So there are three graphics cards: two are connected to actual monitors, and the P2000 is used only for GPU computing with TensorFlow.
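
To make the compute-capability rule concrete, here is a minimal sketch in Python that applies the 3.0 minimum to my three cards. The capability values are copied from the deviceQuery output at the end of this note; the names `CARDS`, `MIN_CAPABILITY`, and `usable_by_tensorflow` are just ones I made up for illustration, not part of any library.

```python
# Compute capabilities (major, minor) as reported by deviceQuery for my cards.
CARDS = {
    "Quadro P2000": (6, 1),
    "Tesla C2050": (2, 0),
    "Quadro 600": (2, 1),
}

# Assumption: the 3.0 minimum mentioned in this note for TensorFlow GPU.
MIN_CAPABILITY = (3, 0)


def usable_by_tensorflow(cards, minimum=MIN_CAPABILITY):
    """Return the sorted names of cards whose capability meets the minimum."""
    # Tuples compare element by element, so (6, 1) >= (3, 0) works as expected.
    return sorted(name for name, cap in cards.items() if cap >= minimum)


if __name__ == "__main__":
    print(usable_by_tensorflow(CARDS))
```

Only the P2000 passes the check, which matches what I see in practice: the two older cards are left to the other frameworks.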

To use TensorFlow on the GPU with the Quadro P2000, and also stay backward compatible with the older cards for other frameworks, C++, etc., I use CUDA 8.

Next, install Python 3.5 for TensorFlow v1.3 with CUDA 8. Run the Anaconda console. By the way, I use ConEmu, so I use this entry for the Anaconda task:

%windir%\System32\cmd.exe "/K" C:\opt\Anaconda3\Scripts\activate.bat C:\opt\Anaconda3

Then, in the Anaconda console, create an environment for TensorFlow:

(base) C:\Users\kkim> conda create --name tensorflow python=3.5
(base) C:\Users\kkim> activate tensorflow

Then install all the required packages:

(tensorflow) C:\Users\kkim>conda install pandas matplotlib jupyter notebook scipy scikit-learn numpy nb_conda pillow h5py pyhamcrest cython
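
After the install, a quick sanity check is to ask Python whether each package can be found. This is only a sketch: `missing_modules` and `IMPORT_NAMES` are names I made up, and the import names are my assumption of how the conda packages map to modules (for example scikit-learn imports as `sklearn` and pillow as `PIL`).

```python
import importlib.util

# Assumed import names for the conda packages installed above.
# Some differ from the package name: scikit-learn -> sklearn,
# pillow -> PIL, pyhamcrest -> hamcrest, cython -> Cython.
IMPORT_NAMES = [
    "pandas", "matplotlib", "scipy", "sklearn", "numpy",
    "PIL", "h5py", "hamcrest", "Cython",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(IMPORT_NAMES)
    if missing:
        print("Missing: {}".format(", ".join(missing)))
    else:
        print("All packages found.")
```

Run it inside the activated tensorflow environment; anything it lists as missing can be installed again with conda.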

Now install TensorFlow and Keras. If you don't want Keras, you can install TF only. To install Keras, you have to follow these odd steps: install TF, install Keras, uninstall TF, and then install TF again.

This is because Keras needs TF to be installed, but installing Keras breaks something in the TF installation. The fix is to uninstall TF and re-install it. See Reference #4:

(tensorflow) C:\Users\kkim>pip install tensorflow-gpu==1.3
(tensorflow) C:\Users\kkim>pip install keras
(tensorflow) C:\Users\kkim>pip uninstall tensorflow-gpu
(tensorflow) C:\Users\kkim>pip install tensorflow-gpu==1.3

Because I am staying on CUDA 8, I use TF v1.3. The prebuilt tensorflow-gpu packages from v1.5 onward need CUDA 9 or newer.
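
To keep the version matching in one place, here is a small lookup sketch. The table is partial and hypothetical in form: the entries are what I understand from the compatibility table in Reference #2 for the prebuilt tensorflow-gpu packages (as far as I can tell v1.4 was also built against CUDA 8), so double-check them there before relying on them. `TF_GPU_CUDA` and `tf_versions_for_cuda` are names I made up.

```python
# Partial table: tensorflow-gpu release -> CUDA version it was built against.
# Assumption: values as I read them from the table in Reference #2.
TF_GPU_CUDA = {
    "1.3": "8.0",
    "1.4": "8.0",
    "1.5": "9.0",
    "1.12": "9.0",
    "1.13": "10.0",
}


def version_key(version):
    """Turn '1.12' into (1, 12) so versions sort numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def tf_versions_for_cuda(cuda_version, table=TF_GPU_CUDA):
    """Return the listed TF releases built against the given CUDA version."""
    matches = [tf for tf, cuda in table.items() if cuda == cuda_version]
    return sorted(matches, key=version_key)


if __name__ == "__main__":
    print(tf_versions_for_cuda("8.0"))
```

The cuDNN version has to match as well, which this sketch does not track.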

All done. Now it's time to have fun with Jupyter and TF. TF will use the Quadro P2000 only, but the CUDA SDK and other frameworks can use all three cards.
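
If you want to pin TF to the P2000 explicitly and confirm what it sees, something like the following sketch should work. It assumes the P2000 is CUDA device 0, as in my deviceQuery listing, and it uses `device_lib.list_local_devices()` from TensorFlow to list devices. `visible_gpu_names` is a helper name I made up, and it returns an empty list when TensorFlow cannot be imported.

```python
import os

# Assumption: index 0 is the Quadro P2000, as in the deviceQuery listing.
# Set this before TensorFlow is imported, since CUDA reads it at start-up.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


def visible_gpu_names():
    """Return TensorFlow's GPU device names, or [] if TF is not importable."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return []
    devices = device_lib.list_local_devices()
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    print(visible_gpu_names())
```

Running this in a Jupyter cell at the top of a notebook is a quick way to check that the GPU build is really in use before starting a long training run.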

Just a note...


Here is the output of deviceQuery, an example program that comes with the CUDA SDK from NVIDIA:

deviceQuery.exe Starting...
 

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 3 CUDA Capable device(s)

Device 0: "Quadro P2000"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 5120 MBytes (5368709120 bytes)
  ( 8) Multiprocessors, (128) CUDA Cores/MP:     1024 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             3504 Mhz
  Memory Bus Width:                              160-bit
  L2 Cache Size:                                 1310720 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 9 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla C2050"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 3072 MBytes (3221225472 bytes)
  (14) Multiprocessors, ( 32) CUDA Cores/MP:     448 CUDA Cores
  GPU Max Clock rate:                            1147 MHz (1.15 GHz)
  Memory Clock rate:                             1500 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 786432 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "Quadro 600"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 2) Multiprocessors, ( 48) CUDA Cores/MP:     96 CUDA Cores
  GPU Max Clock rate:                            1280 MHz (1.28 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 131072 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 6 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 3, Device0 = Quadro P2000, Device1 = Tesla C2050, Device2 = Quadro 600
Result = PASS
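
The listing above is easy to pull apart with a few regular expressions if I ever want the card details in a script. Below is a minimal sketch; `parse_device_query` and the field names are my own, and `SAMPLE` is just a few lines copied from the output above. It also checks that multiprocessors times cores per multiprocessor equals the reported core count.

```python
import re

# A few lines copied from the deviceQuery output above.
SAMPLE = '''\
Device 0: "Quadro P2000"
  CUDA Capability Major/Minor version number:    6.1
  ( 8) Multiprocessors, (128) CUDA Cores/MP:     1024 CUDA Cores
Device 1: "Tesla C2050"
  CUDA Capability Major/Minor version number:    2.0
  (14) Multiprocessors, ( 32) CUDA Cores/MP:     448 CUDA Cores
'''

DEVICE_RE = re.compile(r'^Device (\d+): "(.+)"$')
CAPABILITY_RE = re.compile(
    r'CUDA Capability Major/Minor version number:\s+(\d+)\.(\d+)')
CORES_RE = re.compile(
    r'\(\s*(\d+)\) Multiprocessors, \(\s*(\d+)\) CUDA Cores/MP:\s+(\d+) CUDA Cores')


def parse_device_query(text):
    """Return one dict per device with its index, name, capability, and cores."""
    devices = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = DEVICE_RE.match(line)
        if match:
            # A "Device N:" header starts a new record.
            devices.append({"index": int(match.group(1)), "name": match.group(2)})
            continue
        if not devices:
            continue  # skip anything before the first device header
        match = CAPABILITY_RE.search(line)
        if match:
            devices[-1]["capability"] = (int(match.group(1)), int(match.group(2)))
            continue
        match = CORES_RE.search(line)
        if match:
            sms, per_sm, total = (int(g) for g in match.groups())
            devices[-1]["cores"] = total
            devices[-1]["cores_consistent"] = (sms * per_sm == total)
    return devices


if __name__ == "__main__":
    for device in parse_device_query(SAMPLE):
        print(device)
```

To use it on the real thing, redirect the deviceQuery output to a text file and pass the file contents to `parse_device_query`.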


References

  1. https://ulrik.is/writing/keras-tensorflow-with-cuda-8-and-cudnn-on-windows-10/
  2. CUDA version, TensorFlow version match: https://stackoverflow.com/questions/50622525/which-tensorflow-and-cuda-version-combinations-are-compatible#50622526
  3. https://medium.com/@minhplayer95/how-to-install-tensorflow-with-gpu-support-on-windows-10-with-anaconda-4e80a8beaaf0
  4. Installing TensorFlow with Keras: https://github.com/keras-team/keras/issues/5776
