OpenCL in Commercial Environments
OpenCL is a heterogeneous computing framework that enables execution across different platforms such as CPUs and GPUs. Its design goal is that you write your program once and run it on any of those platforms, but you may run into problems when building it into a commercial product. These problems include the lack of OpenCL 1.2 support, driver instability, poor debugging tools, concurrency problems in the API, a cumbersome API, and limited SDK examples. Should you choose OpenCL for your commercial product? This article hopes to help answer that question.
To avoid any confusion, let's define some terms and put them into context:
- GPU- Graphics Processing Unit
- Kernel- Code written in either CUDA or OpenCL that is compiled for and executed by the GPU device
- OpenCL- Open Computing Language; allows the same kernel to run on either a CPU or a GPU, regardless of device manufacturer or platform
- CUDA- Compute Unified Device Architecture; the C language extension made specifically for NVidia GPU devices
Motivation
So management or someone in your organization has decided to make OpenCL the framework of choice for developing XYZ algorithm, or for extending the current XYZ algorithm to take advantage of the GPU for a better user experience. Whatever the reasons, you most likely chose OpenCL for its agnosticism towards hardware platforms and vendors. That choice, however, needs pragmatic consideration. In the following sections, I'll share my experience from a real-world, commercial, multi-process OpenCL development environment.
OpenCL 1.2 Drivers
Starting with drivers: as of April 26, 2014, NVidia only supports OpenCL 1.1 in its GPU drivers and has largely remained silent on 1.2 support. So what will you be missing out on?
Consider the following features added in OpenCL 1.2 that are not present in 1.1:
- Support for 3-D surfaces/images read and write- this is one of the most important features added to OpenCL 1.2. For scientific applications that require hardware interpolation of voxels (e.g. nearest neighbour or linear), this feature will be important to you. Only read is supported in OpenCL 1.1. Kernel writes to 3-D images are exposed through the cl_khr_3d_image_writes extension, so check that your device reports it. Currently, AMD is one of the few vendors that deliver the 1.2 specification in their GPUs. Memory layouts are also quite versatile, such as 32-bit float and normalized chars.
- Offline compilation and linking of kernels- AMD has offered this as an extension since 1.1, and version 1.2 standardizes it (clCompileProgram and clLinkProgram).
- Memset- You might have been frustrated that a simple operation such as setting memory required you to write your own kernel in version 1.1. In version 1.2, you have clEnqueueFillBuffer and clEnqueueFillImage to fill a buffer or an image, respectively.
- Double precision- Double precision was only an extension (cl_khr_fp64) in 1.1; 1.2 makes it an optional core feature of the standard.
Page 378 of the standard lists the changes in further detail.
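Because support for these features varies by driver, it can help to check at startup what the device actually reports. The following is a minimal sketch in C, assuming you have already fetched the space-separated extension string with clGetDeviceInfo and CL_DEVICE_EXTENSIONS. The helper name has_extension is my own and is not part of OpenCL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when `name` appears as a whole, space-separated token in
 * `extensions`, the string a driver reports for CL_DEVICE_EXTENSIONS.
 * A plain strstr() is not enough: it would also match longer names that
 * merely start with `name`. */
bool has_extension(const char *extensions, const char *name)
{
    size_t len = strlen(name);
    if (len == 0)
        return false;

    const char *p = extensions;
    while ((p = strstr(p, name)) != NULL) {
        bool starts_token = (p == extensions) || (p[-1] == ' ');
        bool ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token)
            return true;
        ++p; /* keep searching after this partial match */
    }
    return false;
}
```

For example, has_extension(ext, "cl_khr_fp64") tells you whether a 1.1 driver exposes double precision, and "cl_khr_3d_image_writes" does the same for kernel writes to 3-D images.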
As for driver problems, NVidia did something mysterious to their drivers after version 320.18. In a commercial application, you will need to target one specific driver for all your testing. My experience with unstable drivers has been long hours of debugging only to find out that the driver was the culprit. Therefore, I emphasize One Driver support for One Device in a commercial application. After all, you don't want to support multiple drivers when your product XYZW is in the field and needs servicing, do you?
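One way to enforce the one-driver-per-device rule is to have the application refuse to run, or at least warn, when it sees a driver it was not tested against. This is only a sketch in C under my own assumptions: the strings would come from clGetDeviceInfo with CL_DEVICE_NAME and CL_DRIVER_VERSION, while the table, the device name "Example GPU" and the function name driver_is_validated are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One row per supported device: the single driver version it was tested with. */
struct validated_driver {
    const char *device_name;    /* as reported for CL_DEVICE_NAME */
    const char *driver_version; /* as reported for CL_DRIVER_VERSION */
};

/* Illustrative entry only; fill this in with what you actually tested. */
static const struct validated_driver k_validated[] = {
    { "Example GPU", "320.18" },
};

/* Returns true only when the device is known and its driver version is
 * exactly the one that was validated for it. */
bool driver_is_validated(const char *device_name, const char *driver_version)
{
    size_t count = sizeof k_validated / sizeof k_validated[0];
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(k_validated[i].device_name, device_name) == 0)
            return strcmp(k_validated[i].driver_version, driver_version) == 0;
    }
    return false; /* unknown device: never validated */
}
```

An exact string comparison is deliberately strict here: a newer driver is treated as untested, not as better.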
Want to peep inside your kernel and step through line by line in a debugger? I find this the most disappointing part of the OpenCL framework for GPUs.
You have the following options for GPU debugging:
- CodeXL- Works only on AMD devices. Very buggy and unstable (as of the latest version on April 26, 2014).
- NSight- Does not do debugging but will profile your kernels.
My experience with CodeXL has not been good. Where I worked, we had complex kernels to debug, each several hundred lines long. Simple kernels might work, but the forums are riddled with complaints about how buggy CodeXL is, which I find very discouraging.
Concurrency Issues
Will your application make use of OpenCL in either of the following ways?
- Multiple Processes each with their own kernels manipulating data and sending the result to other OpenCL processes?
- Single Process with multiple OpenCL kernel algorithms?
Although the OpenCL specification states which of its functions are thread safe, some that are labelled thread safe don't seem to be.
Here's what to avoid:
- Avoid calling clGetPlatformIDs from multiple threads- when the platforms are being queried, only one thread may call this function. Any other thread attempting to call this function while the other is still using it will not receive the full platform list. The mentioned behaviour is present in both NVidia and AMD GPU devices.
- Memory Segmentation with multiple processes- GPU memory is segmented by the device driver. Avoid having multiple processes occupy more memory than is available on the device. You may run into a case where memory is swapped off the device and back between processes to make room for computation. I've seen an algorithm that took a few seconds wind up taking several minutes because two processes were competing for GPU memory resources.
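For the first item, one workaround is to query the platform list once, under a lock, and hand every thread a copy of the cached result. The sketch below is in C with pthreads and rests on several assumptions of mine: platform_query_fn has the same shape as clGetPlatformIDs but uses void* in place of cl_platform_id so that it builds without the OpenCL headers, and demo_query is a hypothetical stand-in for the driver. In a real build you would pass a thin wrapper around clGetPlatformIDs instead.

```c
#include <pthread.h>

#define MAX_PLATFORMS 16

/* Same shape as clGetPlatformIDs(num_entries, platforms, num_platforms);
 * returns 0 (the value of CL_SUCCESS) on success. */
typedef int (*platform_query_fn)(unsigned int num_entries, void **platforms,
                                 unsigned int *num_platforms);

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static void *g_platforms[MAX_PLATFORMS];
static unsigned int g_count = 0;
static int g_cached = 0;

/* Runs the query at most once, under a lock, and copies the cached list to
 * every caller. Returns 0 on success or the query's error code. */
int get_platforms_cached(platform_query_fn query, void **out,
                         unsigned int capacity, unsigned int *count)
{
    int err = 0;
    pthread_mutex_lock(&g_lock);
    if (!g_cached) {
        unsigned int found = 0;
        err = query(MAX_PLATFORMS, g_platforms, &found);
        if (err == 0) {
            /* The driver may report more platforms than we asked for. */
            g_count = found < MAX_PLATFORMS ? found : MAX_PLATFORMS;
            g_cached = 1;
        }
    }
    if (err == 0) {
        unsigned int n = g_count < capacity ? g_count : capacity;
        for (unsigned int i = 0; i < n; ++i)
            out[i] = g_platforms[i];
        *count = n;
    }
    pthread_mutex_unlock(&g_lock);
    return err;
}

/* Hypothetical stand-in for the driver: reports two platforms and counts
 * how many times it was actually called. */
static int g_demo_calls = 0;
static int demo_query(unsigned int num_entries, void **platforms,
                      unsigned int *num_platforms)
{
    static int platform_a, platform_b;
    ++g_demo_calls;
    if (num_entries >= 2) {
        platforms[0] = &platform_a;
        platforms[1] = &platform_b;
    }
    *num_platforms = 2;
    return 0;
}
```

However many threads call get_platforms_cached, the underlying query runs only once, so no thread can observe a partially filled list.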
Neither of these problems is easy to solve. Adding a "GPU manager" is labour intensive and error prone. Thus, we are left with executing memory-intensive operations synchronously as a simple solution.
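Running memory-intensive operations synchronously across processes can be approximated with an advisory file lock that each process takes before touching the GPU. This is a minimal sketch in C, assuming a POSIX system with flock; the function names and the lock file path used in the usage note are my own choices and not part of OpenCL.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Takes an exclusive lock on `path`, shared by all cooperating processes.
 * With wait != 0 the call blocks until the lock is free; with wait == 0 it
 * gives up immediately if someone else holds it.
 * Returns the lock's file descriptor, or -1 on failure. */
int gpu_lock_acquire(const char *path, int wait)
{
    int fd = open(path, O_CREAT | O_RDWR, 0666);
    if (fd < 0)
        return -1;
    if (flock(fd, wait ? LOCK_EX : (LOCK_EX | LOCK_NB)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Releases the lock taken by gpu_lock_acquire. */
void gpu_lock_release(int fd)
{
    if (fd >= 0) {
        flock(fd, LOCK_UN);
        close(fd);
    }
}
```

A process would call gpu_lock_acquire("/tmp/gpu_demo.lock", 1) before allocating its large buffers and gpu_lock_release once the results have been read back. The lock is advisory, so it only helps if every process that uses the GPU agrees to take it.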
OpenCL API
Are you still on the OpenCL bandwagon? If the above issues have not put you off, then you will need to consider this section. I have found the OpenCL API to be excessively verbose compared to CUDA. I can only point to so many examples, but try these:
- Passing arguments to kernels- Imagine having 11 arguments. You will have to call clSetKernelArg 11 times! This is particularly troublesome if one function sets arguments 1 through 9 and another sets 0 through 10, because the argument setup ends up scattered and discontinuous.
- Initializing a device- Getting a context for a device isn't that friendly either. Remember, OpenCL is designed to work across multiple platforms and devices from multiple vendors! Initializing the device is verbose by itself, but thankfully there are many examples to guide you.
- Libraries- As you know, CUDA and OpenCV have been around longer than OpenCL. Given the friendlier nature of CUDA, you will find more code libraries and utilities readily available for CUDA.
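For the argument-passing item, one way to keep things in order is to collect every argument in a single table, check that no slot was missed, and only then loop over clSetKernelArg. The sketch below shows just the table and the check, in plain C so that it builds without the OpenCL headers. The struct kernel_arg and the function first_unset_arg are hypothetical names of mine.

```c
#include <stddef.h>

/* One slot per kernel argument, mirroring the (arg_size, arg_value) pair
 * that clSetKernelArg takes. A slot with size == 0 has not been filled.
 * The value may legitimately be NULL (local memory arguments), so only
 * the size is used to detect gaps. */
struct kernel_arg {
    size_t size;
    const void *value;
};

/* Returns the index of the first unfilled slot, or -1 if every slot is set. */
int first_unset_arg(const struct kernel_arg *args, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (args[i].size == 0)
            return (int)i;
    }
    return -1;
}
```

Once first_unset_arg returns -1, a single loop can call clSetKernelArg(kernel, i, args[i].size, args[i].value) for each index, so different functions can fill different slots without the actual API calls being scattered around the code.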
Performance
Are CUDA and OpenCL equivalent in performance? Short answer: yes. In my experience with 3-D imagery, I have not seen a case where CUDA outperformed OpenCL. NVidia is, however, free to optimize CUDA any way they want to give the CUDA driver a performance boost over the OpenCL driver. Whether they do it or not, I don't know.
FFT Libraries
When doing scientific or signal processing work, we typically use the Fast Fourier Transform to convert data from the time (or spatial) domain to the frequency domain. We need a good library to do that. The good news is that for OpenCL, there is a free library that can be used for commercial purposes.
Check out Apple's OpenCL FFT library.
Conclusion
My intention was to share the OpenCL experience I went through in a real commercial application. I hope that you take all angles into consideration when choosing between OpenCL and CUDA (if that is your choice to make). You must answer some sincere questions: "Is it really necessary to have a platform-independent framework?", "Will OpenCL survive the fast-changing world of massively parallel computing?", and "What is the lifetime of my application?", just to name a few. Please write your comments below and let me know of any mistakes I've made. Thanks for reading!