This question has been raised a number of times over the years. The first time I faced it was when the JDK incorporated lcms1. They had a test profile that was somehow giving bad results when optimized. After close inspection, the profile was found to be operating in linear XYZ space. The complaint was almost the same: when I use a linear space as input, optimization doesn’t work; otherwise, all other combinations are ok. And that is indeed the case: all other combinations work fine. Try, for example, reversing the transform so it goes from regular sRGB to your linear space, and you will find all the issues are gone.
But anyway, there is this one case that seems to fail. And that’s true: in this particular case, some dark shadows get a dE > 1.5 when using the default settings. Fine, it happens that for this extreme case the defaults don’t work. That is why it is called a “default” and why there is a setting to control it.
So, the short answer is: don’t optimize when using a linear XYZ space as input in 16-bit transforms.
But I guess you also want the long answer. So, here you go.
When you use lcms to create a color transform joining two or more profiles, you are creating a devicelink profile. You don’t see it as a file; it lives in memory, and is destroyed when you delete the transform. But it is there.
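Just to make this concrete, here is a minimal sketch using the lcms2 API (the input profile file name is made up for the example). Creating the transform builds the devicelink internally, and cmsTransform2DeviceLink lets you materialize it as a regular profile and even save it if you want to look at it:

#include "lcms2.h"

int main(void)
{
    /* "linear-rgb.icc" is a hypothetical linear-gamma RGB profile, just for illustration */
    cmsHPROFILE hIn  = cmsOpenProfileFromFile("linear-rgb.icc", "r");
    cmsHPROFILE hOut = cmsCreate_sRGBProfile();

    /* Creating the transform builds a devicelink joining both profiles, kept in memory */
    cmsHTRANSFORM hXform = cmsCreateTransform(hIn, TYPE_RGB_16, hOut, TYPE_RGB_16,
                                              INTENT_PERCEPTUAL, 0);

    /* The hidden devicelink can be materialized as a profile and saved for inspection */
    cmsHPROFILE hDevLink = cmsTransform2DeviceLink(hXform, 4.3, 0);
    cmsSaveProfileToFile(hDevLink, "joined.icc");

    cmsCloseProfile(hDevLink);
    cmsDeleteTransform(hXform);      /* and here the in-memory devicelink is destroyed */
    cmsCloseProfile(hIn);
    cmsCloseProfile(hOut);
    return 0;
}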
Devicelinks can be implemented in different ways: for example, as a set of curves, as a matrix, as a 3D CLUT table, or as a combination of all the elements above. Some of those ways are better than others in terms of throughput; others are better in terms of image quality. CMMs have to “guess” which is the best combination of elements for a given set of profiles. There is a balance between quality and performance. For some corner cases, optimizing for speed can effectively introduce defects in quality.
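lcms2 actually exposes this building-block view through its pipeline API. The sketch below is not what the CMM would generate for any real pair of profiles; it just shows how stages (curves, a matrix, and so on) can be chained and evaluated, with made-up numbers:

#include "lcms2.h"

int main(void)
{
    /* Toy pipeline: a set of tone curves followed by an identity matrix.
       The values are arbitrary; a real devicelink would carry the actual color math. */
    cmsToneCurve* gamma22   = cmsBuildGamma(NULL, 2.2);
    cmsToneCurve* curves[3] = { gamma22, gamma22, gamma22 };

    const cmsFloat64Number identity[9] = { 1, 0, 0,
                                           0, 1, 0,
                                           0, 0, 1 };

    cmsPipeline* lut = cmsPipelineAlloc(NULL, 3, 3);
    cmsPipelineInsertStage(lut, cmsAT_END, cmsStageAllocToneCurves(NULL, 3, curves));
    cmsPipelineInsertStage(lut, cmsAT_END, cmsStageAllocMatrix(NULL, 3, 3, identity, NULL));

    /* Evaluate one 16-bit pixel through the whole chain */
    cmsUInt16Number in[3] = { 0x8000, 0x8000, 0x8000 }, out[3];
    cmsPipelineEval16(in, out, lut);

    cmsPipelineFree(lut);
    cmsFreeToneCurve(gamma22);
    return 0;
}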
The best devicelink representation often depends on the true nature of the space described by the profiles, especially the input space. But then… the profile only gives you a way to convert its space from/to Lab, and gives no other clue about the nature of the space.
An example of ill-formed spaces are those operating with a linear XYZ gamma. You should NEVER use linear gamma to store your 8-bit images. Why? Because in 8 bits you have 256 levels, and in linear gamma the separation between those levels is not perceptually uniform. That means you have very few levels to encode the effective dynamic range of your image, and many levels are wasted in highlights. Hold on, you would say, RAW images are encoded in linear gamma and they work quite well, don’t they? You are right… but I said 8 bits, remember? If you move to 16 bits or floating point, you can still use linear encoding, but with some care.
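A quick back-of-the-envelope check makes the waste visible. Taking 18% linear reflectance as “middle gray” and a plain 2.2 power law as the gamma encoding (both simplifications), count how many of the 256 codes are left for everything from black to middle gray:

#include <math.h>
#include <stdio.h>

int main(void)
{
    int linear_codes = 0, gamma_codes = 0;

    for (int code = 0; code < 256; code++) {
        double v = code / 255.0;

        /* Linear encoding: the code value IS the linear value */
        if (v <= 0.18) linear_codes++;

        /* Gamma 2.2 encoding: decode first, then compare the linear value */
        if (pow(v, 2.2) <= 0.18) gamma_codes++;
    }

    /* Prints roughly: linear 46, gamma 2.2 117 */
    printf("Codes from black to middle gray: linear %d, gamma 2.2 %d\n",
           linear_codes, gamma_codes);
    return 0;
}

So a linear 8-bit encoding leaves fewer than 50 codes for the whole black-to-midtone range, while a gamma-encoded file keeps well over a hundred.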
Back to our methods to encode devicelinks. One used by lcms when the transform converts from 16 bits to 16 bits is a CLUT table. This is just a 3D (or 4D in CMYK) grid with nodes. Pixel values are interpolated across nodes. For example, the distortion you get when going from sRGB to AdobeRGB is stored in a 3D grid with 17 nodes on each R, G, B side. When a pixel arrives, the nodes that enclose the value are selected and the result is interpolated. In our 17-node example, a value of, say, (100, 100, 100) lands at 100*(17-1)/255 = 6.27, so nodes 6 and 7 of each side will be taken for interpolation.
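In code, the node selection is just this arithmetic (a sketch mirroring the numbers in the text, not the actual lcms interpolation routines):

#include <stdio.h>

int main(void)
{
    const int nodes = 17;
    const int value = 100;                       /* the (100, 100, 100) example, per channel */

    double pos  = value * (nodes - 1) / 255.0;   /* 100 * 16 / 255 = 6.27 */
    int lower   = (int) pos;                     /* node 6 */
    int upper   = lower + 1;                     /* node 7 */
    double frac = pos - lower;                   /* 0.27, the interpolation weight */

    printf("value %d -> grid position %.2f, between nodes %d and %d (weight %.2f)\n",
           value, pos, lower, upper, frac);
    return 0;
}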
Let’s now take a linear space. Since, as said, with a linear encoding many colors are collapsed into relatively few codes, almost all the dynamic range is confined to a few nodes. That means that in a 17-node grid, most of the image’s dynamic range will fall in 5 or 6 nodes. And this is the reason you get posterization in shadows: most dark tones fall in just 1-2 nodes, and linear interpolation cannot deal with the non-linear nature of the linear-to-gamma 2.2 transform.
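You can see the crowding with the same node arithmetic. For a handful of dark linear values (expressed as fractions of full scale), compare where they land in a 17-node grid when encoded linearly versus with gamma 2.2 (again a simplified sketch):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const int nodes = 17;
    const double shadows[] = { 0.002, 0.005, 0.01, 0.02, 0.05, 0.10 };

    for (int i = 0; i < 6; i++) {
        double lin = shadows[i];

        double linear_pos = lin * (nodes - 1);                  /* linear encoding */
        double gamma_pos  = pow(lin, 1.0 / 2.2) * (nodes - 1);  /* gamma 2.2 encoding */

        printf("linear value %.3f -> node %.2f (linear) vs node %.2f (gamma 2.2)\n",
               lin, linear_pos, gamma_pos);
    }

    /* Everything up to 10% of full scale stays below node 1.6 in the linear encoding,
       while the gamma 2.2 encoding spreads the same values over roughly the first 6 nodes. */
    return 0;
}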
How to solve this? The most evident way is to not use the 3D CLUT optimization. The CMM already does that if you use floating point, or if you use 8 bits. In lcms 2.03 there are some experimental flags that try to solve this issue by adding extra tone curves: cmsFLAGS_CLUT_PRE_LINEARIZATION and cmsFLAGS_CLUT_POST_LINEARIZATION. I have checked them and found that they solve this issue as well.
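In lcms2 terms, both workarounds are just flags passed to cmsCreateTransform (again with a made-up input profile name):

#include "lcms2.h"

int main(void)
{
    cmsHPROFILE hIn  = cmsOpenProfileFromFile("linear-rgb.icc", "r");  /* hypothetical linear profile */
    cmsHPROFILE hOut = cmsCreate_sRGBProfile();

    /* Option 1: skip optimization entirely for this transform */
    cmsHTRANSFORM hAccurate =
        cmsCreateTransform(hIn, TYPE_RGB_16, hOut, TYPE_RGB_16,
                           INTENT_PERCEPTUAL, cmsFLAGS_NOOPTIMIZE);

    /* Option 2 (lcms 2.03+): keep the CLUT but wrap it with linearization curves */
    cmsHTRANSFORM hLinearized =
        cmsCreateTransform(hIn, TYPE_RGB_16, hOut, TYPE_RGB_16,
                           INTENT_PERCEPTUAL,
                           cmsFLAGS_CLUT_PRE_LINEARIZATION | cmsFLAGS_CLUT_POST_LINEARIZATION);

    cmsDeleteTransform(hAccurate);
    cmsDeleteTransform(hLinearized);
    cmsCloseProfile(hIn);
    cmsCloseProfile(hOut);
    return 0;
}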
So, that is the reason why you only see this issue when converting from 16 bits to 16 bits with default flags. Placing a NOOPTIMIZE in all transforms would prevent the problem, but at a big performance penalty that is hard to justify just to fix this specific case. It is like having a Ferrari but always driving at 25 mph just because you once faced a winding road.
My recommendation for programmers would be to allow the end user to turn optimization off for general usage, or at least to provide a specialized workflow for RAW handling with optimizations turned off, since that is the only place where linear XYZ makes sense. For users, I would recommend NEVER using linear XYZ spaces. They are good for nothing, neither for storage nor for image processing. The very few algorithms that need to operate in linear space can do and undo the conversion when processing. But anyway, there are people with strong opinions in this field, and everybody is free to do whatever they want. This is just a recommendation; please don’t take it as stone-engraved truth.