What happened on October 29th, 2015?

The Shibboleth Docker container is composed of 14 separate configuration files and a multitude of Dockerfile and startup scripts. Modifying any of these files requires rebuilding the image from the Dockerfile and relaunching the container. On a normal day, the rebuild would produce the same credentials (certificates) as were previously used. This day, however, was Halloween eve eve, and on rebuild of the container Shibboleth generated fresh, shiny new certificates, likely because the old certificates were too old.

So we had a backup, right?

Since this service runs in Docker, a backup would have been as easy as docker commit -p shib shib-backup. However, this command was never run. Since the Docker container runs in a virtual machine, a snapshot of the hard disk would have been easy too, but that wasn't done either. This service, after almost 2 months of development, was seen as working and left alone to run peacefully in production. It was never finished: we didn't have a 5-minute recovery plan.
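
A 5-minute recovery plan could start with something as small as copying the credentials directory aside before every rebuild. The sketch below is one hypothetical way to do that in Python; the `backup_credentials` helper and the paths passed to it are assumptions for illustration, not part of our actual setup.

```python
import shutil
import time
from pathlib import Path


def backup_credentials(credentials_dir, backup_root):
    """Copy credentials_dir into a new timestamped folder under backup_root.

    Both paths are hypothetical; point them at wherever the container's
    certificates and keys actually live on the host.
    """
    source = Path(credentials_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"no credentials directory at {source}")

    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"credentials-{stamp}"
    # copytree refuses to overwrite an existing target, so an older backup
    # is never silently replaced.
    shutil.copytree(source, target)
    return target
```

Running something like this right before each image rebuild would leave a copy of the old certificates to restore from, independent of any Docker commit or VM snapshot.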

Why did it take so long to get working again?

Mostly error logs that were unclear and misleading. These logs sent me down some terrible rabbit holes, as they were undocumented for Shibboleth and led me into Shibboleth dependencies like OpenSAML and Bouncy Castle Crypto. I knew the issue was due to certificates, but nothing led me to believe they had simply been regenerated.

2015-10-29 19:37:55,415 - ERROR [net.shibboleth.utilities.java.support.security.DataSealer:181] - mac check in GCM failed  
2015-10-29 19:37:55,421 - ERROR [org.opensaml.storage.impl.ServletRequestScopedStorageService:369] - Exception unwrapping secured data  
net.shibboleth.utilities.java.support.security.DataSealerException: Exception unwrapping data  
        at net.shibboleth.utilities.java.support.security.DataSealer.unwrap(DataSealer.java:182) 
Caused by: org.bouncycastle.crypto.InvalidCipherTextException: mac check in GCM failed  
        at org.bouncycastle.crypto.modes.GCMBlockCipher.doFinal(Unknown Source) 
2015-10-29 19:37:55,427 - ERROR [org.opensaml.storage.impl.ServletRequestScopedStorageService:501] - Error loading data from cookie, starting fresh  
java.io.IOException: Exception unwrapping secured data  
        at org.opensaml.storage.impl.ServletRequestScopedStorageService.load(ServletRequestScopedStorageService.java:370) 
Caused by: net.shibboleth.utilities.java.support.security.DataSealerException: Exception unwrapping data  
        at net.shibboleth.utilities.java.support.security.DataSealer.unwrap(DataSealer.java:182) 
Caused by: org.bouncycastle.crypto.InvalidCipherTextException: mac check in GCM failed  
        at org.bouncycastle.crypto.modes.GCMBlockCipher.doFinal(Unknown Source) 

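These errors make it seem like there are file permission issues, corrupted certificates, or even bugs within the Shibboleth code. Most of what I found pertaining to these errors were commits from Shibboleth development. A failed GCM MAC check is also what decryption with the wrong key produces, which would be consistent with regenerated credentials, but nothing in the log says so.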

Shibboleth does not automatically generate metadata, as I had thought and had been told by vendors. My understanding was that, on setup, the metadata was generated using its current certificates. This couldn't be further from the truth. In fact, the metadata exists only for the benefit of vendors; Shibboleth does not care what is in it. Shibboleth ships with a metadata template, but leaves it to the deployer to write and customize it. This is actually true of most files in Shibboleth: there are 14 different XML files that need to be written practically from scratch, and they include some very intricate logic. The deployer must know what their intended outcome is and write all of these files in unison to get it. For our setup, these configurations are found in the conf/ folder of the Docker image.

I had a backup of the previous metadata file, and it matched what the server was currently spitting out. My confusion came from this misunderstanding of metadata generation, so I didn't look much deeper and assumed the certificates must be unchanged. I even tried to force Shibboleth to use old certificate files I had backed up, but to no avail.
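
One way to test the assumption that the certificates had not changed is to compare the certificate published in the metadata file with the certificate Shibboleth is actually using. The following is a minimal sketch, assuming the metadata embeds its certificates in standard XML Signature ds:X509Certificate elements and the certificate on disk is PEM encoded. The function names are hypothetical, and so are whatever file paths you read the two inputs from.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

# Namespace of the XML Signature elements that carry certificates in SAML metadata.
DS_NS = "http://www.w3.org/2000/09/xmldsig#"


def fingerprint(der_bytes):
    """SHA-256 fingerprint of a DER-encoded certificate, as a hex string."""
    return hashlib.sha256(der_bytes).hexdigest()


def metadata_fingerprints(metadata_xml):
    """Fingerprints of every ds:X509Certificate found in the metadata text."""
    root = ET.fromstring(metadata_xml)
    prints = set()
    for node in root.iter(f"{{{DS_NS}}}X509Certificate"):
        # The element text is base64 DER, often wrapped across several lines.
        der = base64.b64decode("".join((node.text or "").split()))
        prints.add(fingerprint(der))
    return prints


def pem_fingerprint(pem_text):
    """Fingerprint of the first certificate in a PEM string."""
    lines = [line.strip() for line in pem_text.strip().splitlines()]
    start = lines.index("-----BEGIN CERTIFICATE-----")
    end = lines.index("-----END CERTIFICATE-----")
    body = "".join(lines[start + 1:end])
    return fingerprint(base64.b64decode(body))


def certificate_in_metadata(metadata_xml, pem_text):
    """True if the PEM certificate is one of those published in the metadata."""
    return pem_fingerprint(pem_text) in metadata_fingerprints(metadata_xml)
```

If `certificate_in_metadata` returns False for the certificate the server is signing with, the metadata handed to vendors is stale, no matter how closely it matches an older backup of itself.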

What was the solution?

I contacted our Schoology support, and 5 hours later I was able to get through to their SAML deployer. He turned debugging on at his end, but we were still seeing errors that did not convey that the certificates simply did not match.

SimpleSAML_Error_Error: UNHANDLEDEXCEPTION  
Backtrace:  
0 /var/www/simplesamlphp_peninsulasso/www/module.php:180 (N/A)  
Caused by: Exception: Unable to validate Signature  
Backtrace:  
6 /var/www/simplesamlphp_peninsulasso/vendor/simplesamlphp/saml2/src/SAML2/Utils.php:158 (SAML2_Utils::validateSignature)  
5 [builtin] (call_user_func)  
4 /var/www/simplesamlphp_peninsulasso/vendor/simplesamlphp/saml2/src/SAML2/Message.php:212 (SAML2_Message::validate)  
3 /var/www/simplesamlphp_peninsulasso/modules/saml/lib/Message.php:195 (sspmod_saml_Message::checkSign)  
2 /var/www/simplesamlphp_peninsulasso/modules/saml/lib/Message.php:504 (sspmod_saml_Message::processResponse)  
1 /var/www/simplesamlphp_peninsulasso/modules/saml/www/sp/saml2-acs.php:96 (require)  
0 /var/www/simplesamlphp_peninsulasso/www/module.php:135 (N/A)  

Researching this error led to threads and blog posts about vendors that wanted explicit signature validation certificates, which would require adding the code below to our conf/relying_partys.xml file. This couldn't have been the fix, however, because this file hadn't changed since before the meltdown, and the intended use of this functionality is that the vendor provides the certificate for signature validation, which Schoology does not.

<MetadataProvider id="TuakiriMetadata"  
                  xsi:type="FileBackedHTTPMetadataProvider"
              refreshDelayFactor="0.125"
              maxRefreshDelay="PT2H"
              httpCaching="memory"
              backingFile="%{idp.home}/metadata/tuakiri-metadata.xml"
              metadataURL="https://directory.tuakiri.ac.nz/metadata/tuakiri-metadata-signed.xml">

        <MetadataFilter xsi:type="SignatureValidation"
                certificateFile="%{idp.home}/credentials/tuakiri-metadata-cert.pem"
                requireSignedMetadata="false">
        </MetadataFilter>
        <MetadataFilter xsi:type="EntityRoleWhiteList">
                <RetainedRole>md:SPSSODescriptor</RetainedRole>
        </MetadataFilter>

</MetadataProvider>  

After experimenting with different configurations, only to roll back to what had worked before, we tried swapping the certificates on Schoology's side for the certificates currently inside the Shibboleth folder. This worked, and login to Schoology was restored. The new certificates had to be put in place on their end; I had no way to swap certificate configurations for Schoology.

What happens next?

Documentation. VM snapshots. Docker commits. Testing environments. Integration testing. A homegrown SAML SP (Service Provider) for testing and debugging. Take a look at the Redundancy blog post for more information.