Mostly we’re very happy with it. Among its many environments, OpenShift offers a JBoss EAP 6.x server that’s updated very regularly. JBoss EAP 6.x is Red Hat’s Java EE 6 implementation, which has received many bug fixes over the years, so it’s rather stable at the moment. And even though Red Hat has a Java EE 7 implementation out (WildFly 8), the Java EE 6 server keeps getting bug fixes to make it even more stable.
Yesterday, however, both nodes on which our showcase app is deployed suddenly appeared to be down. An attempt to restart the app via the web console didn’t do anything. It just sat there for a long time and then eventually said there was a technical problem, without providing any further details. This is unfortunately one of the downsides of OpenShift. It’s a great platform, but the web console clearly lags behind.
We then tried to log in to our primary gear using ssh [number]@[our app name].rhcloud.com. This worked; however, the JBoss instances are not running on this primary gear but on two other gears. We tried the “ctl_all stop” and “ctl_all start” commands, but this only seemed to restart the cartridges (HAProxy and a JBoss instance that is disabled by default) on the gear where we were logged in, not on the other ones.
The next step was trying to log in to those other gears. There is unfortunately little information available on what the exact address of those gears is. There used to be a document up at https://www.openshift.com/faq/can-i-access-my-applications-gear, but for some reason it has been taken down. Vaguely remembering that the address of the other gears is based on what [app url]/haproxy-status lists, we tried to ssh to that from the primary gear, but “nothing happened”. It looked like the ssh command was broken. Even ssh’ing into foo (ssh foo) resulted in nothing happening.
With the help of the kind people from OpenShift on the IRC channel, we discovered that ssh on an OpenShift gear is simply silent by default. With the -v option you do get the normal output. Furthermore, when you install the rhc client tools locally you can use the following command to list the SSH URLs of all your gears:
rhc app show [app] --gears
This returns the following:
ID         State    Cartridges              Size   SSH URL
[number1]  started  jbosseap-6 haproxy-1.4  small  [number1]@[app]-[domain].rhcloud.com
[number2]  started  jbosseap-6 haproxy-1.4  small  [number2]@[number2]-[app]-[domain].rhcloud.com
[number3]  started  jbosseap-6 haproxy-1.4  small  [number3]@[number3]-[app]-[domain].rhcloud.com
We can now ssh into the other gears using the [numberX]@[numberX]-[app]-[domain].rhcloud.com pattern, e.g.
ssh [number2]@[number2]-[app]-[domain].rhcloud.com
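As a small sketch of that address pattern, the function below builds the SSH address of a non-primary gear from its ID, the app name and the domain. The name gear_ssh_address and its three parameters are made up for illustration and are not part of rhc. Note that the primary gear does not follow this pattern; per the listing it is [number1]@[app]-[domain].rhcloud.com.

```shell
# Hypothetical helper (not part of rhc): print the SSH address of a
# non-primary gear, following the [id]@[id]-[app]-[domain].rhcloud.com pattern.
gear_ssh_address() {
  gear_id=$1   # gear ID as listed by "rhc app show [app] --gears"
  app=$2       # application name
  domain=$3    # OpenShift domain (namespace)
  printf '%s@%s-%s-%s.rhcloud.com\n' "$gear_id" "$gear_id" "$app" "$domain"
}
```

It could then be used as ssh -v "$(gear_ssh_address [number2] [app] [domain])", where -v makes the otherwise silent ssh on the gear show its output.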
In our particular case, the file system on the gear identified by [number2] was completely full. Simply deleting the log files from /jbosseap/logs fixed the problem. After that we could use the gear command to stop and start the JBoss instance (ctl_all and ctl_app seem to be deprecated):
gear stop
gear start
And lo and behold, the gear came back to life. After doing the same for the [number3] gear, both nodes were up and running again and requests to our app were serviced as normal.
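For what it’s worth, a full disk like this can be spotted with standard tools once you’re logged in to the gear: df -h shows how full the file system is, and something like the sketch below lists the biggest files under a directory. The function name largest_files is made up for illustration, and which directory to pass (the JBoss log directory in our case) is up to you.

```shell
# Hypothetical helper: list the N largest files (size in bytes, then path)
# under a directory, biggest first. Assumes file names contain no newlines.
largest_files() {
  dir=$1
  n=${2:-5}   # number of files to show, defaults to 5
  find "$dir" -type f | while IFS= read -r f; do
    # wc -c gives the size in bytes; tr strips padding some wc versions add
    printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
  done | sort -rn | head -n "$n"
}
```

Running largest_files on the log directory before deleting anything shows which log files are actually eating the space.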
One thing we also discovered is that by default OpenShift installs and starts a JBoss instance on the gear that hosts the proxy, but for a reason that probably only the proverbial engineer who left long ago knows, no traffic is routed to that JBoss instance.
In the ./haproxy/conf directory there’s a configuration file that contains, among other things, the following:
server gear-[number2]-[app] ex-std-node[node1].prod.rhcloud.com:[port1] check fall 2 rise 3 inter 2000 cookie [number2]-[app]
server gear-[number3]-[app] ex-std-node[node2].prod.rhcloud.com:[port2] check fall 2 rise 3 inter 2000 cookie [number3]-[app]
server local-gear [localip]:8080 check fall 2 rise 3 inter 2000 cookie local-[number1] disabled
As can be seen, there’s a disabled marker after the local-gear entry. Simply removing it and stopping/starting or restarting the gear will start routing requests to this gear as well.
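Removing the marker by hand in an editor works fine. As a rough sketch, the same edit can be done with sed. The function name enable_local_gear is made up for illustration, and it reads the configuration from standard input so that it makes no assumption about what the file is called.

```shell
# Hypothetical helper: copy an HAProxy configuration from stdin to stdout,
# dropping the trailing "disabled" keyword from the local-gear server line only.
enable_local_gear() {
  sed '/^[[:space:]]*server[[:space:]]\{1,\}local-gear[[:space:]]/s/[[:space:]]\{1,\}disabled[[:space:]]*$//'
}
```

One cautious way to use it is to write the result to a new file, compare it with the original using diff, and only then replace the original and restart the gear.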
Furthermore, we see that the gear’s SSH URL can indeed be derived from the number that appears in the configuration and output of haproxy. The [number2] above is exactly the same [number2] that appeared in the output of rhc app show showcase --gears.
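As a sketch of that derivation, the function below pulls the gear numbers out of the server lines of such a configuration. The name gear_ids_from_haproxy is made up for illustration, and it assumes the gear number itself contains no hyphen, so that everything between “gear-” and the next hyphen is the number.

```shell
# Hypothetical helper: read an HAProxy configuration from stdin and print the
# gear number of every "server gear-[number]-[app] ..." line, one per line.
# The local-gear line is skipped because its name does not start with "gear-".
gear_ids_from_haproxy() {
  sed -n 's/^[[:space:]]*server[[:space:]]\{1,\}gear-\([^-[:space:]]*\)-.*/\1/p'
}
```

Each number printed this way can be plugged into the [numberX]@[numberX]-[app]-[domain].rhcloud.com pattern to get the SSH address of that gear.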
This all took quite some time to figure out. How could OpenShift have done better here?
- Not take down crucial documentation such as https://www.openshift.com/faq/can-i-access-my-applications-gear.
- List all gear URLs in the web console when the application is scaled, not just the primary one.
- Implement a restart in the web console that actually works, and when a failure occurs gives back a clear error message.
- Have a restart per gear in the web console.
- List critical error conditions per gear in the web console. In this case “disk full” or “quota exceeded” seems like a common enough condition that the UI could have picked this up.
- Have a delete logs (or tidy) command in the web console that can be executed for all gears or for a single gear.
- Don’t have ssh on the gear in super silent mode.
- Have the RHC tools installed on the server. It’s weird that you can see and do more from the client than when logged in to the server itself.
All in all OpenShift is still a very impressive system that lets you deploy completely standard Java EE 6 archives to a very stable (EAP) version of JBoss, but when something goes wrong it can be frustrating to deal with the issue. The client tools are pretty advanced, but the tools that are installed on the gear itself and the web console are not there yet.