Emergency Maintenance Complete
Tuesday, January 27, 2015 - 11:34pm
All Production CARLI Linux servers have been patched and rebooted.
Emergency Maintenance Starting
Tuesday, January 27, 2015 - 9:06pm
Due to a newly announced Linux glibc vulnerability called "GHOST" (CVE-2015-0235) that could be a serious threat to the security of our systems, Production CARLI servers/services will now start going offline to be patched and rebooted.
I will post another update when this work has been completed. If a service is offline or not working properly after emergency maintenance is completed, please contact firstname.lastname@example.org.
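For anyone curious how a host can be triaged for this issue: GHOST was fixed upstream in glibc 2.18, so comparing the installed version against 2.18 is a quick first-pass check. A minimal sketch (the version strings below are examples; many vendors backported the fix to older glibc versions, so the distribution's patched-package list is the authoritative test):

```shell
#!/bin/sh
# Hedged sketch: compare a glibc version string against 2.18, the first
# upstream release containing the GHOST (CVE-2015-0235) fix. Many distros
# backported the fix to older versions, so this is only first-pass triage,
# not a definitive vulnerability test.
glibc_maybe_vulnerable() {
  ver="$1"
  major="${ver%%.*}"                      # e.g. "2" from "2.11.1"
  minor="${ver#*.}"; minor="${minor%%.*}" # e.g. "11" from "2.11.1"
  if [ "$major" -lt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -lt 18 ]; }; then
    echo "possibly vulnerable"
  else
    echo "patched upstream"
  fi
}

# The local version can be read from: ldd --version | head -n1
glibc_maybe_vulnerable "2.11"   # prints "possibly vulnerable"
glibc_maybe_vulnerable "2.19"   # prints "patched upstream"
```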
VuFind Performance this Morning (Monday, Jan 12)
Monday, January 12, 2015 - 2:22pm
This morning at around 9:20AM, the Production VuFind server (http://vufind.carli.illinois.edu) stopped responding. We tweaked the Apache 2.4 conf settings and it was running smoothly again by 9:50AM.
Our apache2.conf file contains performance tuning parameters, and we discovered that Apache 2.4 loads separate conf files that reset those parameters to their defaults. We need to make a few more changes to the Apache conf files to clean them up, but I don't see any other issues from the operating system upgrade.
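For illustration, this is the kind of override involved: on Debian/Ubuntu, Apache 2.4 reads MPM settings from a separate file such as mods-enabled/mpm_prefork.conf, and values there win over anything set in apache2.conf. A hedged sketch of such a file (the numbers are illustrative, not our actual tuning):

```apache
# Hedged sketch: mods-enabled/mpm_prefork.conf on Ubuntu can silently
# reset tuning done in apache2.conf. Values below are examples only.
<IfModule mpm_prefork_module>
    StartServers             10
    MinSpareServers          10
    MaxSpareServers          20
    MaxRequestWorkers       250
    MaxConnectionsPerChild 1000
</IfModule>
```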
If you would like to see how VuFind is performing over time, you can look at the same performance graphs that I watch: http://orca.carli.illinois.edu/vufind.html
VuFind Downtime Sunday Morning Jan 11
Friday, January 9, 2015 - 11:09am
This Sunday morning between 6AM and 10AM, we will upgrade the Production VuFind Apache/Linux server. This will cause an outage of at most 30 minutes while the operating system is being upgraded.
We are currently running on Ubuntu Server 10.04 64-bit and support ends for this release in April 2015. We are upgrading this server to 12.04, then to 14.04 which is supported until April 2019. We will also take this opportunity to improve our HTTPS encryption settings in Apache by using the recommendations from https://cipherli.st and by verifying the new settings at https://ssllabs.com.
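For context, the Apache TLS hardening that cipherli.st recommended around this time looked roughly like the following. This is a sketch, not our exact final configuration; any change like this should be re-verified at ssllabs.com after it is applied:

```apache
# Hedged sketch of Apache 2.4 TLS hardening in the style of
# https://cipherli.st (circa early 2015). Disables SSLv2/SSLv3 and
# prefers forward-secret AES ciphers; exact strings may differ from
# what we deploy.
SSLProtocol All -SSLv2 -SSLv3
SSLHonorCipherOrder On
SSLCipherSuite EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH
```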
VuFind Local Catalog Outages
Monday, December 8, 2014 - 12:29pm
At 7:14AM this morning (Monday, December 8th), the SOLR search index for VuFind "local" catalogs failed (a separate SOLR index handles deduplicated "consortial" searches). The java process controlling the index was pinned at 100% CPU and no requests were being processed. At 7:34AM the SOLR service was restarted, restoring service. At 11:22AM the service went offline again and was restarted.
Looking at the logs, we see java "OutOfMemoryError" messages. The java heap allocation for SOLR has been increased from 14GB to 20GB of RAM, and we will continue to monitor this service.
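How the heap gets raised depends on how SOLR is launched; in a typical start script the change amounts to a pair of JVM flags. A minimal sketch (the variable name and start command are assumptions, not our actual script):

```shell
# Hedged sketch: raise the SOLR JVM heap to 20GB.
# -Xms/-Xmx set initial and maximum heap; keeping them equal avoids
# heap-resizing pauses. Names below are illustrative.
JAVA_OPTS="-Xms20g -Xmx20g"
java $JAVA_OPTS -jar start.jar
```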
SFX Performance Changes
Tuesday, September 9, 2014 - 9:40am
Last month UIUC migrated to our SFX server, becoming our 53rd hosted customer. While focusing on the SFX server during this migration, we (the CARLI IT staff) wondered whether there were any changes we could make to improve overall performance. One of the tuning changes (virtual CPU allocation) was implemented this past Saturday with mixed results, but on Monday it became clear that SFX performance was getting much worse under load. We modified the settings again and rebooted the server during the lunch hour yesterday (Monday, Sept 8th). The new setting is stable so far and appears to have improved performance. We will monitor the system and back out the changes if we reach the same "tipping point" that we hit Monday morning.
I apologize for the SFX service disruption yesterday. As we continue to look for new ways to make our services perform better, we will do our best to avoid service outages.
Downtime Sunday, August 10th for Patching
Friday, August 1, 2014 - 4:48pm
Sunday, August 10th from 6AM to 10AM, Production Voyager, VuFind, CONTENTdm, and SFX services will be taken offline to apply operating system patches. Downtime for each service should be less than an hour.
If you discover any issues with the services after 10AM, please contact email@example.com.
Network Outage this Saturday, July 12th
Monday, July 7, 2014 - 3:51pm
This Saturday, July 12th between 4AM and Noon, UIUC campus networking staff (CITES) will replace the backbone routers connected to the Production CARLI Data Center.
For some period of time during this upgrade, all CARLI services could be unavailable.
Information about this network event, any status updates, and an announcement that work has been completed can be found on the following website:
After the upgrade, please contact firstname.lastname@example.org if you notice that any CARLI services are not working properly. We will be checking systems and restarting any that were impacted by the network outage. If there are any ongoing issues after the work is completed, they will be posted to http://www.carli.illinois.edu/system-status.
Network Switch Failure
Friday, June 6, 2014 - 9:04pm
At approximately 6:19PM today (Friday, June 6th), one of the two redundant network switches in the Production Data Center entered a failed state: all lights were on, but no traffic was moving through the switch. CARLI servers are plugged into both switches for redundancy, but the failure took down all networking anyway. The failed switch was restarted and all services were back online at 6:54PM.
We will need to look closely at the log files to see if we can determine what caused the switch failure and why the redundancy did not protect us from it.
UIUC Database Outage April 15
Tuesday, April 15, 2014 - 4:28pm
At 10:51AM this morning, our Database Administrator changed the Oracle password for the University of Illinois at Urbana-Champaign (UIUC) database account to run some tests. He thought he was logged into the Oracle Test server, but he was actually logged into the Production Oracle server. This caused Voyager client errors, forced UIUC circulation clients into "offline circ" mode, displayed the "The catalog is not available" message in UIUC's WebVoyage instance, and blocked UIUC's VuFind access. The problem was corrected; UIUC Voyager services were brought back online at 11:12AM, and VuFind at 11:29AM.
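One simple safeguard is to confirm which instance a session is actually connected to before running anything destructive. A hedged SQL sketch (the account and instance names are illustrative, not our real ones):

```sql
-- Confirm the connected instance before changing anything.
SELECT instance_name, host_name FROM v$instance;

-- Proceed only if this reports the intended TEST instance, e.g.:
-- ALTER USER uiuc_test IDENTIFIED BY "new_password";
```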
I apologize for this outage. At our next IT staff meeting we will discuss ways to prevent this type of error from happening again.