The past few weeks we’ve had a lot of issues with the tools the library builds to improve our users’ experience: custom styles and scripts for the OPAC, our discovery layer, course reserve, interlibrary loan, the link resolver, and more. Kyle, Mary, Patrick, and I want to thank you for all the reporting you’ve done to let us know when problems arise. Here’s where we stand now in our push to improve the performance of these tools.
Although the sluggishness and downtime over the past several weeks appeared to be related, it turns out that we have been dealing with different problems at different times. Here are some of the issues we’ve worked on, only to be confronted with others:
- Too much demand on our servers. All of the interactive displays in the Mary I have been running off the same server that sends custom styles and scripts to our web tools. At times, the demands placed on this server during busy research periods were too great, and it was unable to keep up. We’ve put in place several changes that we hope will address this particular issue, and we have more in the works:
- We’re moving all of the display-related code, including the room status signs, interactive displays, and the traffic monitoring applications, to its own server. It will take a week or more to get the new VPS (Virtual Private Server) set up and the software configured, but this will separate the resource-heavy applications for the physical space from the more heavily used scripts and styles needed by our website’s users. As a test, we’ve already moved the most resource-intensive application off of our production server, and it’s made a difference. (Not enough, mind you, but it was a start.)
- I’ve started closely monitoring our access logs to see what kind of traffic we are getting and how we can improve server performance based on this usage data. I’ve long made performance tweaks to our tools based on how they perform in a user’s browser, but watching how the same applications tax our server was a new thing for me. From this analysis we discovered that several of our applications were using far more resources than necessary. For instance, the event listings on the library screens were each checking for emergency closure messages every 5 seconds, meaning that every minute those displays together were hitting the server 48 times just for that task. We scaled that back to every 5 minutes, which is much more reasonable. In addition, I was able to see how Frey was using the computer availability map, which was designed to run in kiosk mode with very few assets. Frey was loading the page with the complete GVSU library template intact, though, so every 2 minutes their machine was requesting all the scripts, styles, images, and HTML needed for the whole web page. We’ve scaled the refresh on that page to use far fewer resources, while still balancing the kiosks’ need for a quicker refresh.
- I’ve implemented server caching on our production server. I had honestly thought we’d already done this years ago, but only our development server was caching assets. Caching saves a lot of work on the server because it will only send a new version of a file if it has changed since the last time the user’s browser loaded it, or if the cache has expired. Now, instead of loading the custom styles and scripts for every catalog search, users will use the version their browser has cached most of the time. (Not always, since a lot of our traffic comes from computers in on-campus labs, where the cache is cleared with every logout. But every little bit helps.)
- I’ve blocked a number of malicious bots that were crawling our pages and hammering our servers. There is a Chinese spider (called Baidu) that requests a page from our server every 20 seconds, 24 hours a day. Another Russian spider takes a different tack, picking three hours out of the day to hammer us with as many as 60 requests a minute. Both now receive 403 (Forbidden) HTTP responses, and we’ve taken other steps to keep them from attacking our server.
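To see why the closure-message polling change matters, it helps to do the arithmetic. This is a hypothetical sketch (the function and any endpoint names are my own, not our actual display code) of how the request load falls when the poll interval moves from 5 seconds to 5 minutes:

```javascript
// Requests per hour that a single display generates at a given poll interval.
// Illustrative only — the real displays use a timer like setInterval around
// their closure-message check.
function requestsPerHour(intervalMs) {
  return Math.round((60 * 60 * 1000) / intervalMs);
}

const OLD_INTERVAL = 5 * 1000;      // every 5 seconds
const NEW_INTERVAL = 5 * 60 * 1000; // every 5 minutes

console.log(requestsPerHour(OLD_INTERVAL)); // 720 requests/hour per display
console.log(requestsPerHour(NEW_INTERVAL)); // 12 requests/hour per display

// In the display code itself, the fix is a one-line change to the interval:
//   setInterval(checkForClosureMessage, NEW_INTERVAL);  // hypothetical name
```

Per display, that is a 60-fold reduction in requests for a task whose answer almost never changes.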
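The caching I described works through conditional requests: the browser sends back a validator (like an ETag) from its cached copy, and the server answers 304 Not Modified instead of resending the file. In practice the web server handles this for us; this sketch just shows the decision being made, with made-up values:

```javascript
// Minimal sketch of the conditional-GET decision behind server caching.
// Real servers (Apache, nginx) implement this for you — this is illustrative.

// Decide whether to resend a file or tell the browser its cached copy is fresh.
function respondToConditionalGet(fileEtag, ifNoneMatchHeader) {
  if (ifNoneMatchHeader && ifNoneMatchHeader === fileEtag) {
    // The browser's cached copy matches: send headers only, no body.
    return { status: 304, body: null };
  }
  // The file changed, or the browser has no cached copy: send it in full.
  return { status: 200, body: 'file contents', etag: fileEtag };
}

console.log(respondToConditionalGet('"abc123"', '"abc123"').status); // 304
console.log(respondToConditionalGet('"abc123"', '"old999"').status); // 200
```

A 304 response is tiny compared to resending our full stylesheet and script bundle on every catalog search, which is where the savings come from.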
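The bot blocking itself lives in the web server’s configuration, but the logic amounts to matching the crawler’s User-Agent string (or its IP range) and answering 403. A sketch of that rule — “Baiduspider” is the token Baidu’s crawler actually sends, while “BadBot” is a placeholder for the other blocked spider:

```javascript
// Sketch of the bot-blocking rule. In practice this is a deny rule in the
// server config, not application code.

// User-Agent substrings we refuse to serve. "Baiduspider" is Baidu's real
// crawler token; "BadBot" stands in for the other spider we blocked.
const BLOCKED_AGENTS = ['Baiduspider', 'BadBot'];

// Return the HTTP status to send for a request with the given User-Agent.
function statusForUserAgent(userAgent) {
  const blocked = BLOCKED_AGENTS.some((token) => userAgent.includes(token));
  return blocked ? 403 : 200; // 403 Forbidden for known abusive crawlers
}

console.log(statusForUserAgent('Mozilla/5.0 (compatible; Baiduspider/2.0)')); // 403
console.log(statusForUserAgent('Mozilla/5.0 (Windows NT 10.0)'));             // 200
```

User-Agent matching is easy to evade, which is why we’ve also taken other steps (like rate limiting and IP blocking) against these crawlers.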
- Our server host moved our production server to a new machine, changing the IP address.
This came at a lousy time, since we were in the middle of trying to solve all of these other issues. Yesterday Kyle, Mary, and I were troubleshooting the traffic monitoring application because it wasn’t connecting to the database. After a while, I thought to test another application that uses the database (in this case, the Status App), and sure enough, it was down too.
What made this even more difficult to troubleshoot was that, to the user, the downtime didn’t look any different from when we had an overloaded server. So the report that came in about the traffic app said that it had been down since last week (although no one had reported it then). After looking in the database, we saw that the traffic app had been recording entries until 10:30am on Monday, so it hadn’t been down all that time. It likely had trouble loading at times, though, because of the other server issues. Had we been told that it went down that morning, we would have been more likely to zero in on the IP change. But there was no way for anyone else to know that there was a difference.
- Our server host upgraded our server last week, before the move. In the midst of all of our changes to improve server performance, the host also did some upgrades, which they told us would “probably mean some sluggishness or downtime.” Great. We had no way of knowing whether the changes we were making helped, because we had multiple issues that could be causing the same symptoms, one of which we couldn’t control.
So far, looking at the server data has been encouraging. It looks like things are working better, and once we finish moving the display applications to their own VPS, things will likely be more stable. However, just so we don’t all get our hopes up: our server host will be upgrading the version of PHP our server runs sometime in the near future (they said “beginning March 1st…,” so it could be any time). That may have no effect, or some of our tools could break. I’m working on testing my scripts under the new PHP version to try to head off problems before the upgrade goes live. That’s just one of the downsides of outsourcing our hosting.
If you do experience any sluggishness loading any of our sites or tools, please let us know through the problem report form. Give us as much detail as possible, including the browser you were using, where you were (a lot of reports are coming from the 001 Lab, so I’m curious if there is something in the browser image that is slowing things down), and if you’re in a search tool, how you got there. (Searching Summon from the homepage as opposed to going directly to Summon makes a difference, for instance.)
As always, let me know if you have any questions or concerns. And thanks for being patient as we work through this!