Emergency Downtime: Thu, Nov 6, 2008

On Thursday, November 6, 2008, we will have an emergency downtime from 4:00am to 8:00am EST.

During this time we will be working on fixing the performance and connectivity problems with our file server.

Beginning tomorrow (November 5, 2008), our vendor will be providing an on-site engineer to help diagnose the problem and to help us with our plan of attack for Thursday’s early-morning downtime.

Emergency Downtime: Mon, Nov 3, 2008

On Monday, November 3, 2008, we will have an emergency downtime from 3:00am to 9:00am EST (note time).

During this time we will be shutting down our infrastructure so that we can re-install the operating system on the file server.

We continue to have performance/connectivity problems with the file server. The vendor (who has been working the issue over the weekend) is suggesting that we do a fresh install of the OS so that we have a clean slate. While we don’t know if this will correct the problems, our hope is that, at the very least, we will regain critical tracing capability that will then help us find and correct the problem.

Update 9:00am: Our systems are still down so that we can continue to debug the problem with our filesystem. We will bring our systems back online by 11:00am. Our hope is that we will have resolved the file server issue by then. However, in the event that we are unsuccessful, we will revert to our previous configuration/state that, while problematic, works more than 0% of the time.

Update 11:10am: We were not able to correct the problem with the file server and had to roll back to our previous state. We will announce another downtime soon.

Emergency Downtime: Thu, Oct 30, 2008

On Thursday, October 30, 2008, we will have an emergency downtime from 4:00am to 8:00am EDT.

During this time we will be shutting down our infrastructure so that we can apply updates to the file server. The vendor has requested that we apply these updates so that we can have the very latest configuration while we continue to diagnose the ongoing performance and connectivity issues.

Update: We applied the patches and have escalated the issue with the vendor. We are still seeing performance and connectivity problems with the file server. These problems impact project space, the CVS server, and the web site.

Update 11/1, 12:15pm: We continue to have issues. Our vendor has assigned personnel to work the problem over the weekend.

Downtime: Tuesday, October 28, 2008

On Tuesday, October 28, 2008, we will have a scheduled downtime from 8:00am to 8:30am EDT (originally scheduled for 4:00am to 8:00am EDT).

Scheduled work includes:

  • Replace the penguins cycle servers: The three old 32-bit Intel-based machines (tux, opus, willy) will be replaced with two 64-bit AMD-based machines. The two new machines will take the names opus and tux. We will deprecate the name willy.
  • OS updates for some of our infrastructure machines

SPECIAL NOTE: As we are replacing the hardware for the Linux cycle servers, all user crontabs on these machines will be deleted. You will need to back up your crontabs before the downtime and restore them after the downtime.
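
One way to save and restore a crontab is sketched below. This is a minimal illustration, not an official procedure: it assumes the standard `crontab -l` and `crontab FILE` commands are available on the cycle servers, and the helper names and the backup file name are hypothetical. The backup must be written somewhere that survives the hardware swap; your home directory is assumed here.

```python
import subprocess
from pathlib import Path


def backup_crontab(dest: Path, run=subprocess.run) -> bool:
    """Save the current user's crontab to dest.

    Returns False (and writes nothing) if the user has no crontab.
    The `run` parameter is a hypothetical hook that defaults to
    subprocess.run; it exists only so the command can be stubbed out.
    """
    result = run(["crontab", "-l"], capture_output=True, text=True)
    if result.returncode != 0:
        # `crontab -l` exits non-zero when no crontab is installed.
        return False
    dest.write_text(result.stdout)
    return True


def restore_crontab(src: Path, run=subprocess.run) -> None:
    """Install the crontab saved in src, replacing any existing one."""
    run(["crontab", str(src)], check=True)
```

Before the downtime you might call `backup_crontab(Path.home() / "crontab.backup")` on each machine where you have a crontab, then call `restore_crontab(Path.home() / "crontab.backup")` once the new machines are up. Note that `willy` is being retired, so a crontab saved from it would need to be restored on `opus` or `tux` instead.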

Why is it happening:

  • The penguins machines are being replaced because they are outdated and have become difficult to support.
  • The infrastructure machines are getting OS patches as part of normal maintenance.

Update 10/27/08: Because we had an emergency downtime this morning, we were able to do most of the work scheduled above. We are now having a short downtime to swap the penguins cycle servers. Other parts of the department infrastructure will remain operational.

Downtime: Tue, Sept 23, 2008

On Tuesday, September 23, 2008, we will have a brief scheduled downtime from 6:30am to 7:00am EDT.

Scheduled work includes:

  • Increase the memory available to our IMAP server
  • Update the PHP version for our core webserver

With the exception of e-mail and some web services, the department infrastructure will remain up during this brief maintenance window.

Summer 2008 Downtime Schedule

Here is the maintenance schedule for the summer:

Tue, Jun 24, 2008, 4:00am-8:00am
Tue, Jul 8, 2008, 4:00am-8:00am
Tue, Jul 22, 2008, 4:00am-8:00am
Tue, Aug 5, 2008, 4:00am-8:00am
Tue, Aug 19, 2008, 4:00am-8:00am
Tue, Sep 2, 2008, 4:00am-8:00am

During these times we will be performing a variety of update, installation, and maintenance tasks.

Downtime: Tuesday, June 10, 2008

On Tuesday, June 10, 2008, we will have a scheduled downtime of the entire CS computing and networking infrastructure during normal business hours beginning at 6:00am. We don’t have an exact completion time but anticipate that everything will be back up before 2:00pm.

Who is affected:

  • This downtime will impact all users of the departmental infrastructure. We will power down all equipment in room 218 including the network (wired and wireless), web servers, mail servers, compute servers, and file servers.

What is happening:

  • We are upgrading our battery-backup power system for room 218. This includes replacing the existing UPS (40kVA, 208V, 3φ) with a larger unit (80kVA, 480V, 3φ). Because we are reconfiguring the system to operate at 480V and installing a new bypass switch, we must power down the entire machine room to perform the work.

Why it is happening:

  • Due to continued growth of the department’s infrastructure, we have reached the capacity of our current power configuration. This new configuration will allow us to install additional infrastructure equipment to meet the department’s needs.

Update 2:10pm: While the work is moving along smoothly, it is taking longer than anticipated. Our new estimate to be online is 5:00pm.

Update 5:05pm: A circuit breaker has failed in the new UPS. We are working with the vendor to get a replacement unit ASAP. We are hopeful that this will be first thing in the morning. We’ll know later tonight about the specific ETA. We do not anticipate the systems coming back online tonight.

Update 6:25pm: The field engineer was able to track down a replacement circuit breaker in Virginia. It will be shipped overnight and is due to be in the building by 8:30am. The earliest we anticipate being back online is 11:00am. However, there are still several unknowns, so this is only a lower bound.

Update Wednesday, 12:25pm: The field engineer installed the replacement circuit breaker. Unfortunately, it exhibited the same problem and he is now debugging the system. We put in a call to the vendor and the engineer’s supervisor is on his way (with additional spare parts, if needed) to assist with the troubleshooting. We are now simultaneously working to get the new UPS online and weighing our options in the event that things continue to drag out. One option is to bring things back online without any protection from power hits; this is risky because, without protection, a power event can bring down the room and damage equipment. The last time this occurred, it degraded our systems and resulted in a series of failures over a period of weeks; some of the failures led to the permanent loss of user data. In addition, we would need to bring the room back down to complete the UPS installation in any event. We understand that we must get the systems back online ASAP.

Update Wednesday, 5:25pm: Our new UPS is up and running. We have begun our normal start-up procedures. We anticipate being online at approximately 6:00pm.

Emergency Downtime: Thursday, June 5, 2008

Due to a failure in our main UPS, we need to perform emergency maintenance today beginning at 12:00pm (noon). This work will require the shutdown of our main server room and is expected to last approximately 90 minutes. We realize that many people are up against conference deadlines this week and we have not made this decision lightly. At this time our systems are not protected by backup power and any power event could cause a disruption that could last for substantially more than our expected downtime.

[Photo: close-up of the inside of the UPS unit showing charring around one of the main power cables.]

Update: At 3:30pm, we are back up. The work was only a partial success. We now have battery backup for one power event at a time. After each event we must manually reset the system to be ready for the next event. We are all crossing our fingers that the commercial power is clean through Tuesday when we will have our scheduled downtime to connect our new, bigger UPS.
