Downtime: Tuesday, October 28, 2008

On Tuesday, October 28, 2008, we will have a scheduled downtime from 8:00am to 8:30am EDT (revised from the original 4:00am to 8:00am EDT window).

Scheduled work includes:

  • Replace the penguins cycle servers: The three old 32-bit Intel-based machines (tux, opus, willy) will be replaced with two 64-bit AMD-based machines. The two new machines will take the names opus and tux. We will deprecate the name willy.
  • OS updates for some of our infrastructure machines

SPECIAL NOTE: As we are replacing the hardware for the Linux cycle servers, all user crontabs on these machines will be deleted. You will need to back up your crontabs before the downtime and restore them after the downtime.
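As an illustration of the backup and restore steps above, here is a minimal Python sketch. It relies only on the standard `crontab -l` (print the current user's crontab) and `crontab FILE` (install a crontab from a file) commands. The function names, the `run` parameter, and the choice of backup location are assumptions made for this example, not part of any department tooling.

```python
import subprocess
from pathlib import Path


def backup_crontab(dest, run=subprocess.run):
    """Save the current user's crontab to `dest`; return True if one was saved."""
    # `crontab -l` prints the calling user's crontab to stdout.
    result = run(["crontab", "-l"], capture_output=True, text=True)
    if result.returncode != 0:
        # A non-zero exit usually means the user has no crontab installed.
        return False
    Path(dest).write_text(result.stdout)
    return True


def restore_crontab(src, run=subprocess.run):
    """Install the crontab stored in `src`, replacing any existing one."""
    # `crontab FILE` replaces the user's crontab with the contents of FILE.
    # check=True raises CalledProcessError if the crontab command fails.
    run(["crontab", str(src)], check=True)
```

Before the downtime you might call `backup_crontab` once on each cycle server where you have cron jobs, writing to a location that survives the hardware swap (for example, a file in your home directory, assuming it lives on the file server rather than on the cycle server itself). Afterwards, call `restore_crontab` with that file on tux or opus. The `run` parameter exists only so the functions can be exercised without touching a real crontab.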

Why it is happening:

  • The penguins machines are being replaced because they are outdated and have become difficult to support.
  • The infrastructure machines are getting OS patches as part of normal maintenance.

Update 10/27/08: Because we had an emergency downtime this morning, we were able to do most of the work scheduled above. We are now having a short downtime to swap the penguins cycle servers. Other parts of the department infrastructure will remain operational.

Downtime: Tue, Sept 23, 2008

On Tuesday, September 23, 2008, we will have a brief scheduled downtime from 6:30am to 7:00am EDT.

Scheduled work includes:

  • Increase the memory available to our IMAP server
  • Update the PHP version for our core webserver

With the exception of e-mail and some web services, the department infrastructure will remain up during this brief maintenance window.

Summer 2008 Downtime Schedule

Here is the maintenance schedule for the summer:

Tue, Jun 24, 2008, 4:00am-8:00am
Tue, Jul 8, 2008, 4:00am-8:00am
Tue, Jul 22, 2008, 4:00am-8:00am
Tue, Aug 5, 2008, 4:00am-8:00am
Tue, Aug 19, 2008, 4:00am-8:00am
Tue, Sep 2, 2008, 4:00am-8:00am

During these times we will be performing a variety of update, installation, and maintenance tasks.

Downtime: Tuesday, June 10, 2008

On Tuesday, June 10, 2008, we will have a scheduled downtime of the entire CS computing and networking infrastructure during normal business hours beginning at 6:00am. We don’t have an exact completion time but anticipate that everything will be back up before 2:00pm.

Who is affected:

  • This downtime will impact all users of the departmental infrastructure. We will power down all equipment in room 218 including the network (wired and wireless), web servers, mail servers, compute servers, and file servers.

What is happening:

  • We are upgrading our battery-backup power system for room 218. This includes replacing the existing UPS (40kVA, 208V, 3φ) with a larger unit (80kVA, 480V, 3φ). Because we are reconfiguring the system to operate at 480V and installing a new bypass switch, we must power down the entire machine room to perform the work.

Why it is happening:

  • Due to continued growth of the department’s infrastructure, we reached the capacity of our current power configuration. This new configuration will allow us to install additional infrastructure equipment to meet the department’s needs.

Update 2:10pm: While the work is moving along smoothly, it is taking longer than anticipated. Our new estimate to be online is 5:00pm.

Update 5:05pm: A circuit breaker has failed in the new UPS. We are working with the vendor to get a replacement unit ASAP. We are hopeful that this will be first thing in the morning. We’ll know later tonight about the specific ETA. We do not anticipate the systems coming back online tonight.

Update 6:25pm: The field engineer was able to track down a replacement circuit breaker in Virginia. It will be shipped overnight and is due to be in the building by 8:30am. The earliest we anticipate being back online is 11:00am. However, several unknowns remain, so this is only a lower bound.

Update Wednesday, 12:25pm: The field engineer installed the replacement circuit breaker. Unfortunately, it exhibited the same problem and now he is debugging the system. We put in a call to the vendor and the engineer’s supervisor is on his way (with additional spare parts, if needed) to assist with the troubleshooting. We are now simultaneously working to get the new UPS online and weighing our options in the event that things continue to drag out. One option is to bring things back online without any protection from power hits; this is risky as without protection, a power event can bring down the room and damage equipment. The last time this occurred, it degraded our systems and resulted in a series of failures over a period of weeks; some of the failures led to the permanent loss of user data. In addition, we will need to bring the room back down to complete the UPS installation in any event. We understand that we must get the systems back online ASAP.

Update Wednesday, 5:25pm: Our new UPS is up and running. We have begun our normal start-up procedures. We anticipate being online at approximately 6:00pm.

Emergency Downtime: Thursday, June 5, 2008

Due to a failure in our main UPS, we need to perform emergency maintenance today beginning at 12:00pm (noon). This work will require the shutdown of our main server room and is expected to last approximately 90 minutes. We realize that many people are up against conference deadlines this week and we have not made this decision lightly. At this time our systems are not protected by backup power and any power event could cause a disruption that could last for substantially more than our expected downtime.

Here’s a close-up of the inside of the UPS unit showing charring around one of the main power cables.

[Image: close-up of the inside of the UPS unit]

Update: At 3:30pm, we are back up. The work was only a partial success. We now have battery backup for one power event at a time. After each event we must manually reset the system to be ready for the next event. We are all crossing our fingers that the commercial power is clean through Tuesday when we will have our scheduled downtime to connect our new, bigger UPS.

File System Issues

We seem to be having issues with the server that handles project file systems. We are investigating the problem.

Update 12:37pm: We are heading into the department to perform a reboot of the file server. This involves a shutdown and restart of most of the systems.

Update 5:40pm: Systems are now back up. Due to some temperamental hardware, the process did not go particularly smoothly.

Downtime: Monday, February 18, 2008

On Monday, February 18, 2008, we will have a scheduled downtime of the Linux public cycle servers during normal business hours.

Who is affected:

  • This downtime will impact users of the 32-bit Linux cycle servers (tux, opus, and willy) and the 64-bit Linux cycle servers (soak, wash, rinse, and spin).

What is happening:

  • The operating systems on these machines will be upgraded. Note that this upgrade will take place during normal hours. We will upgrade the machines in stages (one 32-bit and one 64-bit machine at a time) to minimize disruption. Users logged in will be given a 5-minute warning before each machine is brought down. We expect that each machine will be down for less than an hour.
  • Note that we will begin the upgrades at approximately 10:30am.

Special Note: because these upgrades involve a re-install of the operating system, crontab entries will be lost. If you have cron jobs that run on these machines, be sure to re-establish them after the upgrade.

Why it is happening:

  • These upgrades address several critical security issues.

As of 2:09pm all updates have finished.

Downtime: Thu, February 7, 2008

On Thursday, February 7, 2008, we will have a scheduled downtime from 4:00am to 8:00am EST.

Scheduled work includes:

  • Firmware updates to the disk array on our home directory file server
  • Upgrade virtual hosting web server to Apache 2
  • Operating system patches

Why it is happening:

  • Ordinarily, we would avoid having a downtime during the first week of classes; however, our file server vendor has classified a recent firmware update as “critical,” meaning there is the potential for data loss.
  • While our systems are down, we plan to upgrade the virtual host web server (virtweb) that hosts URLs of the form http://[projectname].cs.princeton.edu to Apache 2. This will allow web pages to host files larger than 2 GB.

Downtime: Wednesday, January 30, 2008

On Wednesday, January 30, 2008, we will have a scheduled downtime of the Linux public cycle servers during normal business hours.

Who is affected:

  • This downtime will impact users of the 32-bit Linux cycle servers (tux, opus, and willy) and the 64-bit Linux cycle servers (soak, wash, rinse, and spin).

What is happening:

  • The operating systems on these machines will be upgraded to CentOS 5.1. Note that this upgrade will take place during normal hours. We will upgrade the machines in stages (one 32-bit and one 64-bit machine at a time) to minimize disruption. Users logged in will be given a 5-minute warning before each machine is brought down. We expect that each machine will be down for less than an hour.
  • Note that we will begin the upgrades at approximately 9:00am with tux and soak.

Special Note: because these upgrades involve a re-install of the operating system, crontab entries will be lost. If you have cron jobs that run on these machines, be sure to re-establish them after the upgrade.

Why it is happening:

  • These upgrades address several security issues. We are upgrading them now to avoid disruption once the Spring term begins.
