Fileserver Problems

Our main department fileserver is having problems providing NFS (and probably SMB) service to the systems in the department. Impacted services include the main department web server (www.cs.princeton.edu), virtual web servers, the CS public unix machines (portal) and the c2 cluster. So far, e-mail appears to be working normally.

We have collected debugging information and sent it to the vendor for analysis. We are monitoring the situation and doing what we can to keep services up. Please send e-mail to csstaff if you have any new issues to report. We will update this space as we know more.

Emergency Downtime: Thu, Nov 6, 2008

On Thursday, November 6, 2008, we will have an emergency downtime from 4:00am to 8:00am EST.

During this time we will be working on fixing the performance and connectivity problems with our file server.

Beginning tomorrow (November 5, 2008), our vendor will be providing an on-site engineer to help diagnose the problem and to help us with our plan of attack for Thursday’s early-morning downtime.

Emergency Downtime: Mon, Nov 3, 2008

On Monday, November 3, 2008, we will have an emergency downtime from 3:00am to 9:00am EST (note time).

During this time we will be shutting down our infrastructure so that we can re-install the operating system on the file server.

We continue to have performance/connectivity problems with the file server. The vendor (who has been working the issue over the weekend) is suggesting that we do a fresh install of the OS so that we have a clean slate. While we don’t know if this will correct the problems, our hope is that, at the very least, we will regain critical tracing capability that will then help us find and correct the problem.

Update 9:00am: Our systems are still down so that we can continue to debug the problem with our filesystem. We will bring our systems back online by 11:00am. Our hope is that we will have resolved the file server issue by then. However, in the event that we are unsuccessful, we will revert to our previous configuration/state that, while problematic, works more than 0% of the time.

Update 11:10am: We were not able to correct the problem with the file server and had to roll back to our previous state. We will announce another downtime soon.

Emergency Downtime: Thu, Oct 30, 2008

On Thursday, October 30, 2008, we will have an emergency downtime from 4:00am to 8:00am EDT.

During this time we will be shutting down our infrastructure so that we can apply updates to the file server. The vendor has requested that we apply these updates so that we can have the very latest configuration while we continue to diagnose the ongoing performance and connectivity issues.

Update: We applied the patches and have escalated the issue with the vendor. We are still seeing performance and connectivity problems with the file server. These problems impact project space, the cvs server, and the web site.

Update 11/1, 12:15pm: We continue to have issues. Our vendor has assigned personnel to work the problem over the weekend.

Downtime: Tuesday, October 28, 2008

On Tuesday, October 28, 2008, we will have a scheduled downtime from 8:00am to 8:30am EDT (revised from the original 4:00am to 8:00am EDT window).

Scheduled work includes:

  • Replace the penguins cycle servers: The three old 32-bit Intel-based machines (tux, opus, willy) will be replaced with two 64-bit AMD-based machines. The two new machines will take the names opus and tux. We will deprecate the name willy.
  • OS updates for some of our infrastructure machines

SPECIAL NOTE: As we are replacing the hardware for the Linux cycle servers, all user crontabs on these machines will be deleted. You will need to back up your crontabs before the downtime and restore them after the downtime.
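For anyone who wants to script the crontab save and restore, here is a minimal Python sketch. It assumes the standard `crontab` command is on your PATH; the function names and the backup file location are hypothetical and chosen only for illustration. From a shell, the equivalent steps are `crontab -l > FILE` before the downtime and `crontab FILE` afterwards.

```python
import subprocess
from pathlib import Path


def backup_crontab(dest, crontab_cmd="crontab"):
    """Save the current user's crontab (the output of `crontab -l`) to dest.

    Returns False when `crontab -l` exits non-zero, which is what happens
    when the user has no crontab installed.
    """
    result = subprocess.run([crontab_cmd, "-l"], capture_output=True, text=True)
    if result.returncode != 0:
        return False
    Path(dest).write_text(result.stdout)
    return True


def restore_crontab(src, crontab_cmd="crontab"):
    """Install the crontab saved in src (the same as running `crontab FILE`)."""
    subprocess.run([crontab_cmd, str(src)], check=True)
```

Run `backup_crontab` on each cycle server where you have a crontab, keep the saved file somewhere that survives the hardware swap (your home directory, assuming it is not stored on the machine being replaced), and call `restore_crontab` with that file once the new machines are up.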

Why it is happening:

  • The penguins machines are being replaced because they are outdated and have become difficult to support.
  • The infrastructure machines are getting OS patches as part of normal maintenance.

Update 10/27/08: Because we had an emergency downtime this morning, we were able to do most of the work scheduled above. We are now having a short downtime to swap the penguins cycle servers. Other parts of the department infrastructure will remain operational.

Downtime: Tue, Sept 23, 2008

On Tuesday, September 23, 2008, we will have a brief scheduled downtime from 6:30am to 7:00am EDT.

Scheduled work includes:

  • Increase the memory available to our IMAP server
  • Update the PHP version for our core webserver

With the exception of e-mail and some web services, the department infrastructure will remain up during this brief maintenance window.

Summer 2008 Downtime Schedule

Here is the maintenance schedule for the summer:

Tue, Jun 24, 2008, 4:00am-8:00am
Tue, Jul 8, 2008, 4:00am-8:00am
Tue, Jul 22, 2008, 4:00am-8:00am
Tue, Aug 5, 2008, 4:00am-8:00am
Tue, Aug 19, 2008, 4:00am-8:00am
Tue, Sep 2, 2008, 4:00am-8:00am

During these times we will be performing a variety of update, installation, and maintenance tasks.

Downtime: Tuesday, June 10, 2008

On Tuesday, June 10, 2008, we will have a scheduled downtime of the entire CS computing and networking infrastructure during normal business hours beginning at 6:00am. We don’t have an exact completion time but anticipate that everything will be back up before 2:00pm.

Who is affected:

  • This downtime will impact all users of the departmental infrastructure. We will power down all equipment in room 218 including the network (wired and wireless), web servers, mail servers, compute servers, and file servers.

What is happening:

  • We are upgrading our battery-backup power system for room 218. This includes replacing the existing UPS (40kVA, 208V, 3φ) with a larger unit (80kVA, 480V, 3φ). Because we are reconfiguring the system to operate at 480V and installing a new bypass switch, we must power down the entire machine room to perform the work.

Why it is happening:

  • Due to continued growth of the department’s infrastructure, we reached the capacity of our current power configuration. This new configuration will allow us to install additional infrastructure equipment to meet the department’s needs.

Update 2:10pm: While the work is moving along smoothly, it is taking longer than anticipated. Our new estimate to be online is 5:00pm.

Update 5:05pm: A circuit breaker has failed in the new UPS. We are working with the vendor to get a replacement unit ASAP. We are hopeful that this will be first thing in the morning. We’ll know later tonight about the specific ETA. We do not anticipate the systems coming back online tonight.

Update 6:25pm: The field engineer was able to track down a replacement circuit breaker in Virginia. It will be shipped overnight and is due to be in the building by 8:30am. The earliest we anticipate being back online is 11:00am. However, there are still several unknowns, so this is only a lower bound.

Update Wednesday, 12:25pm: The field engineer installed the replacement circuit breaker. Unfortunately, it exhibited the same problem, and he is now debugging the system. We put in a call to the vendor and the engineer’s supervisor is on his way (with additional spare parts, if needed) to assist with the troubleshooting. We are now simultaneously working to get the new UPS online and weighing our options in the event that things continue to drag out. One option is to bring things back online without any protection from power hits; this is risky because, without protection, a power event can bring down the room and damage equipment. The last time this occurred, it degraded our systems and resulted in a series of failures over a period of weeks; some of the failures led to the permanent loss of user data. In addition, we will need to bring the room back down to complete the UPS installation in any event. We understand that we must get the systems back online ASAP.

Update Wednesday, 5:25pm: Our new UPS is up and running. We have begun our normal start-up procedures. We anticipate being online at approximately 6:00pm.
