Connectivity Problems

Due to a yet-to-be-identified source, we are seeing very large bursts of connections to large numbers of outside IP addresses. These hour-long bursts occurred at approximately 1:00am and 7:00pm on Sunday, and 1:00am and 7:00am on Monday. Each event filled the firewall connection table and disrupted connections for about 3 hours.

Update: While the source has been identified, we have not been able to reach the user. The traffic began again at 1:00pm today. We have disabled that port. You may notice some delays for a few more minutes while the network settles.
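For readers curious how a source like this can be spotted, one approach is to count the distinct outside destinations per internal source in a sample of the firewall connection table. The sketch below is illustrative only: the record format (source/destination pairs), the function name top_talkers, the threshold, and the sample addresses are all assumptions, not a description of the tooling we used.

```python
from collections import defaultdict


def top_talkers(connections, threshold):
    """Return (source, distinct destination count) pairs above threshold.

    connections: iterable of (source_ip, destination_ip) tuples, a
    hypothetical export of the firewall connection table.
    """
    destinations = defaultdict(set)
    for source, destination in connections:
        destinations[source].add(destination)
    # Keep only sources that reach more than `threshold` distinct hosts,
    # busiest first.
    flagged = [(src, len(dsts)) for src, dsts in destinations.items()
               if len(dsts) > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample: one host fanning out widely, one host repeatedly
    # contacting a single destination.
    sample = [("10.0.0.5", f"198.51.100.{i}") for i in range(50)]
    sample += [("10.0.0.9", "203.0.113.7")] * 50
    print(top_talkers(sample, threshold=10))
```

Counting distinct destinations rather than raw connections is a deliberate choice here: a host that opens many connections to one server is usually ordinary traffic, while a host fanning out to many outside addresses matches the pattern described above.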

Downtime: Thursday, December 18, 2008

On Thursday, December 18, 2008, we will have a scheduled downtime from 8:00am to 10:00am EST.

This downtime only affects direct and indirect users of the project file server. This includes the web servers, cycle servers, c2 cluster, ftp server, and the ftp mirror.

Note that e-mail, networking, the CVS server, and the database machines will remain operational during this time.

As one of the steps to clean up the file system mess, we will do a final sync between our temporary storage and our re-built production storage.

Downtime: Thursday, December 11, 2008

On Thursday, December 11, 2008, we will have a scheduled downtime from 4:00am to 8:00am EST.

This downtime only affects direct and indirect users of the project file server. This includes the web servers, cycle servers, c2 cluster, ftp server, and the ftp mirror.

Note that e-mail, networking, the CVS server, and the database machines will remain operational during this time.

As one of the steps to clean up the file system mess, we will do a final sync between our problematic storage and temporary storage. We will then put the temporary storage into production until we rebuild our storage pool.

Downtime: Wednesday, December 10, 2008

On Wednesday, December 10, 2008, we will have a scheduled downtime from 4:00am to 8:00am EST.

This downtime only affects users of the beowulf clusters (c2, c3, and hbar). All other services (e.g., e-mail, web, databases, file servers, and cycle servers) will remain operational.

Scheduled work includes:

  • Move the nodes in the test cluster (c3) into the production cluster (c2).
  • Upgrade the production cluster (c2) to Rocks 5.
  • Decommission the hbar cluster with its single compute node.

With the participation of Jennifer Rexford, Fei-Fei Li, and David Blei, we are adding 14 additional nodes to the cluster. Eleven of these nodes have 16 GB RAM (instead of 8 GB RAM) and 8 cores per node (instead of 4 cores per node).

The hbar cluster was created specifically so that users could experiment with an 8-core machine. The expansion of c2 makes hbar obsolete. As a result, we will decommission hbar.

Emergency Downtime: Tue, Nov 25, 2008

TODAY, Tuesday, November 25, 2008, we will have an emergency downtime from 1:00pm to 2:00pm EST (note time).

During this time we will be shutting down our infrastructure so that we can (1) revert to a previous version of the software on our faulty file system, and (2) move some of our service/infrastructure file systems to a temporary volume on an alternate file server.

We are taking these actions to help alleviate some of the file system problems we are experiencing. These changes should make the department web sites, FC 010 lab, moodle, and CVS more stable. Accessing the project file space should be no worse than it is now; our expectation is that it will be better, but still below par.

Update 1:42pm: Systems are back on-line. With additional information and upon further consideration, we opted to only perform part (2) above at this time. The department web sites, FC 010 lab, moodle, and CVS should be more stable and more responsive. We postponed reverting the file system software to allow our vendor additional time to debug the problem. We are likely to revert to a previous state early tomorrow morning.

CVS Downtime / Migration

We are bringing down the department's CVS server and moving the content from our problematic file server to a temporary location on another file server. We expect to have it back online by noon today. We will post a follow-up when it is complete.

Update 9:40am: Because we did an initial rsync for the CVS data yesterday, the final rsync this morning completed quickly and the CVS server is now back on-line. There is still a read-only dependency between the CVS server and the problematic file system; however, we expect that the CVS performance should be much closer to normal. As usual, please report problems to csstaff@cs.
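The reason the final pass was quick is that an incremental sync only copies files that are missing or have changed since the previous pass. The sketch below imitates that idea in plain Python; it is not rsync and not the procedure we ran. The function name sync_tree is hypothetical, and treating a file as unchanged when its size and modification time match only approximates rsync's default quick check. Deletions are not handled.

```python
import os
import shutil


def sync_tree(src, dst):
    """Copy files from src to dst that are missing or changed; return count.

    A file is treated as unchanged when its size and whole-second
    modification time match the copy in dst.
    """
    copied = 0
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target_dir = os.path.join(dst, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            source_path = os.path.join(root, name)
            target_path = os.path.join(target_dir, name)
            s_stat = os.stat(source_path)
            if os.path.exists(target_path):
                d_stat = os.stat(target_path)
                if (s_stat.st_size == d_stat.st_size
                        and int(s_stat.st_mtime) == int(d_stat.st_mtime)):
                    continue  # unchanged since the previous pass
            # copy2 preserves the modification time, so the next pass
            # can recognize this file as unchanged.
            shutil.copy2(source_path, target_path)
            copied += 1
    return copied
```

Running sync_tree once while the service is still up does the bulk of the copying; running it again after the service is stopped only has to move whatever changed in between, which keeps the outage short.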

Short Notice Downtimes and File System Issues

As many of you have noticed and reported, we continue to have serious performance problems with our file server. (This is the system that serves everything except home directories.) The issue has been escalated to the highest level and is getting 24/7 attention from our vendor.

To get the problem resolved as quickly as possible, we expect to have a few short downtimes (< 1 hour) with short notice (~30 minutes) this week. This notice is to make sure that you pay close attention to messages on the downtime list and on this blog.

If you are a researcher and have conference/journal/proposal deadlines this week, please let us know who is working on them and when they are due. Also, if you are an instructor and have assignments due this week (that require use of the CS infrastructure), please let us know the course and the time the assignment is due.

At this time, the read performance from the file server is acceptable; however, the write performance is pathologically slow. Note that to get the read performance where it is now, we had to temporarily disable updating the atime (access time) value. Also, snapshots on the project space have been temporarily disabled.

Update 12:37pm: We will have a brief downtime at 1:00pm today, when we reboot the file server. In an attempt to minimize disruption, other systems will be left up during the file server reboot, so you will notice a long pause while the reboot occurs. Things should return to normal once the server comes back up. The downtime should be less than 30 minutes if it goes as expected.

Emergency Downtime: Fri, Nov 21, 2008

On Friday, November 21, 2008, we will have an emergency downtime from 12:00pm to 2:00pm EST (note time).

During this time we will be shutting down our infrastructure so that we can apply a software patch to the file server.

Our file server vendor has high confidence that they have identified the bug that is the root cause of our performance and connectivity problems. They have created a custom "patch" for us that removes the bug. Given the severity of the problem for our infrastructure and the confidence from the vendor that this will correct the problem, we have opted to apply the patch during business hours TODAY.

Update 12:50pm: We have applied the patch and our systems are back online. Performance seems to be better; time will tell if the improvement is permanent. As always, please continue to report problems to csstaff.

Fileserver Problems

Our main department fileserver is having problems providing NFS (and probably SMB) service to the systems in the department. Impacted services include the main department web server (www.cs.princeton.edu), virtual web servers, the CS public unix machines (portal) and the c2 cluster. So far, e-mail appears to be working normally.

We have collected debugging information and sent it to the vendor for analysis. We are monitoring the situation and doing what we can to keep services up. Please send e-mail to csstaff if you have any new issues to report. We will update this space as we know more.
