During the week of January 28, 2013, we will be moving our centralized computing infrastructure and our cluster ("ionic") to a new location.
For all infrastructure services (e-mail, CS web sites, file services), the downtime window is:
- START: Tuesday, January 29, 2013, at 6:00 AM EST
- END: Wednesday, January 30, 2013, at 10:00 AM EST
If you use the ionic cluster, its downtime window is longer:
- START: Monday, January 28, 2013, 4:00 PM EST
- END: Thursday, January 31, 2013, 10:00 AM EST
Additional Details
"Infrastructure" is everything except the ionic cluster; it includes e-mail, web pages, the file system, printing, and general-purpose computing (i.e., penguins: tux and opus; cycles: soak, wash, rinse, and spin).
We are prioritizing the infrastructure over the ionic cluster. We will bring up the ionic cluster within 1 business day of bringing up the infrastructure.
The wired and wireless networks in both the CS Building and the CS section of the data center (e.g., PlanetLab, VICCI, SNS, Memex) will continue to work during the downtime, so users will be able to access University systems and the Internet from their desktops/laptops.
Because the CS e-mail server will be down longer than 4 hours, people sending e-mail to CS accounts can expect delay warnings of the form "warning: message not delivered after 4 hours; will re-try for 5 days." (These warnings are generated by the sending server and sent back to the sending account; the exact message text, timeout, and retry period are specific to that server.) Properly configured sending servers will keep retrying for 5 days, so incoming messages will be delivered once the infrastructure is back online.
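For the curious, these timings correspond to standard mail-queue settings on the sending server. A minimal sketch of the relevant knobs, assuming the sender runs Postfix (a hypothetical example; other mail servers use different settings and defaults):

    # /etc/postfix/main.cf on a hypothetical sending server
    # Warn the sender if a message has sat in the queue,
    # undelivered, for 4 hours.
    delay_warning_time = 4h
    # Keep retrying delivery for 5 days before giving up and
    # returning the message as undeliverable (a true bounce).
    maximal_queue_lifetime = 5d

With settings like these, mail sent to a CS address during the downtime would generate one warning after 4 hours and then be delivered automatically once our servers are back, well within the 5-day retry window.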
Due to the magnitude of the move, support services from CS Staff will be limited.
While all changes to infrastructure (including this move) carry inherent risks, CS Staff has taken significant steps to reduce these risks and to stay within the 28-hour window for the infrastructure and the additional 1-business-day window for the ionic cluster.
UPDATE 1/28/2013 at 4:08pm: The ionic cluster has been shut down in preparation for its move.
UPDATE 1/29/2013 at 7:06am: All servers have been powered down.
UPDATE 1/29/2013 at 8:47am: All infrastructure servers have been removed from their racks. Packing has begun.
UPDATE 1/29/2013 at 10:25am: Truck with the infrastructure servers has arrived at the new data center; unloading has begun. In the CS Building, the ionic cluster has been removed from its racks; packing has begun.
UPDATE 1/29/2013 at 11:15am: Truck with systems for the ionic cluster has left the CS Building.
UPDATE 1/29/2013 at 11:30am: Installers dropped one of our disk arrays. Assessing damage. Other work continues. For this kind of eventuality, we had engaged our storage system vendor in advance and already had an on-site engineer in place to help.
UPDATE 1/29/2013 at 12:42pm: We are working with the vendor to attempt to get a replacement disk array chassis today. Racking of the other infrastructure systems continues. Unloading of the ionic cluster from the truck continues.
UPDATE 1/29/2013 at 3:00pm: All systems are in the machine room. Approximately half are in racks. Some are starting to be wired up. Still waiting for arrival of replacement disk array chassis.
UPDATE 1/29/2013 at 4:20pm: All systems are mounted in their racks. Infrastructure systems have been cabled. Awaiting arrival of replacement disk array chassis.
UPDATE 1/29/2013 at 6:10pm: Replacement disk array chassis due to arrive by 8:15pm. We have tested several systems successfully. The disk array chassis is 1 of 7 chassis in our file server system. Even with this setback, we still believe we will meet our deadlines to be back online.
UPDATE 1/29/2013 at 8:25pm: Replacement disk array arrived. Work continues.
UPDATE 1/29/2013 at 9:50pm: Components have been moved from the damaged chassis to the new chassis. We are working with the vendor to bring the replacement chassis online.
UPDATE 1/29/2013 at 11:20pm: Chassis replacement complete and system operational. Work continues.
UPDATE 1/30/2013 at 12:10am: Infrastructure services (e-mail, CS web sites, file services) are starting to come back online.
UPDATE 1/30/2013 at 12:40am: All infrastructure services (e-mail, CS web sites, file services) are now online. The ionic cluster is the only service still down; we will be bringing it back online sometime after 1:00pm today.
UPDATE 1/30/2013 at 8:15am: While not all nodes are yet online, the ionic cluster is operational and available for use. The remaining down nodes will come up this morning, when an additional power strip is installed in one of the cluster racks.
UPDATE 1/30/2013 at 10:55am: The ionic cluster is fully online. At this point, all systems should be operating normally.