We have updated the 3.12 Linux kernel (now 3.12.45) and published a new version (3.18) on our HVM platform. Neither of these kernels supports AUFS, which may require some clients to take corrective measures for their services.

Starting today, every server that is created or rebooted on our HVM platform will automatically use version 3.12.45 of the Linux kernel, unless configured to use version 3.18 or a custom kernel.

Please note that these kernel versions do not include AUFS support. Docker users should take special notice, because AUFS has been the default storage driver for quite some time.

To continue using Docker with these new kernels, users must upgrade their Docker client and images to use a different storage driver, such as btrfs or overlayfs (the latter is available on kernel 3.18 only).
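As an illustration only: the sketch below assumes a Debian/Ubuntu-style installation where the daemon's options live in /etc/default/docker; file locations, option handling, and the service manager vary between distributions and Docker versions. In that file, set the daemon's storage driver:

DOCKER_OPTS="--storage-driver=overlay"

Then restart the daemon and check which driver is in use:

$ sudo service docker restart
$ docker info | grep "Storage Driver"

Keep in mind that images and containers created under the old aufs driver will not be visible to the new driver, so you will need to pull or rebuild your images afterwards.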

To use version 3.18, you can execute the following Gandi CLI [4] command:

$ gandi disk update --kernel "3.18-x86_64 (hvm)"

You can also change the kernel from the web interface by following these instructions [3].

After the operation is completed, make sure you reboot your server and update your software packages and kernel modules [1].
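Once the server is back up, a quick way to confirm which kernel is actually running is:

$ uname -r

The output should report a 3.18 (or 3.12.45) version string, depending on the kernel you selected.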

Clients wishing to use a custom kernel can find more information on our Wiki page [2]. You can also review the kernel update history on our Changelog [5].

[1] http://wiki.gandi.net/iaas/references/server/kernel_modules

[2] https://wiki.gandi.net/fr/iaas/references/server/hvm

[3] http://wiki.gandi.net/en/iaas/references/disk/advanced-boot

[4] http://cli.gandi.net

[5] https://wiki.gandi.net/fr/iaas/references/server/kernel_changelog?&#section312


A new security vulnerability, CVE-2015-3456, was announced last week. The flaw is in the QEMU virtualization software and allows an attacker to gain access to a vulnerable host from a virtual machine running on that host.

Immediately following this announcement, we applied the necessary patches, reinforcing the security measures we had already implemented. Over the past week, we have continued to study the vulnerability. As a preventive measure, we have decided that a reboot of certain VMs is required in order to ensure that all possible attack vectors have been mitigated.

This preventive reboot will only affect a small proportion of our customers. We will contact affected customers directly via email to provide instructions on performing the reboot on their own.

We will reboot the VMs of affected customers (who have not rebooted on their own) on Monday, May 25 at 11:59 p.m. PDT (that is: Tuesday, May 26, 2015 at 07:59 UTC).

For more information, see the following resources:

If you have questions or encounter any problems regarding this issue, our support team is available to assist you.


Earlier this month we started the process of unifying the SFTP service on Simple Hosting across our three data centers by merging their SFTP keys. Over the next few weeks, we're taking the next step towards a more unified infrastructure with migrations of the SFTP endpoints.

While most customers will not notice any disruption in service, we want to keep you informed of our operations so you can avoid any possible issues.
Here's a schedule:

Datacenter   Endpoint             Date of migration
Baltimore    sftp.dc1.gpaas.net   December 30, 2014
Luxembourg   sftp.dc2.gpaas.net   January 5, 2015
Paris        sftp.dc0.gpaas.net   January 6, 2015

What are the possible issues?

Loss of connectivity

It's possible, though unlikely, that an active SFTP connection will be dropped during the migration. Should that happen, it will have no major consequences: simply reconnecting will be enough to recover.

DNS / Firewall Issues

Since the IP address of each endpoint will change, DNS propagation problems may arise for some customers.

If you are having trouble connecting to the service immediately following the migration window, this may be the cause. Simply waiting for propagation should resolve it.
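To see which address your resolver is currently returning for an endpoint (sftp.dc0.gpaas.net is used below only as an example), you can run:

$ dig +short sftp.dc0.gpaas.net

If the result still shows the old address after the migration, flushing your resolver's cache or simply waiting for the record to expire should be enough.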

Also, if your firewalls or other security systems restrict outbound SFTP traffic to specific IP addresses, you will need to adjust the rules on those systems to allow the new addresses (and disallow the old ones).
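As a rough sketch only: the rule below assumes an iptables-based firewall and SFTP running over the standard SSH port 22, and uses 192.0.2.10 as a placeholder for the new endpoint address (not the real IP):

$ sudo iptables -A OUTPUT -p tcp -d 192.0.2.10 --dport 22 -j ACCEPT

Remember to remove or tighten any rule that still references the old address.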

Feel free to contact Gandi Support if you encounter any issues.


An incident has occurred on one of our storage units in the Parisian datacenter. Our technical team is working to resolve the issue as quickly as possible.

UPDATE - 2:57 AM (UTC): the situation should be back to normal. Feel free to contact our support if you notice anything wrong.


We just suffered a major incident at one of our facilities. A faulty processor caused the shutdown of a storage unit. 

 

As communication with the disks was interrupted, all operations (reboots, changes, etc.) were suspended.

 

We restarted the unit, and all services have begun recovering. Operations were queued and are being executed once again. No data was lost. Everything should be returning to normal.

This incident started at 16:19 CEST (07:19 Pacific time). The system was recovered at 16:57, and all queued operations were fully processed by 17:25 CEST (08:25 Pacific time).


We do apologise for this interruption in service. 

 

As a reminder, you can see the status of our services here:

https://www.gandi.net/servstat 

You can also follow the Gandi NOC's Twitter feed at @gandinoc.

This news feed is available at: https://www.gandi.net/news 


On Tuesday, 7 October, we experienced a series of serious incidents affecting some of the storage units in our Parisian datacenter. These incidents caused two interruptions in service for some of our customers, affecting both Simple Hosting instances and IaaS servers.

The combined effect of these interruptions represents the most serious hosting outage we've had in three years.

First and foremost, we want to apologize. We understand how disruptive this was for many of you, and we want to make it right.

In accordance with our Service Level Agreement, we will be issuing compensation to those whose services were unavailable.

Here's what happened:

On Tuesday, October 7, shortly before 8:00 p.m. Paris time (11:00 a.m. PDT), a storage unit in our Parisian datacenter housing a part of the disks of our IaaS servers and Simple Hosting instances became unresponsive.

At 8:00 p.m., after ruling out the most likely causes, we made the decision to switch to the backup equipment.

At 9:00 p.m., after one hour of importing data, the operation was interrupted, leading to a lengthy investigation that eventually resulted in falling back to the original storage unit. Having determined that the caching equipment was at fault, our team proceeded to replace the disk of the write journal.

At 2:00 a.m., the storage unit whose disk had been replaced was rebooted.

Between 3:00 and 5:30 a.m., the recovery from a 6-hour outage caused a heavy overload, both on the network level and on the storage unit itself. The storage unit became unresponsive, and we were forced to restart the VMs in waves.

At 8:30 a.m., all the VMs and instances were once again functional, with a few exceptions which were handled manually.

We inspected our other storage units that were using the same model of disk, replacing one of them as a precaution.

At 12:30 p.m., we began investigating some slight misbehavior exhibited by the storage unit whose drive we had replaced as a precaution.

At 3:50 p.m., three virtual disks and a dozen VMs became unresponsive. We investigated and identified the cause, and proceeded to update the storage unit while our engineers worked on the fix.

Unfortunately, this update caused an unexpected automatic reboot, causing another interruption for the other Simple Hosting instances and IaaS servers on that storage unit.

By 4:15 p.m., all Simple Hosting instances were functional again, but there were problems remounting IaaS disks. By 5:30 p.m., 80% of the disks were accessible again, with the rest following by 5:45 p.m.

This latter incident lasted about two hours (4:00 to 6:00 p.m.). During this time, all hosting operations (creating, starting, or stopping servers) were queued.

Due to the large number of queued operations, it took until 7:30 p.m. for all of them to complete.

These incidents have seriously impacted the quality of our service, and for this we are truly sorry. We have already begun taking steps to minimize the consequences of such incidents in the future, and are working on tools to more accurately predict the risk of such hardware failures.

We are also working on a customer-facing tool for incident tracking which will be announced in the coming days. 

Thank you for using Gandi, and please accept our sincere apologies. If you have any questions, please do not hesitate to contact us.

The Gandi team


Following an incident on a storage unit, we need to reboot it in order to complete an update that fixes the problem.
All operations will be paused until the unit is running normally again.

In the meantime, please do NOT launch any operation on your server(s). The situation will return to normal shortly.

20:00 CEST, 11:00 Pacific: Incident officially resolved, all operations back to normal. 


An incident has occurred on one of our storage units in the Parisian datacenter. Our technical team is working to resolve the issue as quickly as possible.

Please do not perform any operations on your virtual machines in the meantime. Services should be restored automatically once the issue has been corrected.

We will update this post as new information arises.

Update Tue Oct 7 19:28:19 UTC: Some faulty hardware has been identified; we're in the process of swapping it out.

Update Tue Oct 7 22:33:14 UTC: Our technical team is still trying to fix the issue.

Update Tue Oct  7 23:35:44 UTC: A ZIL disk has failed, and its failover also failed. We're currently performing a manual switchover, and are proceeding very carefully to minimize the risk of data loss.

Most importantly: we understand how disruptive this is for you and we're working as hard as we can to fix it. We will do our best to make it right.

Update Wed Oct 8 00:39:21 UTC: Our technical team is bringing the storage unit back up. The incident is nearly resolved and services are already beginning to come back online.

Update 02:31:10 UTC: We're now seeing high loads on the problematic filer. The investigation continues!

Update 04:05:54 UTC: After working all night, our technical team in Paris has resolved the problem. Services should now be back to normal.

A postmortem and compensation details, as described in our IaaS Hosting Contract (section 2.2), will be provided in the days to come.

Update Thu Oct  9 17:31:34 UTC: A postmortem about this incident is available here.


We will reboot a storage unit on the Paris/FR datacenter tonight.

The maintenance window will start on 3 October at midnight and end at 1:00 a.m. CEST (3:00-4:00 p.m. PDT, 22:00-23:00 UTC). Update: the maintenance window has been extended by 30 minutes and is now expected to end at 1:30 a.m. CEST (4:30 p.m. PDT, 23:30 UTC).

You will not need to reboot your server (IaaS) or instance (PaaS) during this maintenance.

Sorry for the inconvenience.

 

Update: the maintenance ended at 2:00 a.m. CEST. Sorry for the delay.


We need to perform an emergency reboot of one of our storage units.

This emergency maintenance is required to work around a bug.

The maintenance will not impact the data hosted on this storage unit: once the unit is back, the disks will resume their I/O where it left off.

Please do not perform any operations on your VMs in the meantime. Hosting operations will be suspended for the duration of the maintenance.

We apologize for any inconvenience this emergency maintenance may cause.

 

