pstorage-failure-domain - How to configure Parallels Cloud Storage failure domains
/etc/pstorage/location
/etc/pstorage/host_id
pstorage [-c cluster] set-attr [-R] [-p] path failure-domain=value
pstorage [-c cluster] get-attr [-p] path
A failure domain is a set of services that can fail in a correlated manner. Because such a failure takes out every replica inside the domain at once, data replicas must be scattered across different failure domains to keep data available. Some examples of failure domains:
• The most obvious, and the smallest, failure domain is a single disk. Parallels Cloud Storage therefore never places more than 1 replica of the data per disk/CS.
• Another obvious failure domain is a single host running multiple CS services. When the host fails (e.g., on power loss or network disconnect), all its CS services become unavailable at once. The default Parallels Cloud Storage configuration therefore makes sure that a single host never stores more than 1 replica of a chunk (see failure-domain=host below).
• In bigger multi-rack cluster setups, there are additional points of failure such as per-rack switches or per-rack power units. It is therefore important to configure Parallels Cloud Storage to store data replicas across such failure domains, so that a massive correlated failure of a single domain does not make data unavailable. An example of how to configure per-rack failure domains is provided below.
The next section explains how Parallels Cloud Storage allows you to specify cluster services topology and configure failure domains for proper replica allocation. Please pay attention to the NOTES section.
Every Parallels Cloud Storage service component has topology information assigned to it. Topology paths define a logical tree of components' physical locations consisting of 5 identifiers which are referred to below as room.row.rack.host.CS:
        /-- Row -- ...
Room --|                       /-- Host -- ...
       |           /-- Rack --|            /-- CS
        \-- Row --|            \-- Host --|
                   \-- Rack -- ...         \-- CS
The first 3 topology path components (room.row.rack) can be configured by the user via the /etc/pstorage/location configuration file (pstorage-config-files(7)). The last 2 components (host.CS) are generated automatically and should not be modified by the user. host is a unique, randomly generated host identifier created during installation and stored in /etc/pstorage/host_id (pstorage-config-files(7)). CS is a unique service identifier generated upon CS creation (pstorage-make-cs(1)).
Note that once a host has started running services in the cluster, its topology cannot be changed without fully recreating the services, as described below.
To view the current service topology and the available disk space per location, use the top (press i) or stat commands (pstorage-stat(1)).
Based on the above topology information, failure domains for proper file replica allocation can be defined with the pstorage-set-attr(1) utility. This command configures failure domains based on the above levels of hierarchy, i.e. room, row, rack, host and disk (CS):
• pstorage set-attr -R -p / failure-domain=disk - place no more than 1 replica per disk/CS
• pstorage set-attr -R -p / failure-domain=host - place no more than 1 replica per host (default)
• pstorage set-attr -R -p / failure-domain=rack - place no more than 1 replica per RACK
• pstorage set-attr -R -p / failure-domain=row - place no more than 1 replica per ROW
• pstorage set-attr -R -p / failure-domain=room - place no more than 1 replica per ROOM
We recommend using the same configuration for ALL cluster files, as this simplifies analysis and is less error-prone.
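The relation between a failure-domain level and the topology path can be illustrated with a small sketch. It assumes that each replica location is written as a dotted room.row.rack.host.CS path as described above. The function names and the sample paths are hypothetical and are not part of any pstorage tool.

```python
from collections import Counter

# Number of leading path components that identify a domain at each level
# (paths follow the room.row.rack.host.CS layout described above).
LEVEL_DEPTH = {"room": 1, "row": 2, "rack": 3, "host": 4, "disk": 5}


def domain_of(path: str, level: str) -> str:
    """Return the prefix of a topology path that names its failure domain."""
    parts = path.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 components in {path!r}")
    return ".".join(parts[: LEVEL_DEPTH[level]])


def violates_policy(replica_paths, level: str) -> bool:
    """True if two or more replicas share one failure domain at this level."""
    counts = Counter(domain_of(p, level) for p in replica_paths)
    return any(n > 1 for n in counts.values())


# Hypothetical placement of 3 replicas: two hosts in rack 0.0.1, one in 0.0.2.
replicas = ["0.0.1.hostA.1025", "0.0.1.hostB.1026", "0.0.2.hostC.1027"]
print(violates_policy(replicas, "host"))  # False: three distinct hosts
print(violates_policy(replicas, "rack"))  # True: two replicas share rack 0.0.1
```

A coarser level uses a shorter path prefix, so a placement that satisfies failure-domain=rack also satisfies failure-domain=host, but not the other way round.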
Once a host has started running services in the cluster, its topology is cached in MDS and cannot be changed. This means that even new services created on the host will use the cached information. If the host location still needs to be changed for some reason, use the following procedure:
• Kill and remove the CS/client services running on the host (using the pstorage-rm-cs(1) and umount commands).
• Change /etc/pstorage/host_id to another unique ID (e.g. generated from /dev/urandom).
• Adjust /etc/pstorage/location as required.
• Recreate the services, e.g. mount pstorage and create new CS instances using the pstorage-make-cs(1) command.
• To give the Parallels Cloud Storage allocator and rebalancing mechanisms enough flexibility, it is always recommended to have at least 5 failure domains (hosts, racks, etc.) configured in a production setup.
• At least 3 replicas are recommended for multi-rack setups.
• When a huge domain fails and goes offline, Parallels Cloud Storage does not perform data recovery by default, because recovery could mean replicating a tremendous amount of data, which would take longer than repairing the domain. This behavior is controlled by the global mds.wd.max_offline_cs_hosts configuration parameter (pstorage-config(1)), which sets the number of failed hosts that is still treated as an ordinary failure worth recovering automatically.
• When MDS services are created, the topology and failure domains must be taken into account manually. That is, in multi-rack setups MDSes should be created in different racks (5 MDSes in total).
• Remember that large failure domains are more sensitive to an imbalance in total disk space. For example, if a cluster has 5 racks with 10TB, 20TB, 30TB, 100TB, and 100TB of total disk space, it will not be possible to store (10+20+30+100+100)/3 ≈ 86TB of data in 3 replicas. Only 60TB can be allocated: every chunk needs replicas in 3 different racks, so at most 2 of its replicas can go to the 100TB racks and the third must go to one of the small racks, which together hold only 10+20+30 = 60TB. Once the small racks are exhausted, fewer than 3 domains remain available for allocation, even though the 100TB racks still have free capacity.
• A large capacity imbalance between domains may also lead to a significant difference in load across them. For example, in a cluster with 10 x 300GB HDDs and 3 x 3TB HDDs, most of the data will be stored on the 3TB drives, which can easily become a performance bottleneck: together the 3TB drives deliver about 3.3 times fewer IOPS than the 300GB drives, yet they store 75% of the data and serve 75% of the load.
• Depending on the global parameter mds.alloc.strict_failure_domain (pstorage-config(1)), the domain policy can be strict (default) or advisory. It is strongly recommended NOT to change this parameter unless you are absolutely sure of what you are doing.
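The two capacity notes above can be checked with a short calculation. The sketch below is a simplified model, not the actual Parallels Cloud Storage allocator: it assumes that every chunk needs one replica in each of several distinct domains, that data is spread over drives in proportion to their capacity, and that load follows data. All function names are hypothetical.

```python
def usable_capacity(domain_sizes, replicas):
    """Largest amount of data that fits when each chunk has `replicas` copies,
    every copy in a different domain. A domain holds at most one copy of any
    chunk, so a domain larger than the result can only be partly used."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    sizes = sorted(domain_sizes, reverse=True)
    if len(sizes) < replicas:
        return 0.0
    data = 0.0
    for capped in range(replicas):
        # Suppose the `capped` largest domains each hold one copy of all data;
        # the remaining copies must then fit into the remaining domains.
        data = sum(sizes[capped:]) / (replicas - capped)
        if sizes[capped] <= data:
            break  # no remaining domain is larger than the data set
    return data


def data_share(groups):
    """groups maps a label to (drive_count, drive_size); returns the fraction
    of all data that lands on each group under proportional placement."""
    total = sum(count * size for count, size in groups.values())
    return {label: count * size / total
            for label, (count, size) in groups.items()}


racks_tb = [10, 20, 30, 100, 100]
print(sum(racks_tb) / 3)             # naive estimate, about 86.7
print(usable_capacity(racks_tb, 3))  # 60.0

drives_gb = {"300GB": (10, 300), "3TB": (3, 3000)}
share = data_share(drives_gb)
print(share["3TB"])                  # 0.75
```

Under these assumptions each 3TB drive carries 0.75 / 3 = 25% of the load, while each 300GB drive carries 0.25 / 10 = 2.5%, which is why the large drives become the bottleneck.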
The information on topology and failure domains can be found in certain eventlog messages:
• New CS#1033 at 10.29.1.67:60689 (0.2.2.1e8786f12dcd40b2), tier=0. Records topology paths for the newly created CSes (room.row.rack.host).
• CS#X host <hostid> was already registered with a different topology path. An error event reporting that the host has already been registered at another location. Changing the host location is explained above.
• Failed to allocate replicas in X domain(s), only Y domain(s) available. An error event notifying about the impossibility to allocate new chunks since too few domains are available.
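For log analysis, the topology path can be extracted from the "New CS" message with a regular expression. This is a minimal sketch that assumes the message always has the shape of the single sample shown above; the real eventlog format may vary, and the NewCsEvent type and parse_new_cs function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class NewCsEvent(NamedTuple):
    cs_id: int
    address: str
    room: str
    row: str
    rack: str
    host: str
    tier: int


# Pattern derived only from the sample message above (an assumption).
_NEW_CS = re.compile(r"New CS#(\d+) at (\S+) \(([^)]+)\), tier=(\d+)")


def parse_new_cs(line: str) -> Optional[NewCsEvent]:
    """Parse a 'New CS' event; return None if the line does not match."""
    match = _NEW_CS.search(line)
    if match is None:
        return None
    parts = match.group(3).split(".")
    if len(parts) != 4:  # expect room.row.rack.host
        return None
    room, row, rack, host = parts
    return NewCsEvent(int(match.group(1)), match.group(2),
                      room, row, rack, host, int(match.group(4)))


sample = "New CS#1033 at 10.29.1.67:60689 (0.2.2.1e8786f12dcd40b2), tier=0."
event = parse_new_cs(sample)
print(event.rack, event.host)  # 2 1e8786f12dcd40b2
```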
An example of how to configure a 5-rack setup:
1. Write 0.0.1 to /etc/pstorage/location on all hosts in the first rack, 0.0.2 on all hosts in the second rack, and so on.
2. Create 5 MDSes: 1 on any host in the first rack, 1 on any host in the second rack, and so on.
3. Configure Parallels Cloud Storage to place no more than 1 replica per RACK: pstorage set-attr -R -p / failure-domain=rack
4. Create CS services as usual using the pstorage-make-cs(1) command.
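Steps 1 and 2 can be planned ahead of time with a short script. The sketch below only computes the room.row.rack string for each host and picks one MDS host per rack; it does not touch any files or call pstorage. The host names, the rack inventory, and the function names are hypothetical.

```python
def rack_locations(hosts_by_rack, room=0, row=0):
    """Map each host name to the room.row.rack string for its rack."""
    locations = {}
    for rack, hosts in hosts_by_rack.items():
        for host in hosts:
            locations[host] = f"{room}.{row}.{rack}"
    return locations


def mds_hosts(hosts_by_rack):
    """Pick one host per rack for an MDS (here simply the first one listed)."""
    return [hosts[0] for _, hosts in sorted(hosts_by_rack.items()) if hosts]


# Hypothetical inventory: rack number -> hosts in that rack.
inventory = {
    1: ["node11", "node12"],
    2: ["node21", "node22"],
    3: ["node31"],
    4: ["node41"],
    5: ["node51"],
}

for host, location in sorted(rack_locations(inventory).items()):
    print(f"{host}: {location}")  # value to put in /etc/pstorage/location
print("MDS hosts:", mds_hosts(inventory))
```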
Copyright © 2011-2014, Parallels, Inc. All rights reserved.
pstorage(7), pstorage-overview(7), pstorage-stat(1), pstorage-set-attr(1), pstorage-config-files(7)