Solaris Cluster overview

A cluster is traditionally a collection of computers working together as one unit, for example to solve a large computational problem. However, that is only one use for a cluster. A cluster can also act as a safeguard against hardware and software failures.

Each individual computer in the cluster may in turn consist of SMP boards, that is, a local bus with several CPUs.

Clusters can be connected to other clusters to build a super cluster. In that case the individual clusters are usually located in different geographic areas.

Hardware protection is achieved by duplicating various units so that they can take over each other's tasks in case of an error. This is called hardware failover, and there are many ways of doing it. One is simply to run the units in parallel with a round-robin task distribution scheme: a failing unit is skipped, and the task currently running on that unit is lost. In very critical operations, several units solve the same task and the majority result wins. Other task distribution schemes are more or less integrated into the hardware; round robin is sometimes too simplistic, so priority lists and statistics are added to distribute the hardware resources.

The hardware is interconnected via local buses and backplanes. Backplanes are clumsy, so they are replaced by SCSI cabling or fibre optics for speed, or in some cases 100 Mbit/s Ethernet. Between the cluster hosts, called nodes, there are also special interconnections called heartbeat cables, a very important part of the cluster functionality in case of node failure and failover situations. The heartbeat basically sends a "ping": an information packet with status and synchronisation information as well as crash reports.

Software protection is somewhat different. It works on top of an operating system that is extended with virtual memory, virtual devices and virtual storage, which in turn is split into virtual machines. The software is often called plex or just cluster software. The cluster software consists of a set of daemons and configuration files, and it needs common storage and memory for cluster data and transactions. This common area is called the quorum. We also have a special quorum disk shared between the cluster node members. This disk contains all virtual devices in the cluster, and it is these devices that the operating system must refer to; if not, we would lose the virtual-machine concept of failover. These devices are called DID (disk ID) devices. All nodes then have the same idea of device names, regardless of where the devices are and what topology they refer to.

Usually cluster software will not sit on single disk slices; we want something more robust.

If, for example, the quorum disk (called the global device) sits on a single slice and fails, we have a single point of failure. In our system we use Veritas file systems. Veritas, like many others, including Solaris' own file systems, can run different fault-protection setups. It is also not a good idea to keep common information stored inside the nodes, since it would be lost on node failure; we want to store it in a separate unit, accessible to all nodes in the cluster and to others outside the cluster. Disks and resources private to the cluster are called private, and the others are called public. The name global devices is a little strange, but it denotes a global definition of private resources in the cluster. The external storage is called a disk pack and is basically a huge SCSI backplane with many SCSI disks and a controller.

This controller is connected to the nodes through SCSI cables or, more commonly today, several fibre-optic cables, one to each node. Disk packs and nodes are usually within a 500-metre radius, but the distance can be extended much further with adaptors to the campus backbone network.

It is common to have the entire system spread out over a town if it is critical.

The storage media are prepared in different ways, with RAID levels or slices (partitions). The idea of virtual devices will be explained a little later.

First we have Software RAID and Hardware RAID.

Software RAID is very practical and can be implemented on any machine, but it is said to be less robust than hardware RAID, and it consumes more system resources. There are also problems with booting from software RAID: you need special kernel support to boot from software RAID levels. The boot disk in Unix is known as the root disk.

Hardware RAID is usually implemented by a special disk controller that supports different RAID levels in hardware. It is totally transparent to the operating environment. The disk packs have hardware RAID controllers inside. There is no problem booting from hardware RAID as long as the kernel supports the RAID controller at boot, and hardware RAID does not consume noticeable system resources.

Storage methods

  • Plain disks/slices (needed for bootstrap on some systems)
  • Disk/slice mirroring (RAID-1), very popular
  • Disk/slice striping (RAID-0): dangerous, you can lose lots of data on a single disk failure!
  • Disk/slice striping with parity (RAID-5), very popular

Other RAID levels, such as RAID 2-4, are seldom used.

Combinations of RAID-1 and RAID-5 are much more common in bigger systems.

For example, you start with a RAID-5 disk set and then mirror it against another, similar RAID-5 set.

RAID-0 is a method of storing data in stripes over the involved disks, simply to sum up the space from many disks and/or slices. If we have a disk/slice failure, the entire storage can be lost, or at least as much information as was on that disk plus roughly 10% extra. You almost always mirror RAID-0 disks and thereby get RAID 0+1.
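As a minimal sketch of how such a stripe can be created with Solaris Volume Manager (the volume manager discussed later in this document), assuming two hypothetical slices c1t0d0s0 and c2t0d0s0:

    # Create a striped (RAID-0) metadevice d10: one stripe made of two slices
    metainit d10 1 2 c1t0d0s0 c2t0d0s0

The "1 2" arguments mean one stripe built from two components; data is then interlaced across both slices.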

RAID-1 stores the data in parallel on all disks in the mirror set, which starts at two disks/slices and can contain more. You lose some speed when writing to the mirror set but gain speed when reading.
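A hedged sketch of a two-way mirror with Solaris Volume Manager (the metadevice and slice names are hypothetical); the usual procedure is to create two submirrors and attach the second one after the mirror exists:

    # Create two one-slice submirrors
    metainit d11 1 1 c1t0d0s0
    metainit d12 1 1 c2t0d0s0
    # Create the mirror d10 with d11 as its first submirror
    metainit d10 -m d11
    # Attach the second submirror; it is synchronised in the background
    metattach d10 d12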

RAID-5 is similar to RAID-0, with a difference: it uses parity, that is, redundant data. This redundant data takes some space, so you need three or more slices/disks. RAID-5 is faster than RAID-1, but the extra time for calculating the parity data slows it down a little.
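A minimal Solaris Volume Manager sketch of a RAID-5 volume, assuming three hypothetical slices of equal size:

    # Create a RAID-5 metadevice d20 from three slices; parity is spread across all of them
    metainit d20 -r c1t0d0s0 c2t0d0s0 c3t0d0s0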

It is very important to remember that you can combine RAID levels with each other. The disk set created by the RAID then appears as a device that can be mounted.

We now move up to a higher level of dealing with disks. The disk set that makes up one RAID level is also called a physical volume group. This volume group forms a container into which you can add and remove physical disks and slices. Depending on the RAID level of the group, those disks/slices must follow certain criteria. In a mirrored volume group, all disks/slices must be of the same size, otherwise space is wasted. In a striped RAID-0 set they can be of various sizes. In striped sets with any form of parity and redundant data, they must be of the same size. A disk set is also known as a disk group, or simply a disk set.

In a logical volume group one can add and remove disks, and the logical volume group can consist of various types of physical volume groups. Logical volume groups can be built of other logical volume groups, and so on, which makes this very flexible. It does not stop there: it is also possible to have a growing file system with no limit set from the beginning; it is set to grow when reaching a threshold level, onto spare disks or spare slices allocated for the logical file system. Conversely, it is also possible to shrink a file system after defragmentation. It is also possible to merge one existing volume group with another to let them grow together, as well as to let one volume group mirror another.
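As a hedged example of growing storage with Solaris Volume Manager and UFS (the metadevice, slice and mount point names are hypothetical):

    # Attach another slice to the existing volume d10 to add space
    metattach d10 c3t0d0s0
    # Grow the mounted UFS file system on d10 into the new space
    growfs -M /data /dev/md/rdsk/d10

Note that a UFS file system grown this way cannot be shrunk again; the shrinking described above depends on the volume manager and file system in use.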

If a disk failure or disk-controller failure occurs, the data is restored in real time if mirroring or RAID-5 is used in the volume group. Depending on how much related data is lost, it might not be possible to restore everything: in RAID-5 roughly 75% can be restored in a total crash, and in a mirror close to 100%, depending on the mirror set of disks. With real-time data restoration it is necessary to have enough disk space, so you always keep some spare disks allocated to the disk set. If there is not enough disk space or there are no spare disks, a disk has to be exchanged manually. When the system accepts the new disk, data is restored and migrated onto it. Depending on the RAID controller it may be possible to do this without unmounting the volume group; this behaviour is called "hot swapping". Another important matter is the behaviour of the SCSI buses and controllers: one SCSI bus controller can take between 8 and 32 SCSI "hosts", namely disks, per bus.
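A hedged sketch of pre-allocating spare disks with Solaris Volume Manager hot spare pools (the pool, slice and volume names are hypothetical):

    # Create a hot spare pool containing one spare slice
    metainit hsp001 c3t1d0s0
    # Add a further spare slice to the pool
    metahs -a hsp001 c3t2d0s0
    # Associate the pool with a RAID-5 volume (or a submirror);
    # a spare is pulled in automatically when a component fails
    metaparam -h hsp001 d20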

Volume groups are mounted into the Unix file tree like any other file system.
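For example, assuming a Solaris Volume Manager metadevice d10 and a hypothetical mount point /data, the volume is treated like any ordinary slice:

    # Create a UFS file system on the metadevice
    newfs /dev/md/rdsk/d10
    # Mount it into the Unix file tree
    mkdir -p /data
    mount /dev/md/dsk/d10 /data

An entry in /etc/vfstab makes the mount persistent across reboots.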

Sun cluster data services

  • Oracle
  • iPlanet Web Server
  • iPlanet Directory Server
  • Apache
  • Domain Name Service (DNS)
  • Network File System (NFS)
  • Oracle Parallel Server/Real Application Clusters
  • SAP
  • Sybase ASE
  • BroadVision One-To-One Enterprise
  • NetBackup

With Sun™ Cluster 3.0 software, you can use two volume managers: VERITAS Volume Manager (VxVM) software, and Sun's Solaris™ Volume Manager software, which was previously called Solstice DiskSuite™ software.

Traditionally, VxVM has been the volume manager of choice for shared storage in enterprise-level configurations. In this Sun BluePrints™ OnLine article, we describe a free and easy-to-use alternative, Solaris Volume Manager software, which is part of the Solaris™ 9 Operating Environment (Solaris 9 OE). This mature product offers similar functionality to VxVM. Moreover, it is tightly integrated into the Sun Cluster 3.0 software framework and, therefore, should be considered to be the volume manager of choice for shared storage in this environment. It should be noted that Solaris Volume Manager software cannot be used to provide volume management for Oracle RAC/OPS clusters.

To support our recommendation to use Solaris Volume Manager software, we present the following topics:

  • "Using Solaris Volume Manager Software With Sun Cluster 3.0 Framework" on page2 explains how Solaris Volume Manager software functions in a Sun Cluster 3.0 environment.
  • "Configuring Solaris Volume Manager Software in the Sun Cluster 3.0 Environment" on page10 provides a run book and a reference implementation for creating disksets and volumes (metadevices)1 in a Sun Cluster 3.0 framework.
  • "Advantages of Using Solaris Volume Manager Software in a Sun Cluster 3.0 Environment" on page15 summarizes the advantages of using Solaris Volume Manager software for shared storage in a Sun Cluster 3.0 environment.

NOTE

The recommendations presented in this article are based on the use of the Solaris 9 OE and Sun Cluster 3.0 update 3 software.

Using Solaris Volume Manager Software With Sun Cluster 3.0 Framework

Before we present our reference configuration, we describe some concepts to help you understand how Solaris Volume Manager software functions in a Sun Cluster 3.0 environment. Specifically, we focus on the following topics:

  • Sun Cluster software's use of DID (disk ID) devices to provide a unique and consistent device tree on all cluster nodes.
  • Solaris Volume Manager software's use of disksets, which enable disks and volumes to be shared among different nodes, and the diskset's representation in the cluster called a device group.
  • The use of mediators to enhance the tight replica quorum rule of Solaris Volume Manager software (which is different from the cluster quorum), and to allow clusters to operate in the event of specific multiple failures (see the command sketch after this list).
  • The use of soft partitions and the mdmonitord daemon with Solaris Volume Manager software. While these components are not related to the software's use in a Sun Cluster environment, they should be considered part of any good configuration (also illustrated in the sketch after this list).
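As a hedged sketch of the mediator and soft partition features (the diskset, host and volume names are hypothetical):

    # Add the two cluster nodes as mediator hosts for a diskset
    metaset -s webds -a -m nodeA nodeB
    # Create a 1 Gbyte soft partition d100 on top of an existing volume d0 in the diskset
    metainit -s webds d100 -p d0 1g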

Using DID Names to Ensure Device Path Consistency

With Sun Cluster 3.0 software, it is not necessary to have an identical hardware configuration on all nodes. However, different configurations may lead to different logical Solaris OE names on each node. Consider a cluster where one node has a storage array attached on a host bus adapter (HBA) in the first peripheral component interconnect (PCI) slot. On the other node, the array is attached to an HBA in the second slot. A shared disk on target 30 may end up being referred to as /dev/rdsk/c1t30d0 on the first node and as /dev/rdsk/c2t30d0 on the other node. In this case, the physical Solaris OE device path is different on each node and it is likely that the major-minor number combination is different, as well.

In a non-clustered environment, Solaris Volume Manager software uses the logical Solaris OE names as building blocks for volumes. However, in a clustered environment, the volume definitions are accessible on all the nodes and should, therefore, be consistent; the name and the major/minor numbers should be consistent across all the nodes. Sun Cluster software provides a framework of consistent and unique disk names and major/minor number combinations. Such names are created when you install the cluster and they are referred to as DID names. They can be found in /dev/did/rdsk and /dev/did/dsk and are automatically synchronized on the cluster nodes such that the names and the major/minor numbers are consistent between nodes. Sun Cluster 3.0 uses the device ID of the disks to guarantee that the same name exists for a given disk in the cluster.

Always use DID names when referring to disk drives to create disksets and volumes with Solaris Volume Manager software in a Sun Cluster 3.0 environment.
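For example (a hedged sketch; the DID numbering and output will differ on a real cluster), the DID devices can be listed and then used when adding drives to a diskset:

    # List all DID instances and the per-node logical Solaris device paths they map to
    scdidadm -L
    # Refer to the drive by its DID name, not by a cXtYdZ name, when adding it to a diskset
    metaset -s webds -a /dev/did/rdsk/d4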

Using Disksets to Share Disks and Volumes Among Nodes

Disksets, which are a component of Solaris Volume Manager, are used to store the data within a Sun Cluster environment.

On all nodes, local state database replicas must be created. These local state database replicas contain configuration information for locally created volumes, for example, volumes that are part of the mirrors on the boot disk. Local state database replicas also contain information about disksets that are created in the cluster: the name of the set, the names of the hosts that can own the set, the disks in it and whether they have a replica on them and, if configured, the mediator hosts. This is a major difference between Solaris Volume Manager software and VxVM, because in VxVM each diskgroup is self-contained: each disk within the group records the group to which it belongs and the host that currently owns the group. If the last disk in a VxVM diskgroup is deleted, the group is deleted by definition.
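A hedged sketch of creating the local state database replicas (run on every node; the slice name is hypothetical, typically a small dedicated slice on the boot disk):

    # Create three local state database replicas on a dedicated slice
    metadb -a -f -c 3 c0t0d0s7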

At any one time, a diskset has a single host that has access to it. The node that has access is deemed the owner of the diskset; the action of getting ownership is called "take" and the action of relinquishing ownership is called "release." In VxVM terms, the take/release of a diskset correspond to the import/export of a diskgroup. The owner of a diskset is called the current primary of that diskset. This means that although more nodes can be attached to the diskset and can potentially take it upon failure of the primary node, only one node can effectively do input/output (I/O) to the volumes in the diskset. The term shared storage therefore merits further explanation: the disks are not shared in the sense that all nodes access them simultaneously, but in the sense that different nodes are potential primaries for the set.
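A hedged illustration of take and release (the diskset name is hypothetical); in a running Sun Cluster the framework normally switches disksets between nodes for you, so these commands are mainly useful for manual maintenance:

    # Take ownership of the diskset on this node
    metaset -s webds -t
    # Release ownership so that another node can take the diskset
    metaset -s webds -r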

The creation of disksets involves three steps. First, a diskset receives a name and a primary host. This action creates an entry for the diskset in the local state database of that host. While Solaris Volume Manager allows for a maximum of 8 hosts, Sun Cluster (at this time) only supports up to 4 hosts. The rpc.metad daemon on the first node contacts the rpc.metad daemon on the second host, instructing it to create an entry for the diskset in the second host's local state database.

Now, disks can be added to the diskset. Again, the primary host's rpc.metad daemon contacts the second host so that the local state databases on both nodes contain the same information.
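Putting these steps together, a hedged sketch (the hostnames, diskset name and DID numbers are hypothetical):

    # Step 1: create the diskset and register the hosts that may own it
    metaset -s webds -a -h nodeA nodeB
    # Step 2: add shared drives to the diskset, using their DID names
    metaset -s webds -a /dev/did/rdsk/d4 /dev/did/rdsk/d5
    # Volumes (metadevices) can then be built inside the set, for example a simple mirror
    metainit -s webds d11 1 1 /dev/did/rdsk/d4s0
    metainit -s webds d12 1 1 /dev/did/rdsk/d5s0
    metainit -s webds d10 -m d11
    metattach -s webds d10 d12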

Note that you can add disks from any node that can potentially own the diskset; the request is forwarded (proxied) to the primary node. This is done through the rpc.metacld daemon, which allows you to administer disksets from any cluster node. Neither rpc.metad nor rpc.metacld should be hardened out of a cluster that is using Solaris Volume Manager software, because both are essential to the operation of the Solaris Volume Manager software components.