SaddleHill

(A SCSI IO Generator)

And

Multi-Initiator co-ordination of STP Resources in a SAS domain.

MASTER PROJECT REPORT

Advisor: Tim Chamillard

Committee Member: Xiaobo Joe Zhou

Committee Member: C. Edward Chow

By

Owen Parry

Table Of Contents

1Introduction

1.1What is SAS

1.2SATA Limitations

1.3Multi-Initiator SAS Topology

1.4Acronyms

1.5Objectives

2Research

2.1Max-Min Fairness

2.2Resource Contention Strategy

2.2.1Polling

2.2.2Signaling

2.2.3Project’s Strategy

3Design and Development

3.1SaddleHill Design

3.1.1Development Tools

3.1.2Interesting Problems and Solutions

3.1.3Application Design

3.2STP Synchronization Algorithm

4Experimental Results

4.1Setup

4.2Average IO Response

4.3Largest Response Time

4.4Synchronization Requests

4.5Time of Resource Ownership

4.6Performance

5Lessons Learnt

6Future Directions

7Conclusion

8References

Abstract

In this project, a SCSI Application client that generates and issues SCSI IOs to target devices in a SAS topology is created. We used this application as a vehicle to develop an algorithm that synchronizes the access tothe STP Resources used by Serial ATA drives in a SAS topology. We examine several Back-Off strategies for use in the synchronization algorithm; how they achieve max-min fairness objective among the systems initiators and the amount of topology traffic they generate. Additionally, the performance of the final algorithm as it relates to the achieved I/O rates for small block transfers is examined.

1Introduction

1.1What is SAS

SCSI (Small Computer Systems Interface) is an I/O system that provides the facilities for data to be transferred and stored between SCSI devices such as hard disk drives, CD-ROM, and tape devices. SCSI devicesareused in the enterprise storage market, where performance, configuration flexibility, and reliability are important requirements.

SCSI has moved from the use of parallel interconnects between devices to a serial interface and transport called SAS (Serial Attached SCSI). This allows for higher data rates (e.g. SAS-1: 300MB/s; SAS-2:600MB/s), larger disk counts, increased data integrity, and more flexibility in topology configuration.

Another competing disk I/O system is Serial ATA. Its’ target is the desktop/consumer market. SATA does not have the same stringent requirements in performance and reliability as does SCSI, the feature set is primitive and it’s architecture is unrefined. These drives do have the advantage in that they are extremely cheap and have large storage capacities; Recent SATA drives have capacities of up to 1 Terabyte.

The advent of SAS has brought the two technologies together. SATA drives can be connected to the SAS delivery system.Communication is facilitated by a tunneling protocol (SATA Tunneled Protocol) that transfers SATA commands, and Data using the native SATA transport mechanisms. Their low cost have made them attractive for use by several vendors, especially for low cost bulk storage and other non-mission critical applications.

Table 1: ATA and SCSI device comparison

SATA / SAS
Performance / RPM: 5400 – 7200
Seek Time: 8.9 - 9.5 ms / RPM: 10K – 15K
Seek Time: 3.2 – 7.4 ms
Reliability / MTBF: 500K hrs / MTBF: 1.2M Hrs
Capacity / 40Gb – 1Tb / 18Gb - 300Gb
Cost / $75 – $300 / $160 - $1400
Multi Initiator Support / Single Host / Multiple Host

1.2SATA Limitations

The SATA architecture assumes the presence and interaction with a single host. This poses a problem in the SAS domain where multiple initiators may communicate with a single device. To combat this problem the SAS specification has created a mutual exclusion mechanism by which the first Initiator to open the device indefinitely gains the mutual exclusion lock or “Affiliation” with the device. All other initiators are subsequently blocked.

1.3Multi-Initiator SAS Topology

Some vendors particularly external storage equipment makers build and deploy SAS systems that utilize multi-initiators (usually 2) in the SAS domain. Such configurations provide redundancy in the case of initiator failures. The configurations used by the vendors are varied and sometimes complex. For our purpose, we shall utilize the basic configuration where the resource limitation exists.

The following is an example of the basic configuration where STP Resource problem is evident. The figure depicts two initiators that communicate via a SAS expander to a single SATA disk drive.

Figure 1: SAS Topology that shares STP Resource of a SATA Disk

1.4Acronyms

SCSI- Small Computer Systems Interface

SAS- Serial Attached SCSI

SATA- Serial AT Attachment

STP- SATA Tunneled Protocol

1.5Objectives

The primary goals of this project are:

  • To create an application that can generate SCSI IOs, primarily of the read and write variety that transfers data of varying sizes to target devices in the SAS topology.
  • To achieve time based max-min fairness in sharing of STP resources among the initiators in the domain.
  • Share resources in a manner such that I/O timeouts are prevented.

The environment for the resource-sharing algorithm is an embedded system where memory and CPU resources are limited and communication mechanisms between initiators are often non-existent. It is therefore very important that the algorithm developed be simple and decentralized.

2Research

2.1Max-Min Fairness

The notion behind max-min fairness as it relates to time is that all the competing sources for a resource are given equal access time to the shared resource. If a source cannot fully utilize the allocated resource time then the residue is distributed among the other competitors who can.

There are several algorithms used to achieve max-min fairness in the use of a resource. Many of these algorithms were developed for wireless networks;however, they can generally be applied our targeted system.

In [2] an algorithm called Time Based Regulator based on the leaky bucket scheme was developed. The TBR algorithm equally distributes the long-term wireless channel occupancy time among the various sources. The tokens used in the algorithm represent a unit of time and are periodically generated for all sources. The sources can only access the resource if they have tokens available. To achieve the notion of fairness, TBR monitors the actual rates of the varying sources and for those that under utilize the resource their rate of token generation is reduced.The difference is divided among those others that can fully utilize the resource. The TBR algorithm is quite shrewd in providing long term max-min fairness among the competing nodes, however it needs to be deployed at a central location where it can monitor the time usage characteristics of the competitors such that it can adjust their token generation rates.

Such an algorithm can be deployed in a SAS expander device but it is not feasible to do so as these devices do not have the either the processing power or the memory capacity to implement such an algorithm. The ideal location for deployment of such an algorithm would be the SAS Initiator device as they have more CPU processing power, and memory resources, albeit still limited. A problem still arises in the fact that expanders do not collect the usage information, and additionally the network is point-to-point so there are no facilities for SAS Initiators snoop the bus and gather such information. Lastly,initiators are deployed in environments where they do not have the capability to distribute the new parameters to other initiators in the domain.

Time-Based max-min fairness was also studied in [1].One of their arguments for achieving max-min time fairness is to fix the transmission time for each node and utilize the inherent fairness of the underlying MAC layer to provide long-term equal access probability to the medium for the system nodes.

Others schemes studied to achieve this objective include reservation systems. In [5] max min fairness in terms of bandwidth allocation is studied. The channel is divided into transmission slots and it was argued that utilizing a simple round robin mechanism by which access to time slot is granted to different flows, fairness based on bandwidth allocation is achieved if the window size is sufficiently large. Inherently such a strategy achieves max min fairness based on time.

2.2Resource Contention Strategy

We take into consideration an idea proposed in [1] for the development of our STP resource sharing algorithm. An important aspect of this is to select a strategy by which the initiators have equal probability in the acquisition of the STP Affiliation.

There is a wealth of methods used to synchronize access to a limited resource. The algorithms fall into two main mechanisms, polling and signaling.

2.2.1Polling

Polling include such strategies as; busy waiting (where a process repeatedly polls until the resource is obtained); and switched spinning (where a process repeatedly checks the availability of the synchronized resource but occasionally releases the CPU to other threads).

2.2.1.1Polling Advantages

The synchronized resource is accessed quickly after being released. This allows for the high utilization of the resource among varying threads.

2.2.1.2Polling Disadvantages

Such mechanisms greedily consume resources such as CPU cycles, and network bandwidth in their polling operations. In situations where there are long waiting times, system performance and efficiencies are significantly impacted as those resources could have been used for useful work.

2.2.2Signaling

The signaling mechanism includes strategies such as blocking (where the thread is unloaded and is waken when the synchronized resource becomes available).

2.2.2.1Signaling Advantages

Decreases the usage of system resources such as CPU cycles, which are relinquished to other processes for useful work. Undoubtedly, system efficiencies and performance remain high.

2.2.2.2Signaling Disadvantages

There are high costs involved in such mechanisms, and include, the cost of, unloading, which include saving processor state and context to memory, and scheduling the current thread, and then rescheduling and reloading (restoring processor context) when signaled. When waiting times are low, the above costs become very wasteful.

Another disadvantage is the potential for a decrease in the efficiencies or utilization of the synchronized resource. Depending on the signaling mechanism and the costs of reloading the thread, there can be an increase in the amount of time the resource idle.

2.2.3Project’s Strategy

The SAS system does not provide the signaling facilities to notify the initiator devices when the STP resourcedare freed. Consequently, a polling mechanism must be used. Strategies such as Busy waiting are not ideal in the SAS environment as this will consume the limited transport and link layer resources and therefore cause starvation of hundreds of other system devices and IO timeouts at the host layer.

We utilize switch-spinning with a randomized adaptive back-off technique to poll the STP resource. The technique allows the SAS initiator to maintain liveliness as other transfers can utilize the SAS links for other work. The technique also provides equal opportunity to access the resource among the Initiators in the system.

Variations of such techniques are used in several technologies. Ethernetuses the Truncated Binary Exponential Back-Off algorithm. In this algorithm, a collision occurrence causes the doubling of the contention window size from which to randomly choose a back-off time. The BEB algorithm aims to achieve long-term fairness but has a significant limitation in that it often fails to provide the same level of fairness in the short term. The problem arises because nodes currently competing for the resource have a lower probability to acquire the resource as their contention window sizes are much larger than a node that recently owned the resource. Logarithmic back-off [3,4] is another randomized back-off technique studied and attempts to deal with the limitations of BEB algorithm. The logarithmic techniques attempt to slow the rate of increase in contention window size after failing an acquisition attempt.

In section 4 we examine several of these back-off techniques.

3Design and Development

3.1SaddleHill Design

Figure 2: SCSI IO System

Figure 2 depicts the SCSI IO system. The application SaddleHill is of the SCSI Application Layer and is composed of a lightweight driver and a user level application that generates IOs and contains the algorithm to manage affiliations.

3.1.1Development Tools

The application was designed using Borland’s Together UML modeling tool. The entire application with the exception of the driver (written in C) is written in C++ utilizing the Trolltech’s QT 4.2.3 C++ framework, on an x86_64 Linux platform. The application’s graphical user interface was built using QT Designer to aid rapid development.

3.1.2Interesting Problems and Solutions

3.1.2.1Use of Objects

The decision to use C++ was of course to take advantage of the object-oriented principles of inheritance, polymorphism, and encapsulation. One of the problems ran into while using objects to encapsulate the SAS Initiator request and reply messages is that additional information such as the virtual table is placed in the memory buffer. Obviously, this is unacceptable as the request or reply buffer contains foreign data and affects the required alignment the initiator needs. This problem was solved by allocating larger than needed buffers, overloading the new operators for the Request and Reply classes. The placement new form was used to create these objects at the end of the request reply frames, the operations of the class then use pointers to perform operations on the data elements located at the front of the frame.

3.1.2.2Memory Alignments

The application is built for a 64-bit system. There is an incompatibility in the alignments used in the long integer data types for the SAS Initiator and the application. The application aligns 64-bit values on the 8-byte boundary. To prevent memory waste these alignments are on 4-byte boundaries. To overcome this problem, when handling data buffers used by the adapter, 64 entries are accessed as a structure with two 32-bit elements.

The SAS Initiator also requires alignments of the Request and Reply Frames in system memory. Luckily, the physical memories are broken up into page boundaries aligned on 4Kb boundaries.

3.1.3Application Design

The SaddleHill application is constructed in four main logical subsystems the detailed design is specified in SaddleHill Software Design Document. They include:

Subsystem or Component / Description
MainWindow / The main window subsystem contains all the graphical widgets necessary to display and accept information to /from the user. This subsystem displays a list of LSI Logic SAS Initiators and SCSI Target devices found on the PCI Bus and SAS topology. It also displays the statistical data of the currently executing test for each device. Lastly, it displays the SaddleHill message logs. It also provides the dialog to accept Test Configuration information.
The GUI makes use of QT’s Model/View architecture, whereby it communicates with the device, statistic, and message models in the lower layers to provide the statistical data, and device information to the user.
To provide its functionality the GUI makes use of the signal and slot mechanism of the QT framework to provide indicators to the SaddleHill core of important events. For example the user pressing the “Start” button.
Management / Stores lists of LSI SAS Initiators and SAS targets found in the system. The Management unit also maintains and manages SaddleHill’s physical memory buffers that are used by the Adapter and ScsiDevice objects as request and IO data buffers; it consequently also does address translation between virtual and physical addresses.
It is also responsible for maintaining the message log buffer. Lastly, it is responsible for disseminating configuration information to the SCSI device objects, and starting and stopping tests at the behest of the user.
IO Engine / The IO Engine encompasses those classes that interact with the SCSI targets. Each target has its own engine. It is responsible for initializing the SCSI targets, generating, sending, and completing SCSI requests. It also maintains test execution statistics for the SCSI target device.
To accomplish these tasks the IO Engine has 3 main threads:
  • The IOGeneration Thread runs continuously, generates requests messages, and places them on an internal queue.
  • There is a Dispatch Thread the pulls the Requests messages off the internal queue and passes them to the HAL layer.
  • IO Completion thread dequeues request messages placed on the internal completion queue and processes the returned data.

Hardware Abstraction Layer / The HAL primarily interacts with the LSI SAS Initiator. It implements the LSI MPI specification and is responsible for initializing the SAS Adapter, converting requests from the IO Engine to the MPI specific format, sending these requests to the initiator port, and handling their responses.
The HAL has a main thread that continually runs and is responsible for accepting and completing the initiators messages.
The lightweight driver is considered an extension of the HAL. The driver is responsible for providing user level access to the systems physical memory and SAS Initiator PCIcontrol registers.
The HAL also implements the algorithm that manages the synchronization of the STP resources for each SATA disk drive in the SAS topology.Figures 3 and 4 depict the operation of the STP Resource synchronization algorithm.