Name: \_\_\_\_\_\_\_\_ PID: \_\_\_\_\_\_\_\_ Email: \_\_\_\_\_\_\_\_@ucsd.edu

CSE-291 (Storage Systems)

Spring 2017 Homework #1

  1. Assuming equal capacities, rotational speeds, bit densities, &c, which disk drive do you think would (a) achieve a higher throughput for streaming data, and (b) achieve a lower latency for random reads, one with a 3.5” form factor, or one with a 5.25” form factor? Why? Show any work. Hint: Consider rotational latency, random seek latency, full-seek latency, head-change latency, and transfer latency.
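As a starting point for “show any work,” the latency components in the hint can be estimated numerically. A minimal sketch follows; the 7200 RPM spindle speed and 150 MB/s media rate are illustrative assumptions, not figures from the assignment.

```python
# Back-of-the-envelope helpers for two of the latency components the hint
# lists. All numeric inputs here are hypothetical examples.

def avg_rotational_latency_ms(rpm: float) -> float:
    """On average the head waits half a revolution: (60 / rpm) / 2 seconds."""
    return (60.0 / rpm) / 2.0 * 1000.0

def transfer_latency_ms(bytes_requested: int, mb_per_s: float) -> float:
    """Time to stream bytes_requested at a sustained media rate of mb_per_s MB/s."""
    return bytes_requested / (mb_per_s * 1e6) * 1000.0

print(avg_rotational_latency_ms(7200))      # ≈ 4.17 ms per random access
print(transfer_latency_ms(1 << 20, 150.0))  # ≈ 6.99 ms to stream 1 MiB
```

The same style of estimate extends to seek and head-change latencies once relative figures for the two form factors are assumed.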
  1. Consider the microbenchmarking (Talaga, et al) of disk drives. The benchmarks showed saw-tooth curves. Please explain the following:
     1. What is represented by the coarsest (lowest frequency) saw-tooths?
     1. Why does the frequency (length) of these coarsest (lowest frequency) saw-tooths change?
     1. Why does the distance between the high point of one saw-tooth and the low point of the next saw-tooth represent the rotational latency? In other words, what is happening at the end of one saw-tooth that isn’t happening at the beginning of the next?
     1. Why don’t these saw-tooths show the full-stroke (single seek across all tracks) latency?
     1. The paper shows a staircase pattern illustrating zones. Why do the lengths of the level intervals between the down-steps shrink? In other words, what physical characteristic of the disk explains this?
     1. Given a saw-tooth plot produced according to the paper, and the ability to conduct further experiments, how could you measure the full-stroke latency?
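For reference, the kind of microbenchmark these questions describe can be sketched as a loop that times a small read at each successive offset and records latency versus offset; plotting the samples yields the saw-tooth curve. The device path, read size, and stride below are assumptions, and on Linux `O_DIRECT` would be needed to bypass the page cache for honest timings.

```python
# Minimal sketch of a sequential-offset read microbenchmark (assumed
# parameters; a real run would target a raw device and bypass caching).
import os
import time

def probe(path: str, block: int = 4096, count: int = 1024, stride: int = 4096):
    """Time one small read at each offset; returns (offset, seconds) samples."""
    fd = os.open(path, os.O_RDONLY)
    samples = []
    try:
        for i in range(count):
            offset = i * stride
            t0 = time.perf_counter()
            os.pread(fd, block, offset)
            samples.append((offset, time.perf_counter() - t0))
    finally:
        os.close(fd)
    return samples  # plot latency vs. offset to see the saw-tooth pattern
```

The track and rotational structure the questions ask about shows up in the plot of these samples, not in any single measurement.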
  1. When will solid state storage replace mechanical disk drives for bulk storage in data centers? Why? Hint: Don’t answer in terms of years. Answer in terms of relative characteristics of the storage types.
  1. The paper that serves as the basis for your first project (Howie et al) proposes a methodology for modelling the performance characteristics of Flash-based storage. (a) Why would this model be inappropriate for mechanical disk drives? (b) Why is it appropriate for Flash-based storage? (c) What important characteristic(s) of Flash-based storage do you think aren’t modelled and account for a significant portion of the model’s error?
  1. Why do UNIX systems maintain the open file information in a global data structure rather than maintaining it as part of the process state, as is presently done with file descriptors?
  1. What is the advantage of “unified cache” over separate VM and file system caches? What prevents its universal adoption?
  1. In what cases should file system block caches be shared across file systems (global) versus per file system? Why?
  1. What factor moved general purpose file systems toward journaling metadata?
  1. (a) What are the key advantages of in-kernel file systems over above kernel file systems? (b) Vice versa?
  1. HDFS, Mogile, and many other file systems are implemented above kernel, without any kernel support. By comparison, what are the advantages and disadvantages of using FUSE to support user-level file systems?
  1. Why do user-level file systems persist when so many other aspects of the microkernel era have faded away? In other words, why isn’t there a prohibitively high performance penalty for many above-kernel file systems?
  1. How would the performance of Coda change if the servers of a VSG/AVSG were distributed globally versus located nearby?
  1. (a) What might cause concurrent timestamps (CVVs) for different instances of the same file in Coda? (b) How would the system correct this?
  1. Why doesn’t Haystack cache items requested by a CDN?
  1. In describing the need for Haystack, I mentioned the long tail of requests for Facebook images. (a) What is meant by “long tail” in this case? (b) What impact does it have on the cacheability of requests?
  1. Why is managing metadata so important to Haystack, but of negligible importance to Hadoop, or even, (admittedly) to a much lesser extent, to a general purpose file system?
  1. Why is it that, even though Haystack’s metadata is stored only in memory, checkpointing isn’t continuous, and logs are asynchronous, in the event of failure, metadata ahead of the log isn’t lost or particularly inefficient to recover?
  1. Consider Hadoop’s design. Why is the data stored by Datanodes maintained with redundancy via replication, while this is not necessarily/always the case for Namenodes?
  1. Is it true that, because HDFS allows writes only at the tail of files, it needs no concurrency control among writers? If not, explain what concurrency control is needed among writers and why.