Name: ____________________  Andrew ID: ____________________

CSE 291 SAMPLE Final Exam

Spring 2016 (Kesden)

Data-Intensive Scalable Computing (DISC)

  1. Please explain the relationship of each of the following file system attributes to the performance of Hadoop:
     a. Block size
     b. Location awareness
     c. Block-level replication
     d. Append-only writes
     e. Random reads
  2. Hadoop uses both local and distributed file systems. Why? Specifically, why doesn’t it utilize only a local file system? Or only a distributed file system?
  3. Why does Hadoop sort records en route to a Reducer? How would it affect things if these records were processed by the Reducer in the order in which they were received from the various Mappers?
  4. During which phase, map or reduce, is parallelism most greatly exploited? Why?

Distributed Computing Models

  5. If processor allocation is optimal, is it possible that migration will subsequently improve system performance? If not, why not? If so, how?
  6. What types of tasks would you expect to benefit from Hadoop over OpenMPI? Why?
  7. When is a shared memory programming model typically a better approach than message passing? Why? Is anything special required or acutely helpful? Why?

Distributed File Systems

  8. Write-write conflicts are very challenging in many distributed data structures and data stores. How does AFS manage these conflicts?
  9. Why do file systems, such as AFS, often have it relatively easy w.r.t. write-write conflicts as compared to other distributed applications? Why do systems such as HDFS and Mogile have it even easier?

Virtual Machines

  10. In a broad array of distributed systems applications, the delivery of services via virtual machines offers a number of advantages over the direct use of real-world hardware. Considering only such applications, briefly describe and explain what, in your view, are the three most significant advantages.
  11. Consider computation, disk I/O, network I/O, human input such as via keyboards and mice, and GPU computation. Please rank these in terms of the penalty encountered through virtualization on same-architecture virtual machines (consider VMware or KVM, if you’d like). Then, please explain the underlying reasons that each type of function experiences the relative VM penalty that it does.

Cryptography, Cryptographic Protocols, Privacy, and Authentication

  12. In class, we discussed Tor (“The Onion Routing”) anonymous routing. Tor is presently used only by an extremely small community of Internet users. Imagine that the existing system were suddenly adopted by a sizable portion of the Internet. Which aspects would readily scale up? Which aspects would be problematic at scale? Why?
  13. Consider those aspects of Tor that you identified as problematic at scale. How could you redesign their implementation for relative efficiency at Internet scale? (Hint: This is a distributed systems “design” problem.)
  14. What is the biggest challenge in the enterprise-wide or global use of public key cryptography? Why?
  15. Consider the weakness you identified in question #14. Please identify and explain two ways in which it is addressed in well-known systems, as well as the intrinsic weaknesses of these approaches.