Practical Session #9 - Heap, Lempel-Ziv

Data Structures, Spring 2013

Heap

Heap / A binary heap can be considered as a complete binary tree,
(the last level is full from the left to a certain point).
Maximum-Heap / For each node x, x ≥ left(x) and x ≥ right(x)
The root in a Maximum-Heap is the maximal element in the heap
Minimum-Heap / For each node x, x ≤ left(x) and x ≤ right(x)
The root in a Minimum-Heap is the minimal element in the heap
Heap-Array / A heap can be represented using an array:

A[1] - the root of the heap (not A[0])
length[A] - number of elements in the array
heap-size[A] - number of elements in the heap represented by the array A
For each index i :

Parent(i) =
Left(i) = 2i
Right(i) = 2i+1

Actions on Heaps /

Insert ~ O(log(n))
Max ~ O(1)
Extract-Max ~ O(log(n))
Build-Heap ~ O(n)
Down-Heapify(H, i) – same as MaxHeapify seen in class but starting in the node in index i ~ O(log(n))
Up-Heapify(H, i) – Fixing the heap going from node in index I to the root as in Insert() ~ O(log(n))

Question 1

Given a maximum heap H,
What is the time complexity of finding the 3 largest keys in H?
What is the time complexity of finding the C largest keys in H? (C is a constant)
What if C is not constant?

Solution:

It takes O(1) to find the 3 maximal keys in H.
The 3 maximal keys in H are:

The root
The root's maximal son y
The largest among y's maximal son and y's sibling

Finding the C largest keys in H is done in a similar way:
The i-th largest key is one of at most i keys, the sons of the i-1 largest key that have not been taken yet, thus can be found in O(i) time.
Finding the C largest keys takes O(C2) = O(1)

(Another solution would be ordering all the keys in the first C levels of the heap)

If C is not constant we have to be more efficient, We will use a new heap, where we will store nodes of the original heap so that for each node entered we will have it sons as put in this new heap but we also preserve access to its sons as in the original heap.

We first enter the root to the new heap.
Then C times we do Extract-Max() on the new heap while at each time we Insert the original sons (the sons as in the original heap) to the new heap.
The logic is as before only this time since the new heap has at most C elements we get O(C*log(C)). (note that we do not change the original heap at any point)

Question 2

Analyze the complexity of finding the i smallest elements in a set of size n using each of the following:

Sorting
Priority Queue

Sorting Solution:

Sort the numbers using algorithm that takes O(nlogn)
Find the i smallest elements in O(i)

Total = O(i+nlogn)=O(nlogn)

Priority Queue Solution:

Build a minimum heap in O(n)
Execute Extract-Min i times in O(ilogn)

Total = O(n+ilogn).
Note: n+ilogn =Θ(max(n,ilogn)) and clearly n+ilogn = O(nlogn) (and faster if i )

Question 3

In a given heap H represented using an array; suggest an algorithm for deleting the i'th indexed element.

Solution:

Delete(H, i)

if (heap_size(H) < i)

error(“heap underflow”)

key  H[i]

H[i] H[heap_size] /*Deleting the i'th element and replacing it with the last element*/

heap_sizeheap_size− 1 /*Effectively shrinking the heap*/

Down-Heapify(H, i)/*Subtree rooted in H[i] is legal heap*/

Up-Heapify(H, i)/*Area above H[i] is legal heap*/

return key

Question 4

You are given two min-heaps H1 and H2 with n1 and n2 keys respectively.
Each element of H1 is smaller than each element of H2.
How can you merge H1 and H2 into one heap H (with n1+n2 keys) using only O(n2) time? (Assume both heaps are represented in arrays of size > n1+n2)

Solution

n1 ≥ n2: Addthe keys of H2 to the array of H1 from index n1+1 to index n1+n2-1.

The resulting array represents a correct heap due to the following reasoning. Number of leaves in H1 is l1, hence if l1is eventhe number of internal nodes is l1-1 and there are 2l1 "openings" for new leaves.

If l1is odd then there are l1 internal nodesand2l1+1≥n2openings

All keys from H2 are added as leaves due to the fact stated in formula (3). As all keys in H1 are smaller than any key in H2, the heap property still will hold in the resulting array.

n1< n2: Build a new heap with all the elements in H1 and H2 in O(n1 + n2) = O(2n2)=O(n2).

Question 5

Give an O(nlogk) time algorithm to merge k sorted arraysA1 .. Ak into one sorted array.n is the total number of elements (you may assume k≤n).

Solution

We will use a min-heap of k triples of the form (d, i,Ad[i]) for 1 ≤ d ≤ k, 1≤ i ≤ n.

The heap will use an array B.

First we build the heap with the first elements listsA1...Ak in O(k) time.
In each step extract the minimal element of the heap and add it at the end of the output Array M.
If the array that the element mentioned in step two came from is not empty, than remove its minimum and add it to our heap.

for d1 to k

B[d]  (d, 1,Ad[1])

Build-Heap(B) /*By the order of Ad [1] */

for j=1 to n

(d, i, x)  Extract-Min(B)

M[j]x

if iAd.length

Heap-Insert(B,(d, i+1, Ad[i+1]))

Worst case time analysis:
Build-Heap : O(k)
Extract-Min : O(log k)
Heap-Insert : O(log k)
Total O(nlogk)

(Question 6)

Suppose that you had to implement a priority queue that only needed to store items whose priorities were integer values between 0 and k. Describe how a priority queue could be implemented so that insert() has complexity O(1) and RemoveMin() has complexity O(k). Be sure to give the pseudo-code for both operations.
Hint: This is very similar to a hash table.

Solution

Use an array A of size k+1 of linked lists. The linked list in the ith slot stores entries with priority i. To insert an item into the priority queue with priority i, append it to the linked list in the ith slot. E.g.,

insert(v)

A[v.priority].add-to-tail(v)

To remove an item from the queue, scan the array of lists, starting at 0 until a non-empty list is found. Remove the first element and return it—the for-loop induces the O(k) worst case complexity. E.g.,

removeMin()

for i  0 to k+1

List L  A[i]

if( L.empty() = false )

v  L.remove-head() /*Removes list's head and returns it = dequeue*/

return v

return NULL

חלק ב' - אלגוריתם Lempel-Ziv:

אלגוריתם לדחיסת נתונים. האלגוריתם מאפשר לשחזר את המידע הדחוס במלואו.משפחה של אלגוריתמים שמקורם בשני אלגוריתמים עיקריים שפותחו על ידי יעקב זיו ואברהם למפל בשנת 1977-8

יעקב זיו / אברהם למפל

אלגוריתמים ששייכים למשפחה זו של אלגוריתמים, מיושמים באפליקציות כגון ZIP ו - GIF image compression.

בתרגול זה נראה את אחת מהגרסאות של האלגוריתם שמקורו באלגוריתם משנת 1978 - LZ78.

הרעיון שעומד בבסיסו של הקידוד הוא: כל מילה בקידוד היא המילה הארוכה ביותר שנראתה עד כה בתוספת של אות אחת.

על מנת לעשות חיפוש יעיל של המילים שכבר קודדו נממש מבנה נתונים עצי שנקרא Trie או Prefix tree (ראו דוגמה בהמשך). העץ יהיה מעין מילון שנבנה בצורה דינמית ומייצג מילים שכבר קודדו בצורה יעילה.

מבנה הנתונים הוא היררכי. לכל צומת יש אבא ורשימה של ילדים. בנוסף כל צומת צריכה לשמור שני שדות: אינדקס – המסמן באיזה שלב הצומת הזאת נוצרה ותו שמחבר את הצומת הזאת לאבא שלו (מידע שמופיע על הצלעות בדוגמה המופיעה מטה).

צלעות שיוצאות מצומת מסוימת מייצגות אותיות שונות ששיכות לא"ב של הטקסט המקודד. השרשור של האותיות במסלול מהשורש לצומת מסוימת מייצגות מילים או רישאות של מילים שקודדו.

לפניכם הפסאודו קוד של אלגוריתם הקידוד המקבל Text כקלט. מומלץ לעבור עליו במקביל עם הדוגמה המופיעה בהמשך.

הפסאודו –קוד של האלגוריתם:

אינדקס=0
צור רשימת זוגות ריקה.
צור שורש של העץ. סמן אותו בערכו של האינדקס.
כל עוד אורך הText > 0:
קדם את האינדקס ב-1.
מצא את הרישא (prefix) הארוכה ביותר של Text שמופיעה בעץ, במסלול כלשהו מהשורש לאיזושהי צומת v1.
אם אורך הטקסט > אורך הרישא
הוסף לצומת v בן חדש, סמן אותו באינדקס, חבר ביניהם בתו שבא אחרי הרישא ב- Text.
קצץ מתחילת הטקסט את הרישא והאות שבאה אחריה.
הוסף לרשימת הזוגות, זוג שמורכב מהאינדקס של צומת v ומהאות שבאה אחרי הרישא (במידה ואין אות כזו, כלומר הגענו לסוף המילה הוסיפו תו '*'2).

הניחו שהטקסט המקורי לא יכיל את התו '*'.

1החיפוש יתבצע על ידי טיול במקביל על הטקסט ועל הצמתים של העץ: מתחילים מהאות הראשונה של הטקסט ומהשורש של העץ. נסמן את האות הנוכחית בטקסט ב- c ואת הצומת הנוכחית בעץ ב- v. מחפשים מבין הבנים של v האם מישהו מחובר אליו באות c. אם כן, מתקדמים אליו (עכשיו הוא v ) ומתקדמים בטקסט לאות הבאה (עכשיו היא c). חוזרים על התהליך עד שלא ניתן להתקדם מצומת v עם האות c. כשמסיימים, מחזירים את הצומת v (זו הצומת שאליה נחבר בן חדש בשלב 4.3.1).

2ראו את השלב האחרון של הדוגמה למטה.

הדגמה של ריצת האלגוריתם על הטקסט הבאB:

בונים עץ התחלתי ששורשו מסומן ב- 0.

מאתחלים את רשימת הזוגות אשר מייצגת את הקידוד להיות רשימה ריקה ואת האינדקס הרץ מאתחלים לאפס.

מקדמים את האינדקס ל-1 (אנו כעת במילה הראשונה שמקודדת). עוברים על הטקסט. מתחילים עם האות הראשונה בטקסט ועם השורש של העץ. לשורש אין בן שמחובר אליו ב- a ולכן עוצרים את החיפוש כאשר v זה השורש והרישא היא מילה ריקה. מוסיפים לצומת v בן חדש המסומן באינדקס 1, אשר מחובר לאבא שלו באות a. מוסיפים לרשימת הזוגות זוג חדש (0,a)– אשר מסמן שהמילה הראשונה מורכבת מהרישא שנמצאת על המסלול מהשורש לצומת המסומנת ב- 0 (מילה ריקה במקרה זה) בתוספת של האות a. מורידים מהטקסט את הרישא ואת האות a וחוזרים עם הטקסט המקוצץ acgacgat לתחילת הלולאה.

מקדמים את האינדקס ל-2. מחפשים את הרישא הארוכה ביותר של הטקסט שנמצאת בעץ –a. הרישא מסתיימת בצומת המסומנת ב-1. מוסיפים לצומת בן חדש עם אינדקס 2 ומחברים אותה לצומת האב שלה עם האות שבאה אחרי הרישא –c. מוסיפים לרשימת הזוגות את הזוג (1,c)– שמסמן שהמילה השנייה מורכבת מהרישא שנמצאת על המסלול מהשורש לצומת 1, כלומר a ותוספת של האות c. מקצצים את המילה, ונשארים עם gacgat.

מקדמים את האינדקס ל-3. מחפשים את הרישא הארוכה ביותר של הטקסט שנמצאת בעץ – מילה ריקה. הרישא מסתיימת בצומת המסומנת ב-0 (בשורש). מוסיפים לצומת בן חדש עם אינדקס 3 ומחברים אותה לצומת האב שלה עם האות שבאה אחרי הרישא –g. מוסיפים לרשימת הזוגות את הזוג (0,g)– שמסמן שהמילה השלישית מורכבת מהרישא שנמצאת על המסלול מהשורש לצומת 0, כלומר מילה ריקה ותוספת של האות g. מקצצים את המילה, ונשארים עם acgat.

מקדמים את האינדקס ל-4. מחפשים את הרישא הארוכה ביותר של הטקסט שנמצאת בעץ –ac. הרישא מסתיימת בצומת המסומנת ב-2. מוסיפים לצומת בן חדש עם אינדקס 4 ומחברים אותה לצומת האב שלה עם האות שבאה אחרי הרישא –g. מוסיפים לרשימת הזוגות את הזוג (2,g)– שמסמן שהמילה הרביעית מורכבת מהרישא שנמצאת על המסלול מהשורש לצומת 2, כלומר ac ותוספת של האות g. מקצצים את המילה, ונשארים עם at.

מקדמים את האינדקס ל-5. מחפשים את הרישא הארוכה ביותר של הטקסט שנמצאת בעץ –a. הרישא מסתיימת בצומת המסומנת ב-1. מוסיפים לצומת בן חדש עם אינדקס 5 ומחברים אותה לצומת האב שלה עם האות שבאה אחרי הרישא –t. מוסיפים לרשימת הזוגות את הזוג (1,t)– שמסמן שהמילה החמישית מורכבת מהרישא שנמצאת על המסלול מהשורש לצומת 1, כלומר a ותוספת של האות t. מקצצים את המילה, ונשארים עם מילה ריקה.

הקידוד הסופי עבור המילה B הינו רשימת הזוגות שאגרנו במשתנה Pairs.

כעת נניח שהטקסט B הוא הטקסט הקודם בתוספת של האות a:

עושים את חמשת השלבים הקודמים. בשלב השישי, מקדמים את האינדקס ל-6. מחפשים את הרישא הארוכה ביותר של הטקסט שנמצאת בעץ –a. הרישא מסתיימת בצומת המסומנת ב-1. הרישא הארוכה ביותר מגיעה לסוף הטקסט ואין איך להאריך אותה יותר באות אחת. ולכן, לא עושים יותר שינויים על העץ. כדי לא לאבד אינפורמציה על הקידוד של המילה הזו אנחנו מוסיפים את הזוג (1,*) לרשימת הזוגות.

** נקודה למחשבה : מה קורה עם נבצע זאת עם הטקסט lovesmedoesnotlovemelovesmedoesnotlovemelovesmedoesnotlovemelovesmedoesnotloveme

שחזור:

ניתן לשחזר את הטקסט המקודד מתוך רשימת הזוגות בלבד בדרך הבאה:

נחזיק מערך. ואינדקס שמתחיל מ1. האיבר ה0 במערך יהיה ריק. עבור כל זוג (i,x), נשרשר למילה שבמקום הi המערך את האות x ונכתוב את התוצאה במקום האינדקס במערך. כמו כן נוסיף את התוצאה לטקסט המשחוזר ונקדם את אינדקס ב1. (במידה ונתקלנו ב* רק נוסיף את המילה לטקסט המשוחזר)

דוגמא: