COMP 440
10/24/00
PHAM
RTC: Jane Goodman
PROFESSOR: Announcement. For the people who still want to take the bonus #2, about programming, for the external merge sort, the due date for this one will be Tuesday next week. Visit the net to see the plan and the guideline. Actually, the problem is not hard, under the conditions that you have a plan and you know how to use the existed routines. If you weak in program, you might have a tough time. The sorting technique like heapsort is already there as well as the mergesort. Your job is to assemble to complete the work. Maybe you need to build some tool to reflect the problem specifications. I will talk more about this later.
I am so busy last weekend I were not able to grade your paper. You might have it back by next week. Okay? So the exam #2 will be released next week.
Remember the symposium of CSUN. We have two people in this class have been accepted. So two representatives for the department, computer science, two of them. So they have a symposium here two from our class. This is 100%, because no other classes have submitted. It will be in November 3, which is a Friday, from 1 to 3 p.m. in the North Valley Room.
So I encourage you to attend the symposium. They have only 15 minutes each. Because this is the first time they present it, I am helping them to do the transparency slides if they wish. And if they want to, they can come up to give a pre-talk to the class. It takes only 10 minutes to 15 minutes. So congratulations to these two students in here. I don't know if they want me to announce the names or not. You can see their names were published everywhere in CSUN right now. So you believe me that your survey has a merit. It doesn't matter if your work is not the NOVice (phonetic) idea, new theorem or whatsoever.
Do you have any questions?
Now, plan for today: I will talk a little bit about bonus. And then I give you the overview of chapter 15 to 20-- whatever. So you will see the large pictures of the database design.
How many people here who plan to do the bonus? Pretty good. Remember that sometimes the bonus give you too small compared with the test. But it will give you tremendous of the background and experience, which is it might help you in the future. There's a big difference between the learning and put the hand on the subject. It's a big one. Yeah. I cannot tell you how important it is. So I don't give you much time. You have only one week from today, but I think this is appropriate time, if you spend one weekend, you get it. But if you, with in the previous courses, thought about programming technique, then you might have a harder time to compose the solution.
In this bonus here, it's not concerned about heap sort or merge sort, in this bonus here, asks you to see that how do we eliminate the number of the rows inside a file. Depend on the way you decide to do the merge sort; the number of I/O will reduce drastically. And I give you some secret, okay? In the company they haven't done this one before. Okay? If you do that, you can tell them, hey, I did this one. They will grab you, first job, right away.
1
So we want to reduce I/O. You know what is I/O means, that means greet the block and write to a page, that is the cost to access the hard drive. Let me give you two hints, a better way to do so.
1
Suppose I have the 8-block page. That means the input pages will be 8. That is 8-way sort. So to sort of pass 0 and pass 1, the textbook say okay, we load 8 of them in the spool. And we do the merge sort in each one of them. And then a pass 1 is to merge them up. Actually the textbook writing so quick, we don't have to do it; we do the merge sort all already. It is already in spool. Everybody see my point of view? We have a spool here. We don't have to merge them. Just go have to sort them, using heap sort. But instead heap sort in one block, we heap sort all the 8 blocks. Do we need extra memory? No. Heap sort never need extra memory, they just swap the elements along this one.
Truly, in the reality, they don't swap the elements, I they swap the pointer. The pointer point the elements. That is an easy one. Each one of them here, they will become a page. This page here is a dash page that means a block be made into one page.
Let's see this one, interesting here. After the pass 0 and 1 we come to the pass 2 and after. We want to do merging. Merge. Now, I assume that we have the 12 pages. So label from 1, 2, 3... 8, 9, 10, 11, 12. Now, normally we take 8 of them out and merge them up to make a new run. To make a super page. So the first 8 pages merge together to get a super page. This super page -- the link of super page will be 8 pages. Each pages is 8 blocks. So this one will be like a 8 to the square of the blocks. Okay? But remember that the computer is going to snap one block at a time. So let's call this easy first to say we have 12 pages here so we use the pages here, so 8 pages. So we have a super page. Now, how many read and write we have in here? It will be 8 read and 8 write because we have 8 pages. So 8 read and 8 write. And the pages only one -- I'm sorry. 8 block per page. It can snatch (phonetic) 8 at a time.
Then this one to become a super page. Let's call that super page is 13. Now we have 9, 10, 11, 12 pages originally, of length 1 and we have another super page with length 8. Now we want to merge the 5 together. So what is the I/O time? The I/O time for the single pages in here will be 4 read, 4 write. The I/O time for the super page we just put it out, it will be 8 read and 8 write. So we add them together. We have 12 read and 12 write. Now that in we finish the job. But look at 12 read and 12 write. Before that we have 8 read, 8 write. Totally we have 8 read, 8 write for the first run. The second run will be 12 read and 12 write. So totally we have 20 read, 20 write, to complete the merging. Everybody see now?
Now, I'm going to give you another way to do it. Instead of doing 8, I do it 4 first. No. I do 5 first. Okay? I do 5 first. Then I do 8 later. Look at it how to work.
So now originally, how many page do I have? Number of pages, I have 12 pages. That original I have it. But this time I take only 5 of them first. And what do I have left in here? 7, right? The reason I left 7 is by the time I finish the first 5, I have extra super page. I can glue them up at the end, I get 8. So this is a merge together. So I will get a super page. What is I/O super page (phonetic) 5. And I label this one as #13. So look at the cost of the first run. The first run, to make this super page here, I have you 5 read, 5 write. All right? Because size 5. I read 5 and I write 5.
1
Now the second run, I load these 8 pages, page 6 to 12, pages and the single pages and the super page I just finished, 13, and I make a merge. So what is the cost? The cost for the 7 single pages, I get 7 read and 7 write. Now this one here, the super page 13, the size is 5, therefore I need 5 read and 5 write. I add them up, so I will get the 12 read and 12 write. So for the over all, I have a 5 read, 5 write for the first run. Then 12 read, 12 write for the second run. So I add them up, I have 17 read and 17 write. How much I save? Everybody see it? I have 20 read, 20 write a normal way. And I just change a little bit, I will get how many in here? 17 read, 17 write. I didn't do anything. I just changed the order. So this one is better.
How do you find a number of the pages you have to be done in the first run. That why easy. You take the number of the pages, you mark 7. You get it right away. How many pages I have here? 12. And then modulo, mod 7, I do 5. You know why I mod 7, I don't mod 8. Because each time I merge them, 8, I produce a new super page. So actually in number of pages in the file will be reduced only 7. Not reduced 8. I pull out 8, I generate the big one, I put it back. Mean the number of file only 7. So if you do mod 7, you be wonderful. Everybody understand now? So it means the same way, one way is a little bit expense -- it's a lot. You can see that 20 read, 20 write and this one is 13 read, 13 write. That mean you save quite a lot. Think about why it happen like that. So --
going to tell you something about philosophy, a story, I read when I was young.
The people, they raise the monkeys to train them to go to the cliff to pick up tea leaves; very hard to do that. Okay? And normally the monkey will receive 12 bananas per day. So the trainer normally give the monkey 8 banana in the morning and then 4 in afternoon. And one day he think that, okay, we should switch around and give the monkey 4 banana in the morning, 8 banana in the afternoon. The monkey go to strike. Interesting story? Think about that. It help me a lot in my life.
This is same way here. Look like this way here, people happy if you eat 4, 5 bananas first and do the big one later. Okay?
That is one of the ideas.
MALE STUDENT: Could you reverse it to be 7 and 5?
PROFESSOR: Before we do 8 and then 4 because 5 is the 5th one they go to. This one we reverse. We do 5 and we do the rest. And the I/O time we will gather (phonetic).
Another one is we -- like this will be interesting for you, about the duplicate.
1
Suppose I have 8 pages. And I have only one output page. So when we merge, supposed to have output page in here. We discuss this one before that, each time the output is filled up, we put to a file. So if we have 8 pages, likely the output will be filled up 8 times. Assume that we don't want to remove duplicates. But we want to remove the duplicate, then 8 pages might be crunched to only 2½. So we write 2, and the output is not filled up, only a half. The question, is should we put that one back, should we prolong that one, etc. So there's another way to do this one here. You take the remainder from the output, you put back to the page #8. You left over. So I'm not going to take it. And then you load in, this time you load in 7 block, below 8, because the 8th already occupied by the unfilled output. So you fill in 7 for the next run. So you fill in this 7 in next run, so only have 8 of them. You can do the whole 8 again. And you are going to have another output file in here. But the question in this 8 in here, we do 7 as the new one and 8 as remaining of previous run, you might get only 3 and a half, for example, so you fill up the output page 3 time and you have a half left. A half left, you are not going to put on the disk, kick it back again. That means this is fit back mechanism. We fit back to the 8. After you fit, we read another 7 again, and try to merge them again. And this time, over here, we will have another file, will be put by the output. Assume that in this 8 here, we get only like one and a half. It crunch so much we only get one and a half. So we fill up one in the output file and then we send to the hard drive and then we return back to the input, the remaining. See, I am not here to take it. Even though we clean them up. We return them up. So this mechanism in here, we cut down the number of reading. Instead of 8 chunk at a time. We cut down to 7 chunks at a time. However, you look at all the output file, super pages, this is very clear cut; you don't have anything a half page, everything is filled up. And check that this one will be faster.
MALE STUDENT: On the last run of the pass, you actually write that one out or leave it in memory for the next pass?
PROFESSOR: This case here? The last one?
MALE STUDENT: Yeah, the last run on the pass.
PROFESSOR: The last run of the pass you have to kick them out. You have to finish the file; nothing to fit them in anymore. So yeah. You are right. One of them is odd. Means it's not evenly filled up. You are right. Thank you, you are quite smart, you get it.
But one of hundred of them is good. So. You have any question? Just try this one over here. You can try this technique, compare with the other technique I talk to you. But instead, you want to run 8 at a time, the first merge, you use what? The number of page-- number of pages and modulo -- you don't modulo 7 anymore, you mod 6. If you mod 6, do the first merge. You don't do the fully 8. You do only this number. We don't know how many. You have 12 pages, you do 6. You do this way here, the last one always 8. The last one is always 8. You know why? Let me say to you this one over here. We have 8 pages in here. We reserve one page for feedback, for the unfilled output. So we have only 7 left. Now each time we merge them, we produce one super page. That means the number of pages in the hard drive reduce to 6. It's not reduced to 7 or 8. That number is why we use a mod 6. Mod 6 to tell you how many pages -- I'm sorry, how many file will be reduced in the hard drive.
MALE STUDENT: Files?
PROFESSOR: Files yes. When you pull it out it will be a file. In the merge. In the merge phase, you pull out from the file. We call page and we produce another file we call super page.
You have any question?
So when you do this run here, finish that run and enjoy, to run with different kind of schematic, and suddenly you see the I/O come out and you like it. All the things how to make a plan what is the tool you need. Subject already on my net. You sit down, you put them together.
You now I'm going to the second part to talk about the database design. And this is very important and pay attention because I give you over all pictures and then the next lecture and after, I will talk about the each chapter, so you might not know why you have to learn that. You have to refer back today's lecture.
Let's see the pictures of the database. Let's cluster the database into different kind of the big pictures. The first thing about the database is how to use it. The people learn how to use it. Suppose Oracle, suppose that is Microaccess (phonetic), you need to know the SQL. You know how to create the table. How to do the query. And how to link them, etc. In one of them that's here is the report. How to report them.
This is a very common thinking of the people about database. Which is, you don't need to be in this class.
1
The second one is we learn about database engine. We talk about the speed. We talk about different technique to reduce I/O. We talk about the plan; this is we a call join plan. Turn out to be we have a bunch of the tables. We want to join them under certain conditions and we will get the result. And which order we want to do; that is join plan. These ones concerned about the optimizer. Optimizer means to redo you say the speed. Another technique in optimizer is to reuse the result of a previous run. That is we call material view. So this is a ballpark of the research.
If you work for Oracle, likely you are going to try to improve the join plan. Minimum they pay for you is about 90K. If you work for the -- I'm sorry, Peoplesoft (phonetic) likely you go to the first blocks here how to use the SQL, you make the report, and you sell that. You sell this to the company who use the database. So that is in one section here.
The third section in here, which is design -- let's see, the proper way is database application design. This is a very important one, in my point of view. This area here, they have most of the job. For example, Bank of America have a group, a team, they want to computerize the transaction. The CEO or the president of the company doesn't know much about database and they don't want to spend the money for the Oracle to do for them, Peoplesoft. Turns out to be the people from the Oracle and Peoplesoft, they know nothing about banking. They know how to set up the table and, etc. But you have to tell them what you want to do. Therefore, the president of company will pick up several people in the bank who know a computer and ask them, okay, how do we computerize? We computerize, doesn't mean we put the record together but we want to do somehow to improve our products, the service to the people. So the people in the bank already know about how that bank operates -- manually or whatsoever. Now they want to computerize using the tool already in the market. Might be Oracle or B2, exactly doesn't matter which way. Not a big difference there is a big difference between using that and making it useful. They have to have database application. They only have one team in a company. The company treat that team like king, queen bee in the company. This one have a right to talk to the Oracle, which Oracle well love to talk to these guys because to tell them, in order to get the job done, get the contract. But this one here is the one who have the power to make the database application. So you might decide which company you want to go to in the future, etc.