**Part 2:** Compute the Google Page Rank for a network of 100 pages within ''www.unh.edu''.
**(a)** Go to the website [[http://www.mathworks.com/moler/chapters.html|Numerical Computing with Matlab]]
by Cleve Moler and download the ''surfer.m'' program. Set ''N=100'' and run the command
''[U,G] = surfer('http://www.unh.edu',N);''. This might take a few minutes.

**(b)** After some time the program will generate a vector ''U'' of websites and a matrix ''G'' of links.
All the entries of ''G'' should be either 0 or 1. But ''G'' has ''N^2'' = 10,000 entries,
so you can't check this by eye. Write a short piece of Matlab code that double-checks that all entries of
''G'' are either 0 or 1.

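One minimal way to do this check (just a sketch; any equivalent test is fine) uses logical indexing and ''all'':

<code>
% true only if every entry of G equals 0 or 1
if all( G(:) == 0 | G(:) == 1 )
    disp('all entries of G are 0 or 1')
else
    disp('some entry of G is neither 0 nor 1')
end
</code>
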
**%%(c)%%** Now generate an ''H'' matrix as described in problem 1. You will probably run into a problem
here in part 2, namely that a webpage might contain no links, or only links outside the set of pages we are
working with. In that case we will just assume that the person randomly selects a new website among all the
''N'' pages in our set. Revise your calculation of ''H'' so that if a column of ''G'' sums to zero, then each
of the entries in the same column of ''H'' is ''1/N''.

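As a rough sketch (assuming, as in problem 1, that ''H'' is built by dividing each column of ''G'' by its
column sum), the zero-sum columns can be handled like this:

<code>
H = zeros(N,N);
colsum = sum(G,1);                   % column sums of G (out-links of each page)
for j = 1:N
    if colsum(j) == 0
        H(:,j) = 1/N;                % no usable links: jump to any of the N pages
    else
        H(:,j) = G(:,j)/colsum(j);   % otherwise follow a random link on page j
    end
end
</code>
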
**(d)** Suppose that instead of always clicking on a random link within a page, a web surfer sometimes
(with probability alpha) jumps to a random page chosen from all the pages in the network. We can
model this behavior with a new transition matrix ''T'' given by

''T = (1-alpha) H + alpha **1**/N;''

where **1** is a matrix of ones. Rumour has it that Google uses alpha = 0.15, so use that value
and calculate ''T''.

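In Matlab this is a one-line computation; for example (using ''ones(N,N)'' for the matrix of ones):

<code>
alpha = 0.15;
T = (1-alpha)*H + alpha*ones(N,N)/N;
</code>
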
**(e)** Double-check that the sum of each column of ''T'' is 1. Again, be clever and get Matlab to do
the work, rather than printing out the sum of each column and verifying manually that each of the
''N'' numbers is 1!

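For instance, a short check along the lines of part (b), with a small tolerance for roundoff error, might
look like this:

<code>
colsums = sum(T,1);                  % row vector containing the N column sums
if all( abs(colsums - 1) < 1e-12 )
    disp('every column of T sums to 1')
else
    disp('some column of T does not sum to 1')
end
</code>
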
**(f)** Let's assume we start at the first webpage, ''x=zeros(N,1); x(1)=1;''. If we apply ''T''
to ''x'' many times (say 10^5 times), the resulting vector will give the probability that we end up
on a particular website after a long session of random web surfing. We can do this the slow way
by using the code

<code>
m = 1e05;          % number of applications of T
for j = 1:m
    x = T*x;       % one step of random surfing
end
</code>

or in a faster, equivalent way

<code>
m = 1e05;
x = (T^m)*x;       % m applications of T via one matrix power
</code>


After that is complete, the values of ''x'' correspond to the rank of the particular websites as
ordered in ''U''. Look up the ''sort'' command in the Help menu and find a way to print the list of
websites in order of their rank. Turn in the list of the top 10, the fraction of the time a random
web surfer would spend in each one of those 10, and your code to complete the Lab. Note: there is
a faster way to compute the solution for very large matrices, but we will not worry about that here.
Our method works fine until you reach the scale of 8,000-10,000 websites. Don't try it!

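As a sketch of the kind of thing ''sort'' makes possible (assuming ''U'' is the cell array of URLs
returned by surfer.m), the printout might look something like this:

<code>
[xsorted, idx] = sort(x, 'descend');     % largest rank first; idx is the permutation
for j = 1:10
    fprintf('%8.5f   %s\n', xsorted(j), U{idx(j)});
end
</code>
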
**(g)** If you are interested: our calculation involved two free parameters: the probability alpha
of jumping to an entirely random page in the network, and ''m'', the number that specifies the
length of our long session of random web surfing. How robust is the page rank algorithm with
respect to these two numbers? If you change alpha to 0.1 or 0.05, do the top 10 pages change?
How about if you change ''m'' to 10^4?

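One way to experiment, sketched under the assumption that ''H'', ''N'', and ''U'' are still in your
workspace, is to loop over a few values of alpha and recompute the ranking each time:

<code>
for alpha = [0.15 0.10 0.05]
    T = (1-alpha)*H + alpha*ones(N,N)/N;
    x = zeros(N,1);  x(1) = 1;
    x = (T^1e5)*x;                       % long session of random surfing
    [~, idx] = sort(x, 'descend');
    fprintf('alpha = %4.2f   top page: %s\n', alpha, U{idx(1)});
end
</code>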