====== Math 445 Lab 6: Google Page Rank ======
  
**Part 1:** Compute the Google Page Rank for a very small network of links.

**(a)** The following directed graph represents links between eight webpages:
  
Give the Page Rank of each page, and provide a list of the page indices
sorted by their Page Ranks. You can refer to class notes or
[[http://www.ams.org/samplings/feature-column/fcarc-pagerank]]
for details of the Google Page Rank algorithm.
  
Hints:
  * It's good to have a firm grasp on the mathematics before writing the code.
  * Write your code in a script file and run the script from within Matlab.
  * Use ''G = zeros(8,8)'' to allocate an 8 x 8 matrix of zeros, and then insert 1s at the required spots with statements like ''G(3,4) = 1''.
  * Better yet, store the set of nonzero indices as an ''N x 2'' matrix in a file ''links.asc'', load that matrix into Matlab with ''L = load('links.asc')'', and then loop over the ''N'' rows of ''L'', assigning ones into the specified elements of ''G'' with ''G(L(n,2), L(n,1)) = 1'', where ''n'' is the loop index.
  * The transition matrix ''H'' has a ''1/n'' everywhere ''G'' has a 1, where ''n'' is the number of nonzero entries in the same column of ''G''. Instead of calculating ''H'' by hand and typing it in, use Matlab to calculate it from ''G''. Hint: ''s = sum(G)'' gives a 1 x 8 row vector ''s'' whose elements are the sums of the eight columns of ''G''. Then the ''j''th column of ''H'' equals the ''j''th column of ''G'' divided by ''s(j)''; a sketch of the whole construction follows this list.
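
To make the hints concrete, here is a minimal sketch of the whole construction, assuming the link list
is stored one ''(from, to)'' pair per row in ''links.asc'' as suggested above; the variable names and
loop structure are just one reasonable choice.

<code>
% Build the 8 x 8 link matrix G from the link list in links.asc.
% G(i,j) = 1 means page j links to page i.
L = load('links.asc');        % N x 2 matrix of (from, to) page indices
G = zeros(8,8);
for n = 1:size(L,1)
  G(L(n,2), L(n,1)) = 1;
end

% Form the transition matrix H by dividing each column of G by its sum.
s = sum(G);                   % 1 x 8 row vector of column sums
H = zeros(8,8);
for j = 1:8
  H(:,j) = G(:,j) / s(j);     % assumes every page has at least one outgoing link
end
</code>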

**(b)** Suppose that, instead of representing links between web pages, the directed graph tells who likes whom among your circle of friends, and that you use the same algorithm to compute a Likeability Rank, with the same numerical results. Would this give true quantitative values and rankings of how likeable your friends are? Why or why not? What does your answer here say about Google Page Rank as a method of ranking web pages?
  
I don't have a specific answer in mind here. Just be thoughtful.

----

**Part 2:** Compute the Google Page Rank for a network of 100 pages within ''www.unh.edu''.

**(a)** Go to the website [[http://www.mathworks.com/moler/chapters.html|Numerical Computing with Matlab]]
by Cleve Moler and download the ''surfer.m'' program. Set ''N=100'' and run the command
''[U,G] = surfer('http://www.unh.edu',N);''. This might take a few minutes.

**(b)** After some time the program will generate a vector ''U'' of websites and a matrix ''G'' of the
links between them. All the entries of ''G'' should be either 0 or 1. But ''G'' has 100^2 = 10,000
entries, so you can't check this by eye. Write a short piece of Matlab code that double-checks that all
entries of ''G'' are either 0 or 1.
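
One possible way to do the check, sketched here: ''G(:)'' flattens the matrix into a single column, so
one ''all'' covers every entry.

<code>
% Verify that every entry of G is either 0 or 1.
if all(G(:) == 0 | G(:) == 1)
  disp('All entries of G are 0 or 1.')
else
  disp('Some entry of G is neither 0 nor 1!')
end
</code>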

**%%(c)%%** Now generate an ''H'' matrix as described in problem 1. You will probably run into a problem
that didn't occur there: a webpage might contain no links, or only links outside the set of pages we are
working with, so its column of ''G'' sums to zero and the division by the column sum fails. In that case
we will just assume that the surfer randomly selects a new website from among all ''N'' pages in our set.
Revise your calculation of ''H'' so that if a column of ''G'' sums to zero, then each of the entries in
the same column of ''H'' is ''1/N''.
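
Here is a sketch of one way to handle the zero columns, extending the normalization loop from Part 1;
the variable names are suggestions, and ''N = 100'' matches part (a).

<code>
% Form H from G, replacing each zero column with a uniform column of 1/N.
N = 100;
s = sum(G);                   % 1 x N row vector of column sums
H = zeros(N,N);
for j = 1:N
  if s(j) == 0
    H(:,j) = 1/N;             % dangling page: jump to any page with equal probability
  else
    H(:,j) = G(:,j) / s(j);   % usual column normalization
  end
end
</code>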

**(d)** Suppose that, instead of always clicking on a random link within a page, a web surfer sometimes
(with probability alpha) jumps to a random page chosen from all the pages in the network. We can
model this behavior with a new transition matrix ''T'' given by

''T = (1-alpha) H + alpha **1**/N;''

where **1** is a matrix of ones. Rumor has it that Google uses alpha = 0.15, so use that value
and calculate ''T''.
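
In Matlab the N x N matrix of ones is ''ones(N)'', so the formula translates almost verbatim; a
one-line sketch:

<code>
alpha = 0.15;                        % rumored Google value
T = (1-alpha)*H + alpha*ones(N)/N;   % ones(N) is the N x N matrix of ones
</code>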

**(e)** Double-check that the sum of each column of ''T'' is 1. Again, be clever and get Matlab to do
the work, rather than printing out the sum of each column and verifying manually that each of the
''N'' numbers is 1.
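
One way to let Matlab do the work is sketched below. Because of floating-point roundoff the column sums
may differ from 1 in the last few digits, so testing against a small tolerance (the value ''1e-12'' is a
guess, not part of the assignment) is safer than testing for exact equality.

<code>
% Verify that every column of T sums to 1, up to roundoff.
err = max(abs(sum(T) - 1));   % largest deviation of any column sum from 1
if err < 1e-12
  disp('All columns of T sum to 1.')
else
  fprintf('Largest column-sum error is %g\n', err)
end
</code>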

**(f)** Let's assume we start at the first webpage, ''x=zeros(N,1); x(1)=1;''. If we apply ''T''
to ''x'' many times (say 10^5 times), the resulting vector will give the probability that we end up
on a particular website after a long session of random web surfing. We can do this the slow way
with the code

<code>
m = 1e05;
for j = 1:m
  x = T*x;
end
</code>

or a faster, equivalent way

<code>
m = 1e05;
x = (T^m)*x;
</code>

After that is complete, the values of ''x'' give the ranks of the websites as ordered in ''U''. Look up
the ''sort'' command in the Help menu and find a way to print the list of websites in order of their
rank. Turn in the list of the top 10, the fraction of the time a random web surfer would spend at each
of those 10, and your code to complete the lab. Note: there is a faster way to compute the solution for
very large matrices, but we will not worry about that here. Our method works fine until you reach the
scale of 8,000-10,000 websites. Don't try it!
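
Here is a sketch of one way to use ''sort'' for this. It assumes ''U'' is a cell array of URL strings;
if ''surfer.m'' returns a character array instead, index it with ''U(idx(k),:)''.

<code>
% Sort the probabilities in descending order; idx maps back into U.
[p, idx] = sort(x, 'descend');
for k = 1:10
  fprintf('%8.5f  %s\n', p(k), U{idx(k)})
end
</code>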

**(g)** If you are interested: our calculation involved two free parameters, the probability alpha of
jumping to an entirely random page in the network, and ''m'', the number that specifies the length of
our long session of random web surfing. How robust is the Page Rank algorithm with respect to these two
numbers? If you change alpha to 0.1 or 0.05, do the top 10 pages change? How about if you change ''m''
to 10^4?
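
If you want to automate the comparison, a rough sketch of the experiment follows; the particular alpha
values and what gets printed are just suggestions.

<code>
% Recompute the ranking for several values of alpha and print the top page.
for alpha = [0.15 0.1 0.05]
  T = (1-alpha)*H + alpha*ones(N)/N;
  x = zeros(N,1);
  x(1) = 1;
  x = (T^1e5)*x;
  [p, idx] = sort(x, 'descend');
  fprintf('alpha = %.2f, top page: %s\n', alpha, U{idx(1)})
end
</code>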