====== Math 445 Lab 6: Google Page Rank ======
  
**Part 1:** Compute the Google Page Rank for a very small network of links.
  
**(a)** The following directed graph represents links between eight webpages:
for details of the Google Page Rank algorithm.
  
Hints
  * It's good to have a firm grasp on the mathematics before writing the code.
  * Write your code in a script file and run the script from within Matlab.
  * Use ''G = zeros(8,8)'' to allocate an 8 x 8 matrix of zeros, and then insert 1s at the required spots with statements like ''G(3,4) = 1''.
  * Better yet, store the set of nonzero indices as an ''N x 2'' matrix in the file ''links.asc'', then load that matrix into Matlab with ''L = load('links.asc')'', and then loop over the ''N'' rows of ''L'' and assign ones into the specified elements of ''G'' with ''G(L(n,2), L(n,1)) = 1'', where ''n'' is the loop index.
  * The transition matrix ''H'' has a ''1/n'' everywhere ''G'' has a 1, where ''n'' is the number of nonzero entries in the same column of ''G''. Instead of calculating ''H'' yourself and typing it in, use Matlab to calculate it from ''G''. Hint: ''s = sum(G)'' will give you a 1 x 8 row vector ''s'' whose elements are the sums of the eight columns of ''G''. Then the ''j''th column of ''H'' equals the ''j''th column of ''G'' divided by ''s(j)''. A sketch of these last two hints appears after this list.
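For example, here is a minimal sketch of those last two hints, assuming the links are stored one ''(from, to)'' pair per row in ''links.asc'' and that every page has at least one outgoing link (the zero-column case comes up in Part 2):

<code>
% Build the 8 x 8 link matrix G from the (from, to) pairs in links.asc.
L = load('links.asc');        % N x 2 matrix, one link per row
G = zeros(8,8);
for n = 1:size(L,1)
    G(L(n,2), L(n,1)) = 1;    % column j of G records the links from page j
end

% Build the transition matrix H by dividing each column of G by its sum.
s = sum(G);                   % 1 x 8 row vector of column sums
H = zeros(8,8);
for j = 1:8
    H(:,j) = G(:,j)/s(j);
end
</code>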
  
**(b)** Suppose that, instead of representing links between web pages, the directed graph tells who likes whom among your circle of friends, and that you use the same algorithm to compute a Likeability Rank, with the same numerical results. Would this give true quantitative values and rankings of how likeable your friends are? Why or why not? What does your answer here say about Google Page Rank as a method of ranking web pages?
  
I don't have a specific answer in mind here. Just be thoughtful.
  
  
----
  
  
**Part 2:** Compute the Google Page Rank for a network of 100 pages within ''www.unh.edu''.

**(a)** Go to the website [[http://www.mathworks.com/moler/chapters.html|Numerical Computing with Matlab]] by Cleve Moler and download the ''surfer.m'' program. Set ''N=100'' and run the command ''[U,G] = surfer('http://www.unh.edu',N);''. This might take a few minutes.

**(b)** After some time, a vector ''U'' of websites and a matrix ''G'' of links will be generated. All the entries of ''G'' should be either 0 or 1. But ''G'' will have 100^2 == 10,000 entries, so you can't check this by eye. Write a short piece of Matlab code that double-checks that all entries of ''G'' are either 0 or 1.
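A minimal sketch of one such check, assuming ''G'' has already been returned by ''surfer.m'':

<code>
% Verify that every entry of the link matrix G is either 0 or 1.
% G(:) flattens G into a single column; all() is true only if the
% condition holds for every entry.
if all(G(:) == 0 | G(:) == 1)
    disp('All entries of G are 0 or 1')
else
    disp('G has entries other than 0 and 1')
end
</code>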

**%%(c)%%** Now generate an ''H'' matrix as described in Part 1. You will probably run into a problem here that did not arise in Part 1: a webpage might contain no links, or only links outside the set of pages we are working with, so the corresponding column of ''G'' sums to zero. In that case we will just assume that the person randomly selects a new website from among all the ''N'' pages in our set. Revise your calculation of ''H'' so that if a column of ''G'' sums to zero, then each of the entries in the same column of ''H'' is ''1/N''.
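For instance, a sketch of one way to handle the zero-sum columns when building ''H'' (assuming ''G'' and ''N'' are already defined):

<code>
% Build the transition matrix H, treating pages with no usable links
% as jumping to any of the N pages with equal probability.
H = zeros(N,N);
s = sum(G);                    % column sums of G
for j = 1:N
    if s(j) == 0
        H(:,j) = 1/N;          % dangling page: uniform probability
    else
        H(:,j) = G(:,j)/s(j);  % divide column j of G by its sum
    end
end
</code>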

**(d)** Suppose that instead of always clicking on a random link within a page, a web surfer sometimes (with probability alpha) jumps to a random page chosen from all the pages in the network. We can model this behavior with a new transition matrix ''T'' given by

''T = (1-alpha) H + alpha **1**/N;''

where **1** is a matrix of ones. Rumour has it that Google uses alpha = 0.15, so use that value and calculate ''T''.
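In Matlab this is a one-liner; a sketch, assuming ''H'' and ''N'' from part %%(c)%%:

<code>
alpha = 0.15;                           % probability of jumping to a random page
T = (1-alpha)*H + alpha*ones(N,N)/N;    % blend link-following with random jumps
</code>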

**(e)** Double-check that the sum of each column of ''T'' is 1. Again, be clever and get Matlab to do the work, rather than listing out the sum of each column and verifying by hand that each of the ''N'' numbers is 1!
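One possible check (note that floating-point round-off means the sums might differ from 1 by a tiny amount):

<code>
% Check that every column of T sums to 1, up to floating-point round-off.
colsums = sum(T);                  % 1 x N row vector of column sums
if all(abs(colsums - 1) < 1e-12)
    disp('Every column of T sums to 1')
else
    disp('Some column of T does not sum to 1')
end
</code>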

**(f)** Let's assume we start at the first webpage, ''x=zeros(N,1); x(1)=1;''. If we apply ''T'' to ''x'' many times (say 10^5 times), the resulting vector will give the probability that we end up on a particular website after a long session of random web surfing. We can do this the slow way using the code

<code>
m = 1e05;           % number of random-surfing steps
for j = 1:m
  x = T*x;          % apply the transition matrix once per step
end
</code>

or a faster equivalent way

<code>
m = 1e05;
x = (T^m)*x;        % apply T m times at once via the matrix power
</code>

After that is complete, the values of ''x'' give the rank of each website as ordered in ''U''. Look up the ''sort'' command in the Help menu and find a way to print the list of websites in order of their rank. Turn in the list of the top 10, the fraction of the time a random web surfer would spend in each of those 10, and your code to complete the Lab. Note: there is a faster way to compute the solution for very large matrices, but we will not worry about that here. Our method works fine until you reach the scale of 8,000-10,000 websites. Don't try it!
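If you get stuck on the sorting step, here is a sketch of one possible approach, assuming ''U'' is a cell array of URL strings as returned by ''surfer.m'':

<code>
% Sort the probabilities in descending order; idx holds the original indices.
[p, idx] = sort(x, 'descend');
for k = 1:10
    fprintf('%8.5f  %s\n', p(k), U{idx(k)});   % probability and URL of page ranked k
end
</code>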

**(g)** If you are interested: our calculation involved two free parameters: the probability alpha of jumping to an entirely random page in the network, and ''m'', the number that specifies the length of our long session of random web surfing. How robust is the page rank algorithm with respect to these two numbers? If you change alpha to 0.1 or 0.05, do the top 10 pages change? How about if you change ''m'' to 10^4?
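If you want to experiment, a sketch of one way to redo the calculation for several values of alpha, reusing ''H'', ''N'', ''U'', and ''m'' from above:

<code>
% Recompute the ranking for several values of alpha and report the top page.
for alpha = [0.15 0.1 0.05]
    T = (1-alpha)*H + alpha*ones(N,N)/N;
    x = zeros(N,1);  x(1) = 1;
    x = (T^m)*x;
    [p, idx] = sort(x, 'descend');
    fprintf('alpha = %.2f, top page: %s\n', alpha, U{idx(1)});
end
</code>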
  