Looking to get this done in PHP or Perl. Requirements are as follows. Given a top-level domain, get all the URLs of that website. The list of URLs should be comprehensive and unique: if a website has 2000 pages, we should have all 2000 URLs.
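The core of this requirement is a breadth-first crawl that stays on the start domain and visits each URL exactly once. A minimal sketch of that logic (in Python for illustration, though the job asks for PHP or Perl) is below; it crawls an in-memory set of pages via a `fetch` callback, where a real implementation would fetch over HTTP:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """Breadth-first crawl; fetch(url) returns the page HTML.
    Only URLs on the start domain are followed; each is visited once,
    so the result is the comprehensive, unique URL list."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# Demo with a hypothetical in-memory "site"; a real crawler would
# replace the lambda with an HTTP fetch (and politeness delays).
site = {
    "http://example.com/":  '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="http://other.com/">out</a>',
    "http://example.com/b": '<a href="/">home</a>',
}
urls = crawl("http://example.com/", lambda u: site.get(u, ""))
```

Since the crawl can take hours on a large site, the production version would likely run as a queued background job rather than inside a single web request.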
I should be able to go to a web page to fire off the crawling of that website. Yes, the crawling could take anywhere from a few minutes to a few hours if the website has lots of pages.
It should then display the pages in the browser. For each page, I should be able to see which other pages on the site link to it, and with what anchor text. So if page A is linked from pages B, C, D, and E, I should be able to see that pages B and C link to A using anchor text "blah1", while pages D and E link to A using anchor text "blah2". The above applies to internal links within the website.
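The inbound-link view described above amounts to inverting the crawl data into a map of target page, then anchor text, then the set of source pages. A sketch of that grouping (Python for illustration; page URLs and markup are made up to mirror the B/C/D/E example):

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin

class AnchorExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags."""
    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

def inbound_links(pages):
    """pages: dict of URL -> HTML for every crawled page.
    Returns target URL -> anchor text -> set of source URLs,
    i.e. which pages link to each page and using what text."""
    index = defaultdict(lambda: defaultdict(set))
    for source, html in pages.items():
        parser = AnchorExtractor()
        parser.feed(html)
        for href, text in parser.pairs:
            index[urljoin(source, href)][text].add(source)
    return index

# Hypothetical crawl output matching the example in the text:
# B and C link to A with "blah1", D and E with "blah2".
pages = {
    "http://example.com/b": '<a href="/a">blah1</a>',
    "http://example.com/c": '<a href="/a">blah1</a>',
    "http://example.com/d": '<a href="/a">blah2</a>',
    "http://example.com/e": '<a href="/a">blah2</a>',
}
links = inbound_links(pages)
```

Here `links["http://example.com/a"]` groups the four source pages by anchor text, which is exactly the per-page report the browser view needs to render.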
Also, for each page on the website, I would like to query the Yahoo Site Explorer API <[url removed, login to view]> and get the external backlinks for every URL on that site.
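For the external backlinks, each crawled URL would be sent to the Site Explorer inbound-links service as a REST GET request. A sketch of building that request URL is below; the endpoint and parameter names are assumptions based on the historic Yahoo Site Explorer `inlinkData` service and should be checked against the current API documentation before use:

```python
from urllib.parse import urlencode

# Assumed endpoint for Yahoo Site Explorer inbound links; verify
# against the official API docs, as the service may have changed.
API_BASE = "http://search.yahooapis.com/SiteExplorerService/V1/inlinkData"

def inlink_query_url(page_url, appid, results=50):
    """Builds the request URL asking for external backlinks to page_url.
    `appid` is the developer key issued when registering for the API."""
    params = {
        "appid": appid,
        "query": page_url,
        "results": results,
        "output": "json",
    }
    return API_BASE + "?" + urlencode(params)

url = inlink_query_url("http://example.com/a", "my-app-id")
# Fetching `url` (e.g. with urllib.request) would return the backlink
# list; this is one request per crawled page, so API rate limits
# matter when a site has thousands of pages.
```

Because this fires one API call per page, the implementation should batch and throttle these requests alongside the crawl rather than issuing them all at once.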