A command-line script that, given the URL of a web page, returns a CSV list of URLs and bounding rectangles (in page, not screen, coordinates) for every link on the page (whether text or image), or an appropriate error status and message.
Load the web page in a server-side browser (QtWebKit), inspect the DOM to retrieve the links and their bounding rectangles, and output the list as 5-element CSV rows, each consisting of the URL and 4 rectangle coordinates (x,y,w,h).
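The load, inspect and output steps above might look roughly like the sketch below. It assumes Python 3 with the PyQt4 bindings for QtWebKit, which the posting does not specify; the function names, the 1024x768 viewport and the exit codes are illustrative choices, not requirements. It only inspects the main frame, so links inside iframes would need extra handling.

```python
import csv
import io
import sys


def rows_to_csv(rows):
    """Format (url, x, y, w, h) tuples as 5-column CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for url, x, y, w, h in rows:
        writer.writerow([url, x, y, w, h])
    return buf.getvalue()


def collect_link_rects(url, width=1024, height=768):
    """Load url in an off-screen QWebPage and return (href, x, y, w, h) tuples."""
    # Assumption: PyQt4 is the installed binding. Imported lazily so that
    # rows_to_csv stays usable on machines without Qt.
    from PyQt4.QtCore import QSize, QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    app = QApplication(sys.argv[:1])  # needs an X display (real or xvfb)
    page = QWebPage()
    page.setViewportSize(QSize(width, height))  # layout needs a non-empty viewport
    result = {"ok": False, "rows": []}

    def on_load_finished(ok):
        result["ok"] = ok
        if ok:
            frame = page.mainFrame()
            for el in frame.findAllElements("a[href]").toList():
                rect = el.geometry()  # relative to the containing frame
                href = frame.baseUrl().resolved(QUrl(el.attribute("href")))
                result["rows"].append(
                    (href.toString(), rect.x(), rect.y(), rect.width(), rect.height())
                )
        app.quit()

    page.loadFinished.connect(on_load_finished)
    page.mainFrame().load(QUrl(url))
    app.exec_()  # returns once on_load_finished calls app.quit()
    if not result["ok"]:
        raise RuntimeError("failed to load %s" % url)
    return result["rows"]


def main(argv):
    if len(argv) != 2:
        sys.stderr.write("usage: %s URL\n" % argv[0])
        return 2
    try:
        rows = collect_link_rects(argv[1])
    except Exception as exc:  # report any failure as an error status and message
        sys.stderr.write("error: %s\n" % exc)
        return 1
    sys.stdout.write(rows_to_csv(rows))
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Using the `csv` module rather than joining fields with commas means a URL that itself contains a comma is quoted instead of breaking the 5-column layout.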
The program may be written in Perl, Python or PHP, with as few external dependencies as possible. It must use the QtWebKit library (it needs pixel-perfect compatibility with other tools built on it) and must be callable from a shell script.
The target server runs Ubuntu Linux and has the QtWebKit Python libraries available. The program must run without any graphical output (it's not a desktop app); use of xvfb is allowed. If you have particular requirements, please check with me first. You may find the code in the [url removed, login to view] tool useful.
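One possible way to meet the no-graphical-output requirement is to start the script under `xvfb-run`, which wraps a command in a temporary virtual X server. The sketch below assumes `xvfb-run` is installed on the server; `link_rects.py` in the usage note is a hypothetical script name, not one taken from this posting.

```python
import subprocess
import sys


def headless_command(script, url):
    """Build the argument list that runs script under a throwaway X server."""
    # "-a" asks xvfb-run to pick a free display number automatically.
    return ["xvfb-run", "-a", sys.executable, script, url]


def run_headless(script, url):
    """Run the command and return the child's exit status."""
    return subprocess.call(headless_command(script, url))
```

For example, `run_headless("link_rects.py", "http://example.com/")` would return the script's own exit status, so a calling shell script can still distinguish success from failure.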
Deliverables: the script file and example output.
I'd expect this to be doable in a few hours by someone familiar with these technologies.
See: [url removed, login to view]
Example call and output:
# ./[url removed, login to view] [url removed, login to view]
[url removed, login to view],200,200,200,84
[url removed, login to view],200,400,120,20
4 freelancers are bidding an average of $120 for this job
Hello, thank you for the links to the sources provided. I understand your requirements and can code a solution, but since I am a newbie at crawling with WebKit, it will take several days to deliver. Thanks.