1) I have a threaded Java application (I'm also in the process of writing it in Python) that spiders the URLs it finds on a news site. It does this by visiting the website, collecting the URLs present on the HTML page, and placing those URLs into an array. (I've left out the boring details.)
2) Next, my spider will go to a URL from the above array that it determines is a news article (I've got this part covered via regular expressions, and eventually Bayesian statistics).
3) Once at the news article, it will grab the first 20 or so words of text (50 - 100 chars), relying 100% on regular expressions, and place the headline title, the URL, and the text into an array.
4) Finally, skipping the boring details, the end result is something similar to any news aggregator out there: a regularly updated JSP page that has a headline title, a URL, and a short summary.
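To make step 1 concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not the actual spider: the names LinkCollector and collect_urls are hypothetical, and it assumes the page HTML has already been downloaded into a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.urls = []            # the "array" of URLs from step 1

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Keep each URL once, in the order it was found.
                if absolute not in self.urls:
                    self.urls.append(absolute)


def collect_urls(html, base_url):
    """Return the list of absolute link URLs found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.urls
```

Deciding which of the collected URLs are news articles (step 2) is left out here, since that part is already covered.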
OK, so where do you come in?
I need a code snippet, written in either Java or Python, that will do step 3 automatically. Please note, this is not simply a matter of writing regular expressions for one news site and thinking that the job is done. I want step 3 to work on any news site in the world, much the same way that step 1 (collecting URLs) will work on any website in the world.
I've successfully written regular expressions for 4 different news sites. This is not difficult, but it is certainly tedious.
So here is my request in pseudo code:
Assumption is that the page the code is looking at is a news article.
Read page line by line or all at once
Replace all tags (HTML, SCRIPT, etc.) with ""
Find what it determines (this is your work) to be the text body
Read text body
Take out a portion of text body
Place portion into a variable
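One possible reading of the pseudocode above, sketched in Python with only the standard library. Instead of per-site regular expressions it uses a simple heuristic: split the page into text blocks at block-level tags, skip script and style content, and take the first block that has enough words and little link text. Everything here is an assumption made for illustration: the names BlockExtractor and extract_summary, the tag sets, and the thresholds (15 words, 30% link text, 20-word excerpt) are guesses that would need tuning, and the heuristic will certainly miss on some sites.

```python
from html.parser import HTMLParser

# Tags whose text content is never article text (assumed list).
SKIP_TAGS = {"script", "style", "noscript", "title"}
# Tags that end one text block and start another (assumed list).
BLOCK_TAGS = {"p", "div", "td", "li", "br", "h1", "h2", "h3",
              "table", "tr", "ul", "ol", "blockquote"}


class BlockExtractor(HTMLParser):
    """Splits a page into (text, link_chars) blocks with tags removed."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters of that block inside <a> tags
        self._skip_depth = 0
        self._in_link = 0

    def _flush(self):
        # Collapse whitespace and store the block if it is not empty.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_summary(html, min_words=15, max_link_ratio=0.3, max_words=20):
    """Return the first max_words words of the first block that looks like body text."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    for text, link_chars in parser.blocks:
        words = text.split()
        if len(words) >= min_words and link_chars / len(text) <= max_link_ratio:
            return " ".join(words[:max_words])
    return ""  # nothing on the page looked like article text
```

The returned string is the "portion" from the last two lines of the pseudocode; an empty string signals that the heuristic found no candidate block, so the caller can fall back to a site-specific rule.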
At this point my code takes over, making sure that the above variable is placed in the array and associated with the correct title and URL. If you want, you could do that part too. It doesn't matter; it should only be a few seconds' work.
Conclusion: It must work on any news article HTML page from any news website in the world (well, if it works in English only, that is 90% of the battle).
Keep in mind that the regular expressions that indicate the beginning and end of the article text body are different for each news site. It is your job to solve that problem.