Hi guys! I want to build a database storing a selection of web-published articles (for use either on my site or in print), but I want to be able to replace the original publishing stylesheet for readability purposes. Moreover, I would like to keep only the body of the article, without all the other columns that the newspaper presents on its site.

Here is a typical article of the specific newspaper as it was originally published: [url removed, login to view] As you can see, there is a menu on the left, another on the right, a header and a banner at the top, and a complicated table containing the article in the middle. I want to strip all of these and get just the article, styled with my own stylesheet properties.

Legal: My organization takes full responsibility for presenting a clear reference to the source, which is perfectly legal in this case. Moreover, part of this project is to create a custom label that will be attached to each article, containing the name of the newspaper, the publishing date, section, subsection, author, original title and a brief description of the category that I will be assigning to each article.
**EDITED ON AUGUST 11, PLEASE READ**
Up to now I have been receiving only general presentations from software developers, but no specific proposals.
So, next I will provide some more details to help all of you judge whether you can undertake the job or not. **Please tell me whether you can do it and with which programming language, also giving a bid amount.** Please avoid Microsoft-based solutions. Thank you.
My boss likes buying (printed) newspapers and collecting articles. Every weekend he reads two specific newspapers (always the same two) and he cuts out the interesting articles with scissors. (Really!) Then he gives them to me and tells me to find them in the respective web-published edition. What I have to do is copy the article, paste it into MS Word and then format it with the style we use for our other documents, so that these articles can be reprinted and distributed to the public. (It's an educational organization, so we usually publish material that might be of interest to intellectuals, students of social problems etc.)

Then there was the internet! It provides a very good opportunity to digitize, categorize and share this collection, which for the moment counts about 1500 articles and keeps growing.

As I said before, when I get the newspaper clippings, I have in front of me their newspaper name, date, author and article title. Usually these newspapers have certain categories, which may or may not be known to me, since I only get a small piece of cut-out paper. Sometimes I try guessing the category from the content of the article or from the author. (Certain authors write every week in the same column.)

What I do is browse for the specific articles and, once I have their URL, I hit the link which says "print this page". This shows me just the article without the menus, header, footer and banners. From there it's very easy, though manual, to view the HTML source and strip this tag from it: `<link rel="StyleSheet" href="[url removed, login to view]" type="text/css">` With this method the page retains its tags and class names, but their styles are no longer defined. So, I have to open the file with MS Word and define the new properties of each span class that I need.
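
To illustrate, the manual stripping step described above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the function name `replace_stylesheet` and the file name `mystyle.css` are hypothetical, and the regular expression assumes the "print this page" version references its stylesheet through ordinary `<link rel="stylesheet" ...>` tags.

```python
import html
import re

# Matches any <link ...> tag whose rel attribute is "stylesheet" (any letter case).
STYLESHEET_LINK = re.compile(
    r"<link\b[^>]*\brel\s*=\s*[\"']?stylesheet[\"']?[^>]*>",
    re.IGNORECASE,
)
HEAD_CLOSE = re.compile(r"</head>", re.IGNORECASE)


def replace_stylesheet(page: str, new_href: str) -> str:
    """Drop the newspaper's stylesheet links and attach our own stylesheet.

    The article markup (spans, classes) is left untouched, so the class
    names can be restyled in the new stylesheet.
    """
    stripped = STYLESHEET_LINK.sub("", page)
    own_link = (
        f'<link rel="stylesheet" href="{html.escape(new_href, quote=True)}" '
        'type="text/css">'
    )
    if HEAD_CLOSE.search(stripped):
        # Insert our link just before the first closing </head> tag.
        return HEAD_CLOSE.sub(lambda m: own_link + m.group(0), stripped, count=1)
    # No <head> section found: fall back to prepending the link.
    return own_link + stripped
```

Called as `replace_stylesheet(page_source, "mystyle.css")`, this would return the same page with only the stylesheet reference swapped, so the span classes could be defined once in `mystyle.css` instead of once per article in Word.
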
But this needs to be done for every article! There are about ten articles from each of the two newspapers each week. The respective sites are: [url removed, login to view] and [url removed, login to view] The second newspaper doesn't use any ugly stylesheets, so I don't have to strip anything, just save the page and format its layout later.

I was thinking of an "engine" with a form: once I enter the article URL in that form, the engine would do all the rest. Is that possible?

The finally formatted articles should be stored in a database with a field for each of their properties: newspaper, date, article title, author, newspaper column, subcolumn and indexing category. (This last one is assigned afterwards by my boss, according to the subject that the article deals with, e.g. history, technology, society, etc.)
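
As a rough idea of the storage side, the fields listed above could map onto a single table like the one below. This is a minimal sketch using Python's built-in `sqlite3` module, not a required design: the table and column names, the helper functions `open_db`, `add_article` and `set_category`, and the extra `source_url` and `body_html` columns are all assumptions made for illustration. The category is left empty at insert time because it is assigned afterwards.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id          INTEGER PRIMARY KEY,
    newspaper   TEXT NOT NULL,
    pub_date    TEXT NOT NULL,        -- stored as text, e.g. YYYY-MM-DD
    title       TEXT NOT NULL,
    author      TEXT,
    section     TEXT,                 -- newspaper column, may be unknown
    subsection  TEXT,                 -- subcolumn, may be unknown
    category    TEXT,                 -- indexing category, assigned later
    source_url  TEXT NOT NULL UNIQUE, -- assumption: kept for the source reference
    body_html   TEXT NOT NULL         -- assumption: the reformatted article
);
"""


def open_db(path: str) -> sqlite3.Connection:
    """Open (or create) the article database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_article(conn, *, newspaper, pub_date, title, source_url, body_html,
                author=None, section=None, subsection=None) -> int:
    """Store one formatted article and return its row id."""
    cur = conn.execute(
        "INSERT INTO articles (newspaper, pub_date, title, author, section,"
        " subsection, source_url, body_html) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (newspaper, pub_date, title, author, section, subsection,
         source_url, body_html),
    )
    conn.commit()
    return cur.lastrowid


def set_category(conn, article_id: int, category: str) -> None:
    """Fill in the indexing category once it has been decided."""
    conn.execute("UPDATE articles SET category = ? WHERE id = ?",
                 (category, article_id))
    conn.commit()
```

The form-based "engine" would then only need to fetch the URL, clean the page, and call something like `add_article(...)`; `set_category(...)` would be used later, when the subject category is known.
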