You will be implementing a Perl or PHP web application that will
allow us to manage the programmers (scriptwriters) who
write web extraction scripts for us. We will assign
websites to scriptwriters, and they will write Perl scripts
that extract certain data from those websites and upload
the scripts to our server. We will then approve the scripts,
and after approval the scripts have to be executed as cron jobs
at given time intervals. The scriptwriters will write two types of
scripts: scripts that extract job offers from company pages, which
we will call CP scripts, and scripts that extract jobs from job boards,
which we will call JB scripts. The difference is the type
of data the scripts crawl for and what they do
with that data, but this is not your responsibility: you
only have to give the users the ability to upload
the scripts and actually run those scripts as cron jobs.
The application will have the following sections:
- Login/sign-up screen: this will allow scriptwriters
to log in to the system and to sign up for a new
account. The sign-up form must have CAPTCHA protection.
By default we will have an admin account with
access to extended features, such as adding, deleting and updating
accounts. Very important: the admin also has to approve accounts,
so by default new scriptwriter accounts are not
approved, and only approved users are taken into consideration
for the functions below.
- When a scriptwriter first logs in, he will see a screen
with his "to do" tasks: a list of the URLs he
has to extract data from. To display this you
will get access to 4 tables: the company page URLs table, the
job board URLs table, the scriptwriter to company page URLs
linking table (this table stores the scriptwriter
ID linked with the site ID, so we know which sites this
user is supposed to crawl), and the scriptwriter to job boards
linking table (this table stores the scriptwriter
ID linked with the job board ID, so we know which job boards this
user is supposed to crawl). So basically, on the main page
you have to read which company URLs and which job board URLs
are assigned to the current user and display them so he knows
which scripts he must write. Those tables will already be
there for you; they are filled in by a different application.
The user must also see the type of each URL, which can
be JB for job boards or CP for company pages.
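
As an illustration only, the "to do" list could be read with a single query over the 4 tables. The sketch below uses Python and an in-memory SQLite database just to keep it runnable; the real application is Perl or PHP against our server. All table and column names (company_urls, jobboard_urls, scriptwriter_company, scriptwriter_jobboard) and the sample URLs are assumptions, since the real schema comes from the other application.

```python
import sqlite3

# Hypothetical schema: the real table and column names are defined elsewhere.
SCHEMA = """
CREATE TABLE company_urls  (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE jobboard_urls (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE scriptwriter_company  (scriptwriter_id INTEGER, site_id INTEGER);
CREATE TABLE scriptwriter_jobboard (scriptwriter_id INTEGER, jobboard_id INTEGER);
"""

# One row per assigned URL, tagged with its type (CP or JB).
TODO_QUERY = """
SELECT 'CP' AS type, c.id AS id, c.url AS url
  FROM scriptwriter_company sc
  JOIN company_urls c ON c.id = sc.site_id
 WHERE sc.scriptwriter_id = :uid
UNION ALL
SELECT 'JB', j.id, j.url
  FROM scriptwriter_jobboard sj
  JOIN jobboard_urls j ON j.id = sj.jobboard_id
 WHERE sj.scriptwriter_id = :uid
ORDER BY 1, 2
"""


def todo_list(conn, scriptwriter_id):
    """Return (type, id, url) tuples for every URL assigned to the user."""
    return conn.execute(TODO_QUERY, {"uid": scriptwriter_id}).fetchall()


def demo_connection():
    """Build an in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO company_urls VALUES (?, ?)",
        [(1, "http://company-a.example/jobs"),
         (2, "http://company-b.example/careers")],
    )
    conn.execute(
        "INSERT INTO jobboard_urls VALUES (?, ?)",
        (10, "http://jobboard-x.example/"),
    )
    conn.executemany(
        "INSERT INTO scriptwriter_company VALUES (?, ?)", [(35, 1), (36, 2)]
    )
    conn.execute("INSERT INTO scriptwriter_jobboard VALUES (?, ?)", (35, 10))
    return conn


if __name__ == "__main__":
    for row in todo_list(demo_connection(), 35):
        print(row)
```

The type column is produced by the query itself, so the main page can show CP or JB next to each URL without a separate lookup.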
- Next we need a feature to upload the scripts written by the user,
so for each JB/CP URL we need an upload feature. The user will
write a separate Perl script for each URL. The CP scripts
will extract data and save it in XML format on disk at a
specified path; the JB scripts will write the extracted data to a specified
table on our server. The uploaded scripts have to be stored
on disk on the server, so create a folder
/home/uploaded_scipts. Inside it we will have
/home/uploaded_scipts/cp/$id/ and /home/uploaded_scipts/jb/$id/,
where $id is the ID of the user. For example, if the user ID
is 35, then we will upload all his CP scripts to
/home/uploaded_scipts/cp/35/ and all his JB scripts to
/home/uploaded_scipts/jb/35/.
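
The folder layout above can be sketched as a small path helper. This is only an illustration in Python (the real application is Perl or PHP); the function names upload_dir and upload_destination are hypothetical, and stripping the directory part of the uploaded file name is an added safety assumption, not something the layout itself requires.

```python
from pathlib import PurePosixPath

# Root folder exactly as named in the layout above.
UPLOAD_ROOT = PurePosixPath("/home/uploaded_scipts")
SCRIPT_TYPES = {"cp", "jb"}


def upload_dir(script_type, user_id):
    """Return the folder for one user's scripts of the given type."""
    kind = script_type.lower()
    if kind not in SCRIPT_TYPES:
        raise ValueError("script type must be 'cp' or 'jb'")
    if not isinstance(user_id, int) or user_id <= 0:
        raise ValueError("user id must be a positive integer")
    return UPLOAD_ROOT / kind / str(user_id)


def upload_destination(script_type, user_id, filename):
    """Return the full path where an uploaded script would be stored.

    Only the last component of the client-supplied name is kept, so a
    name such as '../../etc/passwd' cannot leave the user's folder.
    """
    safe_name = PurePosixPath(filename).name
    if not safe_name or safe_name == "..":
        raise ValueError("invalid file name")
    return upload_dir(script_type, user_id) / safe_name


if __name__ == "__main__":
    print(upload_dir("cp", 35))
    print(upload_destination("jb", 35, "site_12.pl"))
```

Keeping the path logic in one place means the upload feature and the later cron job setup both refer to the same locations.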
- We also need an easily editable page with
instructions for the scriptwriters. For JB scripts
they must know the database server configuration that their scripts
will have to use to connect to our
MySQL server after they upload the files to our server (use
some dummy data for the moment). For CP scripts we need to display
the path on the server where the XML files should be saved.
By default this will be /home/jmxml/$id/import_$siteid_$[login to view URL],
where $id is the scriptwriter ID, $counter is a variable they
increment by one for each new file, and $siteid is the site ID
for that company URL. We need to display all this information to the
scriptwriters so they know how to code their scripts.
- Next, for the admin user, we need a screen where all scripts
can be approved. By default scripts are not approved.
When a script is approved it has to be set to run as a cron job,
and we should be able to select the time interval from:
-> 24 hours (every day)
-> 48 hours (every second day)
-> 72 hours (every third day)
-> 168 hours (once a week)
-> 336 hours (every 2 weeks)
-> 720 hours (once a month, taken as 30 days)
You have to find a solution to automatically set all scripts in
/home/uploaded_scipts/ to start based on the intervals we select
at approval. By default JB scripts will be executed every 7 days
and CP scripts once per day.
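
One possible solution, given only as a sketch: a plain crontab line cannot state "every 48 hours" exactly, because a step in the day-of-month field starts over at each month boundary. A single cron entry could instead start a small dispatcher every hour, and the dispatcher runs each approved script whose interval has elapsed since its last run. The example is in Python for illustration only (the real application is Perl or PHP); the record fields path, type, approved, interval_hours and last_run are assumptions about how the approval screen stores its data.

```python
from datetime import datetime, timedelta

# Defaults stated above: JB scripts every 7 days, CP scripts once per day.
DEFAULT_INTERVAL_HOURS = {"jb": 7 * 24, "cp": 24}


def is_due(last_run, interval_hours, now):
    """True if the script has never run or its interval has elapsed."""
    if last_run is None:
        return True
    return now - last_run >= timedelta(hours=interval_hours)


def due_scripts(scripts, now):
    """Return the paths of approved scripts that should run now.

    Each script is a dict with the hypothetical keys 'path', 'type'
    ('cp' or 'jb'), 'approved', 'interval_hours' (None means use the
    default for the type) and 'last_run' (datetime or None).
    """
    due = []
    for script in scripts:
        if not script["approved"]:
            continue  # unapproved scripts are never executed
        interval = script["interval_hours"] or DEFAULT_INTERVAL_HOURS[script["type"]]
        if is_due(script["last_run"], interval, now):
            due.append(script["path"])
    return due


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)
    scripts = [
        {"path": "/home/uploaded_scipts/cp/35/a.pl", "type": "cp",
         "approved": True, "interval_hours": None,
         "last_run": now - timedelta(hours=25)},
        {"path": "/home/uploaded_scipts/jb/35/b.pl", "type": "jb",
         "approved": True, "interval_hours": None,
         "last_run": now - timedelta(days=3)},
    ]
    print(due_scripts(scripts, now))
```

With this approach the crontab holds only one line for the dispatcher, and changing a script's interval at approval time needs no crontab edit. The dispatcher would still have to execute each returned path and record the new last-run time, which is left out here.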