I have been working in the area of big data analytics for more than 5 years. I have experience handling various machine learning and natural language processing projects, and I have trained and built models using Mahout, Weka, R, Mallet, GATE, OpenNLP, and Stanford NLP. I also have experience training algorithms for large-scale data analysis using Hadoop: I have implemented algorithms in map-reduce and tested the applications in multi-node environments of more than 50 large instances. I have worked with algorithms such as SVM, Naive Bayes, neural networks, decision trees, clustering, and LDA, and I have used NoSQL databases in distributed environments. I have set up web applications on Amazon EC2 and scaled them to handle as many as 10,000 users using Amazon Elastic Load Balancing and Auto Scaling. I am also exploring other map-reduce based architectures such as Disco, H2O, and RHadoop. Before I start, I would like to understand the specific work that needs to be done:
1) What is the training set that is going to be used for the project?
2) Should the algorithm be implemented in a map-reduce architecture?
3) What is the size of the dataset the model will be trained on?
4) Is it a large-scale web application that also needs to be scaled up/down according to user demand?
5) What training/testing accuracy (prediction success rate) does the model need to achieve?
Once I have a good understanding of the application, I can move forward with working on this project.
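To give a concrete idea of the kind of map-reduce implementation mentioned above, here is a minimal Hadoop MapReduce sketch that computes per-class token counts, which are the sufficient statistics for training a Naive Bayes text classifier. The class names and the tab-separated "label, text" input format are illustrative assumptions, not details of the actual project.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative job: counts (label, token) pairs from input lines of the form "label<TAB>text".
    public class NaiveBayesCountJob {

        public static class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text outKey = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split("\t", 2);
                if (parts.length < 2) {
                    return; // skip malformed lines
                }
                String label = parts[0];
                for (String token : parts[1].toLowerCase().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    outKey.set(label + ":" + token); // key = label:token
                    context.write(outKey, ONE);
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable total = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                total.set(sum);
                context.write(key, total);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "naive-bayes-counts");
            job.setJarByClass(NaiveBayesCountJob.class);
            job.setMapperClass(CountMapper.class);
            job.setCombinerClass(SumReducer.class); // safe as a combiner because the sums are associative
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same mapper/reducer structure scales from a single node to the multi-node clusters described above, and the counts it produces can be turned into class priors and per-class token probabilities in a small follow-up step.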