I have a .json file of approximately 12 MB (completely unstructured) that I want to load into R for clustering, and then use the clustered output in Hadoop.
I am almost done, but a few issues remain to debug; it should be about an hour's work.
I need to submit this assignment at college.
You are working for a micro-blogging social media company that manages and analyses a large number of social media messages in real time. Each of these messages is short in length (a limited number of characters) and has a number of other features associated with it (such as the author, time created, location, etc.).
The company has asked you to develop a prototype solution that will store all the tweets in a distributed file system (HDFS) and perform some analytics on the data. You will be given a file containing a large number of tweets which you must store in an appropriate manner. Upload the data given in the file tweetsStream.zip onto the HDFS. Using appropriate tools, analyse the data as per the stages given below.
The file tweetsStream.zip can be downloaded from Moodle.
a) Extract a large sample of micro-blogs (records) from those stored in your distributed data store and load them into an R data frame. (16%)
b) Perform k-means clustering on this sample by clustering on the actual textual contents of each record. Choose K appropriately and give a brief justification as to why you choose this value. (17%)
c) Create a word-cloud for each of the individual clusters. What measure of term-importance is ascribed to each of the terms in the word-clouds? Discuss the output of each of the word-clouds in terms of the ability of k-means to cluster appropriately. (17%)
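The three R stages above can be sketched as follows. This is a minimal toy illustration, not a full solution: the file name, the in-line stand-in texts, and the choice of K = 2 are assumptions, and term frequency (rather than, say, tf-idf) is used as the importance measure for the word clouds. The `wordcloud` package is only used if installed; everything else is base R.

```r
set.seed(42)

# (a) Load a sample of records into a data frame. In practice the lines would
# come from the distributed store, e.g.
#   hdfs dfs -cat /tweets/part-* | shuf -n 10000 > sample.json
texts <- c("big data hadoop cluster",
           "hadoop hdfs data store",
           "football match goal score",
           "goal score football win")          # stand-in for parsed tweet text
tweets <- data.frame(text = texts, stringsAsFactors = FALSE)

# (b) Build a simple term-frequency matrix and run k-means on it.
tokens <- strsplit(tolower(tweets$text), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
km <- kmeans(dtm, centers = 2, nstart = 10)   # K = 2 chosen for this toy data
tweets$cluster <- km$cluster

# (c) One word cloud per cluster; raw term frequency is the importance measure.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  for (k in sort(unique(tweets$cluster))) {
    freq <- colSums(dtm[tweets$cluster == k, , drop = FALSE])
    wordcloud::wordcloud(names(freq), freq, min.freq = 1)
  }
}
```

On real data, K would be justified empirically, e.g. by plotting total within-cluster sum of squares against K and looking for an elbow.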
Using Pig or Hive or both analyse the data according to the requirements given in (a) to (i):
a) Give details of any tools or any code that you may have used to prepare the data.
Give reasons and justification for your methodology. (5%)
b) How have you structured the data (the schema)? (5%)
c) How many records are in the file and how many records were blank? (6%)
d) How many tweets are in English? (5%)
e) Break down the tweets in English by country. (6%)
f) Which top four countries have the most active tweeters? (6%)
g) Name the top 10 tweeters and the number of tweets. (5%)
h) List the geo-locations (longitude and latitude) of all the tweeters in the UK, and another list for Europe. Use these lists to create two heat maps.
i) Discuss your results from (a) – (h). (6%)
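Several of the counting requirements above map directly onto HiveQL. The sketch below assumes a hypothetical table `tweets` with columns `username`, `text`, `lang`, `country` (the real schema depends on how the JSON is loaded, which is itself part of deliverable (b)):

```sql
-- (c) total records, and how many have a blank text field
SELECT COUNT(*),
       SUM(CASE WHEN text IS NULL OR text = '' THEN 1 ELSE 0 END)
FROM tweets;

-- (d) tweets in English
SELECT COUNT(*) FROM tweets WHERE lang = 'en';

-- (e) English tweets broken down by country
SELECT country, COUNT(*) FROM tweets WHERE lang = 'en' GROUP BY country;

-- (g) top 10 tweeters and their tweet counts
SELECT username, COUNT(*) AS n
FROM tweets
GROUP BY username
ORDER BY n DESC
LIMIT 10;
```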
The coursework will consist of a report that you will need to upload onto the Teachmat coursework upload page. Your report must contain the following sections:
a) Describe how you created your sample in deliverable 1 (a).
b) Briefly describe the k-means algorithm and explain how you set K.
c) Show screen shots of each of the clusters. Discuss the output of these clusters with respect to the underlying clustering algorithm.
d) Outline some future modifications you might make to the analytics section that might be beneficial for the company (or the micro-bloggers). What insights would be brought about by these modifications?
Details from deliverable 1 (a) – (i).
• You will be required to pre-process the data before uploading to the HDFS. You may use any tool of your choice to query the data.
• It is suggested that you use R for the analytics section.
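The pre-processing step in the first bullet can be as simple as dropping blank lines before the upload. A sketch, using a tiny stand-in file in place of the unzipped tweetsStream.zip (the file names and HDFS path are assumptions):

```shell
set -eu
# stand-in for the unzipped tweet file; real data comes from tweetsStream.zip
printf 'tweet one\n\ntweet two\n\n' > tweets.json
# drop blank lines before uploading
grep -v '^[[:space:]]*$' tweets.json > tweets_clean.json
wc -l < tweets_clean.json
# on a real cluster, the cleaned file would then be uploaded:
#   unzip tweetsStream.zip
#   hdfs dfs -mkdir -p /user/me/tweets
#   hdfs dfs -put tweets_clean.json /user/me/tweets/
```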