I have a .csv (Excel) spreadsheet with over 40,000 rows and three relevant columns. The first is an ID. The second and third are the title text and body text for each row. I am looking for someone to create "bag of words" .csv files in the following manner. Four .csv files are needed:
1) First csv: title text only, from all 40k or so rows, where each column is a unique word and each entry is the number of times that word occurs in that row's title.
2) Second csv: the same thing, but for body text (this should be a wider matrix).
3) Third csv: the title and body texts combined as one.
4) Fourth csv: title text only, with single words as columns and also 2-word phrases as columns. For example, "The cats are cats." would produce columns called: the, cats, are, the cats, cats are, are cats. The entries would be 1, 2, 1, 1, 1, 1.
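To make the expected layout concrete, here is a minimal Python sketch of the counting described in the four items above. It is only an illustration under our own assumptions, not a required implementation: the function names `tokenize`, `count_terms`, and `build_matrix` are hypothetical, the tokenizer simply lowercases and keeps runs of letters, digits, and apostrophes, and the output is a dense matrix, which may be impractically large for 40,000 rows with a wide vocabulary.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase everything and keep runs of letters, digits,
    # and apostrophes. This drops punctuation such as a trailing period.
    return re.findall(r"[a-z0-9']+", text.lower())


def count_terms(text, max_n=1):
    # Count every run of 1..max_n consecutive words.
    # max_n=1 gives single words only; max_n=2 adds 2-word phrases.
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def build_matrix(rows, max_n=1):
    # rows: iterable of (id, text) pairs.
    # Returns a header row and one count row per input row.
    per_row = [(row_id, count_terms(text, max_n)) for row_id, text in rows]
    vocab, seen = [], set()
    for _, counts in per_row:
        for term in counts:
            if term not in seen:  # keep columns in first-seen order
                seen.add(term)
                vocab.append(term)
    header = ["id"] + vocab
    matrix = [[row_id] + [counts.get(term, 0) for term in vocab]
              for row_id, counts in per_row]
    return header, matrix


# The example from item 4: single words plus 2-word phrases.
header, matrix = build_matrix([(1, "The cats are cats.")], max_n=2)
print(header)  # ['id', 'the', 'cats', 'are', 'the cats', 'cats are', 'are cats']
print(matrix)  # [[1, 1, 2, 1, 1, 1, 1]]
```

With `max_n=1` the same function would cover the first three files (title only, body only, and title and body joined into one string per row). The header and rows could then be written out with the standard `csv` module.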
Each csv is a matrix. We are willing to award this project very quickly. Preference will be given to speed, to the ability to get rid of "garbage words" (the, a, etc.) and periods at the end of words, and in general to anyone with experience doing this. Anyone with good ideas for making the matrices cleaner is also very welcome, for example by reducing words to their roots (booked or books --> book). We are open to other new ideas too.
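As a rough illustration of the cleanup we mean, the sketch below drops a few "garbage words" and reduces words to a root before counting. Everything in it is an assumption made for the example: `STOP_WORDS` is a tiny hand-picked list, and `naive_stem` is a deliberately crude, hypothetical suffix stripper standing in for a real stemmer (such as a Porter stemmer), which would handle far more cases correctly.

```python
import re

# Illustrative only: a real stop-word list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}


def naive_stem(word):
    # Crude stand-in for a real stemmer: strip one common suffix,
    # but only if at least three characters remain.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def clean_tokens(text):
    # Lowercase, drop punctuation, remove stop words, then stem.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(clean_tokens("The booked books."))  # ['book', 'book']
```

A suffix stripper this simple will mangle some words (for instance, it turns "houses" into "hous"), which is one reason we would prefer someone with experience choosing a proper stemming or lemmatization approach.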
Thank you! The file will be sent once the project is accepted.