Big Data project

Need: We are looking for a team of two skilled developers to work on a big data project covering the three tasks described below.

Objective: Scale the platform for future needs and address the current limitations in data retrieval by building a mini search engine with Elasticsearch.

Minimizing cost is an important factor, which motivated the choice of open-source tools:

- Apache NiFi

- Apache Spark

- Elassandra (Cassandra + Elasticsearch)

User stories:

User story 1: As a user, I want to be able to import a large CSV file into Cassandra.

User story 2: As a user, I want to be able to search on a specific field (mail, id ...) and see the first ten matches displayed.

User story 3: As a user, I want to be able to export a CSV file containing the results of the query I entered in the search engine.

Role of the project's three main functionalities:

The user will have access to a simple graphical user interface offering three options:

- Import: lets the user import large data files.

- Export: generates a text file according to a filter (for example a table name or a property) taken from the search input field.

- Search: filters the data and displays at most 10 rows that match the search query.

Note: This project generates only one view for the user based on the input in the search field.
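
For illustration only, the Search behaviour described above could be expressed as an Elasticsearch query body. This is a minimal sketch, not a required design: the helper name build_search_body and the sample field "mail" are hypothetical, and whether a "match" or a "term" query is appropriate depends on the index mapping, which this brief does not specify.

```python
def build_search_body(field, value, max_rows=10):
    """Build an Elasticsearch query DSL body for a single-field search.

    `size` caps the number of hits returned, which corresponds to the
    "at most 10 rows" requirement. The field name is an assumption.
    """
    if not field or value is None:
        raise ValueError("field and value are required")
    return {
        "size": max_rows,
        "query": {"match": {field: value}},
    }


# Hypothetical example: search the "mail" field for one address.
body = build_search_body("mail", "alice@example.com")
```

The resulting dictionary would be sent as the request body of a search call against the index; the same body could be reused by the Export function without the row limit.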

Expected features:

1) Simple graphical interface with three main controls:

an Import button, an Export button (the export can be a whole collection, depending on the filter and the query), and a Search field (to filter).

• Level of priority: Medium

• Expected programming skills: HTML / CSS / Flask (preferred but not required)

2) Set up an Elassandra cluster (2 databases). Load a very large data file into the cluster to achieve distributed storage, then export the result of a specific query to a CSV file.

• Level of priority: High

• Expected programming skills: Python, Elassandra NoSQL (Cassandra + Elasticsearch), Apache NiFi
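
As a rough idea of the load and export steps in feature 2, the sketch below reads a large CSV in fixed-size batches (so the whole file never sits in memory) and writes query results back out as CSV. It uses only the Python standard library; the function names, the batch size, and the sample columns are assumptions, and the actual write to the cluster (for example through NiFi or a Cassandra driver) is deliberately left out.

```python
import csv
import io
from itertools import islice


def iter_batches(csv_file, batch_size=1000):
    """Yield lists of row dicts, at most `batch_size` rows per list."""
    reader = csv.DictReader(csv_file)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def export_rows(rows, fieldnames, out_file):
    """Write an iterable of dicts as CSV and return the number of rows written."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical sample data standing in for a large file and a query result.
sample = io.StringIO("id,mail\n1,a@example.com\n2,b@example.com\n3,c@example.com\n")
batches = list(iter_batches(sample, batch_size=2))
out = io.StringIO()
written = export_rows(batches[0], ["id", "mail"], out)
```

In a real import each batch would be handed to the cluster writer before the next one is read; a real file should be opened with newline="" as the csv module documentation recommends.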

3) Server side for data processing. This has two separate main goals:

First: If one of the special fields already exists, upsert the missing fields from the new record into the existing record, and vice versa.

Second: Implement the search algorithm that returns the expected row, table, or even field.

• Level of priority: High

• Expected programming skills: Python, Apache Spark, Elassandra (Cassandra + Elasticsearch)
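
One possible reading of the upsert goal in feature 3 is sketched below in plain Python: when a new record matches an existing one on a special field, each side fills in the fields the other is missing. The names merge_records and is_missing, the sample fields, and the rule that the existing value wins when both records hold a value are assumptions; the brief does not say how conflicts should be resolved, and in the project itself this logic would presumably run in Spark.

```python
def is_missing(value):
    """Treat None and the empty string as a missing field (an assumption)."""
    return value is None or value == ""


def merge_records(existing, incoming):
    """Return a record combining both inputs.

    Fields missing from `existing` are taken from `incoming`; fields
    present in both keep the existing value (assumed conflict rule).
    """
    merged = dict(existing)
    for field, value in incoming.items():
        if is_missing(merged.get(field)) and not is_missing(value):
            merged[field] = value
    return merged


# Hypothetical records sharing the special field "id".
existing = {"id": "42", "mail": "a@example.com", "phone": ""}
incoming = {"id": "42", "mail": "other@example.com", "phone": "555-0100", "city": "Lyon"}
merged = merge_records(existing, incoming)
```

Here the merged record keeps the existing mail, takes the phone and city from the incoming record, and leaves both input dictionaries unchanged.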

Expected result: The project will be deemed successful if the user stories are met and the database fields are updated as described in the previous section.

Skills: Hadoop, Elasticsearch, Cassandra, Spark

