CSCI E-63 Big Data Analytics
This course focuses on mastering a leading big data technology, Spark 2, and its application programming interfaces (APIs). Spark evolved from Hadoop and the MapReduce model and, by keeping working data in memory, runs many workloads substantially faster and scales more readily. The explosion of social media and the computerization of every aspect of social and economic activity have created large volumes of semi-structured and unstructured data: web logs, videos, speech recordings, photographs, e-mails, Tweets, and similar data. In a parallel development, computers keep getting more powerful and storage keeps getting cheaper. Today, with Spark 2, we can reliably and cheaply store huge volumes of data, analyze it efficiently, and extract information of business and social relevance. In this course students examine Spark Core, the Spark machine learning (ML) API, and Spark Streaming, which allows analysis of data in flight, that is, in near real time. Students learn how to use Spark GraphX, an in-memory graph processing API, to analyze highly connected data. Students acquire practical skills in scalable messaging systems such as Kafka and Akka and learn to integrate Spark with NoSQL systems. Students conduct some exercises in the Amazon cloud so that they can master two of the most important Amazon Web Services offerings, EC2 and S3. By the end of the course, students are able to design and build highly scalable systems that can accept, store, and analyze large volumes of unstructured data in batch mode, in real time, or both.