CSCI E-63 Big Data Analytics
This course emphasizes mastering the most important big data technology: Spark 2 and its various application programming interfaces (APIs). Spark is an evolution of Hadoop and MapReduce that offers substantial speed and scalability improvements, largely by processing data in memory. The explosion of social media and the computerization of every aspect of social and economic activity result in the creation of large volumes of semi-structured and unstructured data: web logs, videos, speech recordings, photographs, e-mails, Tweets, and similar data. In a parallel development, computers keep getting more powerful and storage cheaper. Today, with Spark 2, we can reliably and cheaply store huge volumes of data, analyze them efficiently, and extract business and socially relevant information. In this course students examine Spark Core and Spark Streaming, which allows analysis of data in flight, that is, in near real time. Students learn how to use Spark GraphX, Spark's API for in-memory graph processing, to analyze highly connected data. Students acquire practical skills in scalable messaging systems like Kafka and Akka and learn to integrate Spark with several NoSQL systems. Students conduct some exercises in the cloud and master the most important cloud services. At the end of the course, students are able to initiate and design highly scalable systems that can accept, store, and analyze large volumes of unstructured data in batch mode, in real time, or both. This is not a course in statistics, but some of the most essential statistical techniques are introduced and applied.