CSCI E-63 Big Data Analytics
Fall term 2017 CRN 15499
The emphasis of the course is on mastering two of the most important big data technologies: Spark 2 and deep learning with TensorFlow. Spark is an evolution of Hadoop and MapReduce, with massive speedup and scalability improvements. TensorFlow is Google's open-source framework for distributed, neural-network-based machine learning. The explosion of social media and the computerization of every aspect of social and economic activity result in the creation of large volumes of semi-structured data: web logs, videos, speech recordings, photographs, e-mails, Tweets, and similar data. In a parallel development, computers keep getting ever more powerful and storage ever cheaper. Today, we can reliably and cheaply store huge volumes of data, efficiently analyze them, and extract business and socially relevant information. This course familiarizes students with the most important information technologies used in manipulating, storing, and analyzing big data. We examine the basic tools for statistical analysis, R and Python, and several machine learning algorithms. We examine Spark Core, the Spark ML (machine learning) API, and Spark Streaming, which allows analysis of data in flight, that is, in near real time. We learn to use TensorFlow for several standard tasks, including regression, clustering, and classification. We learn about so-called NoSQL storage solutions, exemplified by Cassandra, for their critical features: speed of reads and writes, and the ability to scale to extreme volumes. We learn about memory-resident databases and graph databases (Spark GraphX and Neo4j). We acquire practical skills in scalable messaging systems like Kafka and Amazon Kinesis. We conduct most of our exercises in the Amazon cloud, so students master the most important AWS services. By the end of the course, students are able to initiate and design highly scalable systems that can accept, store, and analyze large volumes of unstructured data in batch mode and/or real time.
Most lectures are presented using Python examples. Some lectures use Java and R.