The recent explosion of social media and the computerization of every aspect of economic activity resulted in the creation of big data: mountains of mostly unstructured data in the form of web logs, videos, speech recordings, photographs, e-mails, and tweets. In a parallel development, computers kept getting ever more powerful and storage ever cheaper. Today, we have the ability to reliably and cheaply store huge volumes of data, efficiently analyze them, and extract business and socially relevant information. This course brings together several key information technologies used in manipulating, storing, and analyzing big data. We look at R, a basic tool for statistical analysis, and at key methods used in machine learning. We review MapReduce techniques for parallel processing and Hadoop, an open source framework that allows us to cheaply and efficiently implement MapReduce on Internet-scale problems. We touch on related tools that provide SQL-like access to unstructured data: Pig and Hive. We analyze so-called NoSQL storage solutions, exemplified by HBase, for their critical features: speed of reads and writes, data consistency, and ability to scale to extreme volumes. We examine memory-resident databases and streaming technologies, which allow analysis of data in real time. We work with the public cloud as a virtually unlimited resource for big data analytics. Students gain the ability to design highly scalable systems that can accept, store, and analyze large volumes of unstructured data in batch mode and/or real time. The techniques acquired can be applied profitably in a variety of fields. (4 credits)
knowledge of Java.