An Introduction with Applications in Data Science

This is a textbook in probability in high dimensions with a view toward applications in data sciences. It is intended for doctoral and advanced master's students and beginning researchers in mathematics, statistics, electrical engineering, computer science, computational biology and related areas who are looking to expand their knowledge of theoretical methods used in modern research in data sciences.

Data sciences are moving fast, and probabilistic methods often provide a foundation and inspiration for such advances. A typical graduate probability course is no longer sufficient to acquire the level of mathematical sophistication that is expected from a beginning researcher in data sciences today. This book aims to partially fill this gap. It presents some of the key probabilistic methods and results that should form an essential toolbox for a mathematical data scientist. This book can be used as a textbook for a basic second course in probability with a view toward data science applications. It is also suitable for self-study.

The essential prerequisites for reading this book are a rigorous course in probability theory (at the master's or Ph.D. level), an excellent command of undergraduate linear algebra, and general familiarity with basic notions about Hilbert and normed spaces and linear operators. Knowledge of measure theory is not essential but would be helpful.

Buy it on Amazon or from Cambridge University Press.

Download the final draft of the book for free:

(Warning: large file, please be patient with download.)

Use this draft at your own risk, and only for your personal and classroom needs. Please do not distribute the copy.

Here are a few useful sources, which cover some of the material that is included in the textbook. Some of them require more advanced background than this textbook does.

- R. Vershynin, Four lectures on probabilistic methods for data science. 2016 PCMI Summer School, AMS, to appear.
- R. Vershynin, Introduction to the non-asymptotic analysis of random matrices. Compressed sensing, 210–268, Cambridge Univ. Press, Cambridge, 2012.
- P. Rigollet, High-dimensional statistics. Lecture notes, Massachusetts Institute of Technology, 2015.
- A. Bandeira, Ten lectures and forty-two open problems in the mathematics of data science, Lecture notes, 2016.
- S. Boucheron, G. Lugosi and P. Massart, Concentration inequalities, Oxford University Press, 2013.
- T. Tao, Topics in random matrix theory, AMS, 2012.
- M. Ledoux, Concentration of measure phenomenon, AMS, 2001.
- Y. Plan, Probability in high dimensions, graduate course at UBC.
- R. Vershynin, High-dimensional probability, graduate course at UM.
- R. van Handel, Probability in high dimension, ORF 570 Lecture notes, Princeton University, 2014.
- D. Chafai, O. Guedon, G. Lecue, A. Pajor, Interactions between compressed sensing, random matrices and high-dimensional geometry, preprint.
- R. Vershynin, Lectures in geometric functional analysis, unpublished, 2009.

- **September 30, 2018.** The book is published in the United States. Buy it on Amazon.
- **August 21, 2018.** The book is in press now. It is going to be released in September. Stay tuned!
- **June 7, 2018.** Minor corrections made in the first proofs.
- **May 25, 2018.** The book is available for pre-order on Amazon.
- **April 18, 2018.** Multiple minor corrections at the copy-editing stage. The book is going to production now.
- **February 9, 2018.** Final polishing (including more references) due to feedback of the readers. The book is about to go into copy-editing.
- **January 23, 2018.** More polishing was done. Many figures look nicer thanks to my student Jennifer Bryson.
- **December 27, 2017.** Thanks to the feedback of the readers, multiple clarifications and corrections were incorporated.
- **August 24, 2017.** The entire book is now ready to be published. The preface may still be expanded a bit, but the technical material is complete.
- **June 8, 2017.** Chapter 11, and thus the whole book, is now polished. I will make one more (third) pass over the book, adding some exercises and the preface.
- **June 7, 2017.** Chapter 10 is now polished. Section 10.5.2 on the restricted isometry property is added.
- **June 2, 2017.** Chapters 8 and 9 are now polished. Section 9.2.3 is added, where we quickly derive Koltchinskii-Lounici bounds on covariance estimation from the matrix deviation inequality.
- **May 23, 2017.** An "Appetizer" is added to the front of the book. It presents the so-called Maurey's empirical method, which is an elegant and elementary application of probability to bound covering numbers of sets. Chapter 7 is now polished.
- **April 27, 2017.** Chapter 6 is now polished.
- **April 20, 2017.** Chapter 5 is now polished. I cleaned up the guarantees of covariance estimation both in this chapter and in Chapter 4, where they appeared earlier.
- **February 23, 2017.** Chapter 4 is now polished. I added an application to error correction codes in Section 4.3 and rewrote the application for covariance estimation in Section 4.7.
- **February 9, 2017.** Chapter 3 is now polished. I added a section (3.7) on kernel methods and Krivine's proof of Grothendieck's inequality, which gives (almost) the best known bound on the constant.
- **January 20, 2017.** Chapter 2 is now polished.
- **January 4, 2017.** Chapter 1 has been polished. The difficulty of exercises will be indicated by the number of coffee cups one may need to solve them.
- **December 21, 2016.** Numerous typos and inaccuracies fixed throughout the book. It was then converted into the publisher's style, which miraculously reduced the number of pages by 50!
- **December 20, 2016.** A short version of this book, condensed into just four lectures, can be found here.
- **November 15, 2016.** Two big sections are added in Chapter 8: VC dimension and applications in statistical learning theory.
- **October 24, 2016.** A few applications are added to Chapter 3: Grothendieck's inequality, semidefinite programming, and maximum cut for graphs.