Notes on: Introducing statistical learning

A few months ago I read chapter 1 of “An Introduction to Statistical Learning” (ISL) by James, Witten, Hastie and Tibshirani. These are my notes.

Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Supervised statistical learning involves building a statistical model for predicting an output based on one or more inputs. With unsupervised statistical learning, there are inputs but no supervising output; nevertheless, the goal is still to learn the relationships and structure in such data.
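To make the distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy numbers are made up, and the helper names fit_line and split_at_largest_gap are hypothetical, not something from the book. The supervised part sees inputs together with an output and fits a line by least squares; the unsupervised part sees inputs only and looks for structure, here by splitting the values at the widest gap.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised: least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def split_at_largest_gap(xs):
    """Unsupervised: split 1-D values into two groups at the widest gap."""
    s = sorted(xs)
    cut = max(range(1, len(s)), key=lambda i: s[i] - s[i - 1])
    return s[:cut], s[cut:]


# Supervised: each input comes with an observed output (made-up data).
years_of_education = [8, 10, 12, 14, 16]
wage = [21.0, 25.0, 29.0, 33.0, 37.0]
intercept, slope = fit_line(years_of_education, wage)
predicted = intercept + slope * 18  # prediction for an unseen input

# Unsupervised: only inputs, no output to predict (made-up data).
measurements = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]
low, high = split_at_largest_gap(measurements)
```

In the first case the output column tells the method what "good" means, namely small prediction error. In the second case there is nothing to check the grouping against, so the result depends on the notion of similarity we choose.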

Some examples of problems that statistical learning techniques try to solve are:

  • Regression problems with a quantitative output: predicting the wage of each individual in a sample from other variables such as age, calendar year, and education
  • Classification problems with a qualitative or categorical output: analyzing the daily movements of the Standard & Poor’s 500 (S&P) stock index over a 5-year period between 2001 and 2005, to predict whether the index will increase or decrease on a given day using the past 5 days’ percentage changes in the index
  • Clustering problems: analyzing gene expression measurements for cell lines. Instead of predicting a particular output variable, the goal is determining whether there are clusters among the cell lines based on their gene expression measurements. In these cases we can try to represent the cells in a limited number of dimensions and then find similarities between the cells so that we can group them in clusters. The accuracy of this technique is related to the selection of the appropriate dimensions
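The classification and clustering examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data points are made up, the function names are hypothetical, and the two methods (a nearest-centroid classifier and a basic k-means loop) are simple stand-ins that I picked for brevity, not the methods the book applies to these data sets.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(mean(axis) for axis in zip(*points))


def fit_nearest_centroid(rows, labels):
    """Classification: store one centroid per class label."""
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    return {label: centroid(pts) for label, pts in by_label.items()}


def classify(model, row):
    """Predict the label whose centroid is closest to the row."""
    return min(model, key=lambda label: dist(model[label], row))


def k_means(points, centers, iterations=10):
    """Clustering: alternate assignment and centroid update (iterations >= 1)."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(centers[i], p))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Classification: made-up percentage changes for the previous two days,
# labelled with the direction of the index on the following day.
lagged_changes = [(0.5, 0.3), (0.7, 0.1), (-0.4, -0.6), (-0.2, -0.8)]
direction = ["Up", "Up", "Down", "Down"]
model = fit_nearest_centroid(lagged_changes, direction)
guess_up = classify(model, (0.4, 0.2))
guess_down = classify(model, (-0.5, -0.5))

# Clustering: made-up cell lines already reduced to two dimensions,
# with no labels at all. Initial centers are chosen by hand here.
cell_lines = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
              (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centers, groups = k_means(cell_lines, [cell_lines[0], cell_lines[3]])
```

The toy data is deliberately well separated, so both methods look better here than they would on real data. Note also that k-means depends on the starting centers and on which dimensions the points are expressed in, which echoes the remark above about selecting appropriate dimensions.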

The term statistical learning is fairly new. Here are some important events in the evolution of the field:

  • Early nineteenth century: Legendre and Gauss published papers on the method of least squares for predicting quantitative values
  • 1936: Fisher introduced linear discriminant analysis to predict qualitative variables
  • 1940s: various authors developed logistic regression
  • Early 1970s: Nelder and Wedderburn coined the term generalized linear models for an entire class of statistical learning methods that include both linear and logistic regression as special cases
  • Mid 1980s: Breiman, Friedman, Olshen and Stone introduced classification and regression trees, and were among the first to demonstrate the power of a detailed practical implementation of a method, including cross-validation for model selection
  • 1986: Hastie and Tibshirani coined the term generalized additive models for a class of non-linear extensions to generalized linear models, and also provided a practical software implementation

Lessons learned:

  • Statistical learning refers to tools for understanding data
  • These tools can be classified as supervised or unsupervised
  • Supervised SL involves building a statistical model for predicting or estimating an output given some inputs
  • In unsupervised SL there are inputs but no supervising output. The goal is to learn relationships and structure from the data

In which fields would you like to apply statistical learning tools?
