Tutorial on Linear Regression

01 February 2018

I was a section (tutorial) leader for the CS209a Data Science course at Harvard. Here are my notes on Linear Regression. Skip down to the attached notebook to see a fun problem that relates to these notes.

Homework Problem (with solutions)

from io import BytesIO
from zipfile import ZipFile
import urllib.request  # the request submodule must be imported explicitly for urlopen
import os

# Note that you may need to run the following command to install Python Image Library (PIL)
#pip install Pillow
from PIL import Image
import numpy as np
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation has been removed)
from sklearn.model_selection import train_test_split

import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

# starter functions provided to students
def rgb2gray(rgb):
    '''
    function to convert RGB image to gray scale
    accepts 3D numpy array and returns 2D array with same dimensions
    as the first two dimensions of input
    '''

    return np.dot(rgb[...,:3], [0.299, 0.587, 0.114])

def fetch_and_read_data(shape=(50,30)):

    '''
    Download the image data (about 18.4 MB) into a local folder, skipping the download if the folder
    is already present. Read in the images, downsample them to the specified shape (default (50, 30), rows x cols),
    convert them to grayscale and split them into a stratified train/test set, returned as a four-tuple.

    Returns (in this order):
        - 1) training image data (i.e. images that should form the predictor matrix in your solution)
        - 2) testing image data (i.e. the images you should try to classify - note each one forms the response vector in your regression)
        - 3) training image data labels (i.e. integers from 0 to 49 that identify which person each face in (1) belongs to)
        - 4) testing image data labels (i.e. the labels for (2) - this is to allow you to evaluate your model)

    ___________________
    Aside:
    If you want to change the sampling dimensions of your data, pass shape=(x, y) to the function, where
    x is the number of rows and y is the number of columns in the image.
    '''

    if not os.path.exists('./cropped_faces'):
        url = urllib.request.urlopen("http://www.anefian.com/research/GTdb_crop.zip")

        zipfile = ZipFile(BytesIO(url.read()))
        zipfile.extractall()

    data = []
    labels = []

    files = os.listdir('cropped_faces')
    for f in files:
        if '.jpg' in f:
            image = Image.open('cropped_faces/' + f)
            image = image.resize((shape[1], shape[0]))
            data.append(rgb2gray(np.asarray(image)))
            labels.append(int(f.split('_')[0][1:]) - 1)

    data = np.array(data)

    trainX, testX, trainY, testY = train_test_split(data, labels, test_size=0.2, stratify=labels)
    return np.array(trainX), np.array(testX), np.array(trainY), np.array(testY)

# starter code for the students
train_dataset, test_dataset, train_labels, test_labels = fetch_and_read_data()

# code to plot some of the images
fig, axes = plt.subplots(2,4,figsize=(10,5))
axes = axes.flatten()
for ax, image in zip(axes, train_dataset):
    ax.imshow(image, cmap='gray')
plt.show()


APCOMP209a - Homework Question

Read Sections 1 and 2 of this paper.

Briefly, we have a number of cleaned images of people’s faces. The model leverages the concept that “patterns from a single-object class lie on a linear subspace” and the fact that linear regression can be thought of as an orthogonal projection of the response vector (Y) onto the subspace spanned by the columns of the predictor matrix (X).
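The projection view of regression can be checked numerically. The following is a minimal sketch on made-up toy data (the array sizes and the random seed are assumptions for illustration, not part of the homework): it builds the hat matrix H = X (XᵀX)⁻¹ Xᵀ and compares H·y with the fitted values from an ordinary least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 6 "pixels", 2 training vectors as the columns of X
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)

# Hat matrix: projects any vector onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

# The same fitted values come out of an ordinary least-squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual is orthogonal to every column of X
residual = y - y_hat
```

Because H is a projection, applying it twice changes nothing (H·H = H), and the residual y - H·y is orthogonal to the columns of X.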

Question 1

Question 1a

As discussed in the linked paper, our data consists of face images belonging to different people. Going with the assumption that patterns (faces) of one type (from one person) lie on the same linear subspace, let’s try to classify some unknown faces using the method presented in the paper. In other words, construct one hat (H) matrix per person from that person’s known faces (make sure you follow the column concatenation step described in the paper to convert each image into a vector), and project each unknown face vector onto the subspace defined by each H matrix. The person whose projection lies at the smallest distance from the original vector is the predicted class.
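The nearest-subspace idea can be sketched on synthetic data before touching the faces. This is only an illustration under stated assumptions: the helper names `hat_matrix` and `classify` are hypothetical, the three "classes" are random matrices rather than images, and the pseudo-inverse is swapped in for the explicit inverse for numerical stability.

```python
import numpy as np

def hat_matrix(X):
    """Projection matrix onto the column space of X (columns = one class's training vectors)."""
    # X @ pinv(X) equals X (X^T X)^{-1} X^T when X has full column rank
    return X @ np.linalg.pinv(X)

def classify(y, hat_matrices):
    """Return the index of the class whose projection of y is closest to y."""
    distances = [np.linalg.norm(y - H @ y) for H in hat_matrices]
    return int(np.argmin(distances))

# Hypothetical toy data: 3 classes, each with 4 training vectors of length 20
rng = np.random.default_rng(1)
class_matrices = [rng.normal(size=(20, 4)) for _ in range(3)]
hats = [hat_matrix(X) for X in class_matrices]

# A test vector built from class 2's columns plus a little noise
y_test = class_matrices[2] @ rng.normal(size=4) + 0.01 * rng.normal(size=20)
predicted = classify(y_test, hats)
```

For the real data, each class matrix would hold one person's vectorised training faces as columns, and `y_test` would be a vectorised test face.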

Notes:
- Use the provided code to download and re-sample the dataset.
- Follow the normalisation step in the paper to ensure the “maximum pixel value is 1”.
- Your classifier should have approximately an 80% accuracy.
- Use the image plotting library of matplotlib to display one (or two) correctly classified faces and the known faces.
- Use the image plotting library of matplotlib to display one (or two) incorrectly classified faces and the known faces.
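One possible reading of the column concatenation and normalisation steps is sketched below. The function name `image_to_vector` is hypothetical, and the handling of an all-black image is an assumption made only to avoid dividing by zero.

```python
import numpy as np

def image_to_vector(image):
    """Stack the columns of a 2D grayscale image into one vector, scaled so its maximum is 1."""
    # order='F' flattens column by column, i.e. column concatenation
    vec = np.asarray(image, dtype=float).flatten(order='F')
    peak = vec.max()
    # Assumption: an all-black image is returned unscaled
    return vec / peak if peak > 0 else vec

# Hypothetical 3x2 "image"
toy = np.array([[10, 40],
                [20, 50],
                [30, 60]])
vec = image_to_vector(toy)
```

Applied to every image in `train_dataset` and `test_dataset`, this would turn each 50x30 array into a vector of length 1500.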

Text answer required:
- Provide a reasonable explanation for the misclassified faces.

Question 1b - Significant Faces

Select an example of a correctly classified face. Use statsmodels to investigate the most predictive columns (faces) that the model used in this regression:

(i) Which columns (i.e. faces) make the highest contribution to the projection?

(ii) Which columns (i.e. faces) are the least useful in making this projection?
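To see what kind of evidence answers (i) and (ii), here is a hand-rolled sketch in numpy of the coefficient estimates and t-statistics for a no-intercept regression. In statsmodels, the fitted result of `statsmodels.api.OLS(y, X).fit()` exposes the corresponding values as `params` and `tvalues`. The helper name `ols_t_statistics` and the toy data are assumptions for illustration only.

```python
import numpy as np

def ols_t_statistics(X, y):
    """OLS coefficients and t-statistics for y ~ X (no intercept), computed by hand."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    residual = y - X @ beta
    sigma2 = residual @ residual / (n - p)        # unbiased estimate of the noise variance
    std_err = np.sqrt(sigma2 * np.diag(XtX_inv))  # standard error of each coefficient
    return beta, beta / std_err

# Hypothetical toy data: 5 columns ("faces"); only columns 0 and 3 generate y
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=200)

beta, t_stats = ols_t_statistics(X, y)
ranking = np.argsort(-np.abs(t_stats))  # columns ordered from most to least influential
```

Columns with large absolute t-statistics contribute most to the projection, while those with t-statistics near zero add little. On the face data, X would hold one person's vectorised training faces and y a correctly classified test face.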