Week2 HW

Libraries

library(tidyverse)
library(GGally)
library(ISLR)
library(boot)

R for Data Science Problems

7.3.4.2 Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

ggplot(data = diamonds) +
  geom_histogram(aes(price))

The distribution is right-skewed: most diamonds are inexpensive, with a long tail of high prices.

ggplot(data = diamonds) +
  geom_histogram(aes(price), binwidth = 100)

With binwidth = 100 a gap appears around $1,500, where almost no diamonds are priced. Why? The cheapest diamond costs $326; why are so few diamonds priced near that minimum?
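To look at the gap more closely, it may help to count diamonds in narrow price bands and to zoom the histogram. This is only a sketch: the $50 band width, the $1,000 to $2,000 window, and the name gap_counts are choices made here for illustration.

```r
library(ggplot2)

# Cheapest diamond in the data set.
min(diamonds$price)

# Count diamonds in $50 bands between $1,000 and $2,000 to locate the gap.
# (base::cut() bins a numeric vector; it is unrelated to the `cut` column.)
gap_counts <- table(cut(diamonds$price, breaks = seq(1000, 2000, by = 50)))
gap_counts

# Zoomed histogram with a narrow binwidth.
ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 10) +
  coord_cartesian(xlim = c(1000, 2000))
```

coord_cartesian() zooms without dropping rows, so the bin counts are the same as in the full plot.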

7.4.1.1 What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

A bar chart counts missing values, while a histogram removes them. In a histogram, x must be numeric, so an NA cannot be placed in any bin and is dropped (with a warning). A bar chart takes categorical values and treats NA as just another category.

7.4.1.2 What does na.rm = TRUE do in mean() and sum()? Calculate the mean and sum without the missing values.
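With na.rm = TRUE, mean() and sum() drop missing values before computing; without it, a single NA makes the result NA. A small illustration, using a made-up vector x:

```r
x <- c(2, 4, NA, 6)   # hypothetical vector with one missing value

mean(x)               # NA: the missing value propagates
sum(x)                # NA

mean_no_na <- mean(x, na.rm = TRUE)  # mean of 2, 4, 6
sum_no_na  <- sum(x, na.rm = TRUE)   # sum of 2, 4, 6
mean_no_na
sum_no_na
```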

7.5.1.5 Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram(binwidth = 100) +
  facet_wrap(~ cut, nrow = 2)

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_violin()

| Geom | Pros | Cons |
|---|---|---|
| geom_violin() | Easy to compare and no transformation needed | None apparent |
| facetted geom_histogram() | Easier to see individual distributions | Hard to compare different distributions |
| coloured geom_freqpoly() | Easy to compare if we change y to density | Hard to see differences when just looking at count |

ISLR Problems

4.7.5

We now examine the differences between LDA and QDA.

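(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

QDA on the training set, because its extra flexibility fits the training data more closely. LDA on the test set, because that extra flexibility only adds variance when the true boundary is linear.

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

QDA on both.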

(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

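Improve. QDA is more flexible than LDA and therefore has higher variance; a larger sample reduces that variance, so the test accuracy of QDA relative to LDA is expected to improve as n grows. How large the gain is depends on how non-linear the underlying decision boundary is.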

4.7.6

Suppose we collect data for a group of students in a statistics class with variables X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients β̂0 = −6, β̂1 = 0.05, β̂2 = 1.

(a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.

exp(-6 +40*.05+3.5*1)/(1+exp(-6 +40*.05+3.5*1))
[1] 0.3775407

(b) How many hours would the student in part (a) need to study to have a 50 % chance of getting an A in the class?

exp(-6 +50*.05+3.5*1)/(1+exp(-6 +50*.05+3.5*1))
[1] 0.5
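The 50 hours plugged in above can be derived rather than guessed. A probability of 0.5 corresponds to log-odds of 0, so the linear predictor must equal zero: -6 + 0.05 * hours + 3.5 = 0. A minimal sketch, where b0, b1, b2 and gpa are just names for the numbers used in part (a):

```r
b0 <- -6; b1 <- 0.05; b2 <- 1   # coefficients used in part (a)
gpa <- 3.5

# qlogis(0.5) is the log-odds of 0.5, i.e. 0.
# Solve b0 + b1 * hours + b2 * gpa = qlogis(0.5) for hours.
hours <- (qlogis(0.5) - b0 - b2 * gpa) / b1
hours

# Check with plogis(), the inverse logit.
plogis(b0 + b1 * hours + b2 * gpa)
```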

5.4.2 We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations.

(a) What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.

1 - 1/n. Each of the n observations is equally likely to be drawn, so there is a 1/n chance of drawing the jth one.

(b) What is the probability that the second bootstrap observation is not the jth observation from the original sample?

1 - 1/n. Sampling is with replacement, so the second draw has the same probabilities as the first.

(c) Argue that the probability that the jth observation is not in the bootstrap sample is (1 − 1/n)^n.

Each of the n bootstrap draws misses the jth observation with probability 1 - 1/n, and the draws are independent because sampling is with replacement. The probability that all n draws miss it is therefore the product (1 - 1/n)^n.

(d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?

(1-1/5)^5 is the probability that it isn’t in the bootstrap sample, so the probability that it is in the sample is the complement:

1-(1-1/5)^5
[1] 0.67232

(e) When n = 100, what is the probability that the jth observation is in the bootstrap sample?

1-(1-1/100)^100
[1] 0.6339677
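As n grows, (1 - 1/n)^n approaches exp(-1), so the inclusion probability settles near 1 - 1/e, roughly 0.632. The sketch below evaluates the formula for a few n and checks the n = 100 case by simulation; the helper name p_in and the 10,000 replications are choices made for this illustration.

```r
# Probability that observation j appears in a bootstrap sample of size n.
p_in <- function(n) 1 - (1 - 1 / n)^n

p_in(c(5, 100, 10000))
1 - exp(-1)   # limiting value as n grows

# Monte Carlo check for n = 100: draw n indices with replacement and
# record whether observation 1 is among them.
set.seed(1)
n <- 100
hits <- replicate(10000, 1 %in% sample(n, n, replace = TRUE))
mean(hits)
```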

5.4.5 In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

(a) Fit a logistic regression model that uses income and balance to predict default.

logit = glm(default~income+balance, data = Default, family = binomial)
logit

Call:  glm(formula = default ~ income + balance, family = binomial, 
    data = Default)

Coefficients:
(Intercept)       income      balance  
 -1.154e+01    2.081e-05    5.647e-03  

Degrees of Freedom: 9999 Total (i.e. Null);  9997 Residual
Null Deviance:      2921 
Residual Deviance: 1579     AIC: 1585

(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps: i. Split the sample set into a training set and a validation set.
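One way the validation set approach could look is sketched below. The 50/50 split, the seed, the 0.5 classification threshold, and the names train, fit, prob, pred and val_error are all assumptions made for this sketch, not values taken from the exercise.

```r
library(ISLR)

set.seed(1)
n <- nrow(Default)
train <- sample(n, n / 2)   # row indices of the training half (assumed 50/50 split)

# Fit the model on the training rows only.
fit <- glm(default ~ income + balance, data = Default[train, ],
           family = binomial)

# Predicted probability of default on the held-out rows,
# classified as "Yes" above an assumed threshold of 0.5.
prob <- predict(fit, newdata = Default[-train, ], type = "response")
pred <- ifelse(prob > 0.5, "Yes", "No")

# Validation set error: fraction of held-out rows misclassified.
val_error <- mean(pred != Default$default[-train])
val_error
```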

(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
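Parts (c) and (d) can share one helper that takes a model formula and a seed, so both models are evaluated on identical splits. This is a sketch under assumptions: the helper name val_error, the seeds 1 to 3, the 50/50 split, and the 0.5 threshold are all illustrative choices.

```r
library(ISLR)

# Validation-set error for a given model formula and seed (hypothetical helper).
val_error <- function(formula, seed) {
  set.seed(seed)
  n <- nrow(Default)
  train <- sample(n, n / 2)
  # Subset the data directly; a `subset =` argument would be looked up in the
  # formula's environment, where `train` is not visible.
  fit <- glm(formula, data = Default[train, ], family = binomial)
  prob <- predict(fit, newdata = Default[-train, ], type = "response")
  mean(ifelse(prob > 0.5, "Yes", "No") != Default$default[-train])
}

seeds <- 1:3
errors <- data.frame(
  seed = seeds,
  without_student = sapply(seeds, function(s) val_error(default ~ income + balance, s)),
  with_student    = sapply(seeds, function(s) val_error(default ~ income + balance + student, s))
)
errors
```

Because each seed produces the same split for both formulas, any difference within a row reflects the student dummy rather than the split. Variation down a column shows how much the estimate depends on the split alone.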

5.4.6 We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

Default

(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

model1 = glm(default~income + balance,data = Default, family = "binomial")
summary(model1)$coefficients[,2]
 (Intercept)       income      balance 
4.347564e-01 4.985167e-06 2.273731e-04 

(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

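boot.fn = function(data, index){
  # data[index, ] already selects the resampled rows; subset = index then subsets them a second time
  model = glm(default ~ income + balance, data = data[index, ], family = "binomial", subset = index)
  # column 2 of the coefficient table holds the standard errors, not the estimates (column 1)
  return(summary(model)$coefficients[-1, 2])
}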

boot.fn(Default, 1:10000)
      income      balance 
4.985167e-06 2.273731e-04 

(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

set.seed(1)
boot(data = Default, statistic = boot.fn, R = 100) # Am I reading this correctly?

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Default, statistic = boot.fn, R = 100)


Bootstrap Statistics :
        original       bias     std. error
t1* 4.985167e-06 3.745402e-08 1.972904e-07
t2* 2.273731e-04 1.662875e-06 1.602112e-05

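(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

The "original" column matches the glm() standard errors exactly (4.985167e-06 for income, 2.273731e-04 for balance), but only by construction: boot.fn() returns the standard-error column of the coefficient table, so boot() resampled the standard errors rather than the coefficients. The "std. error" column (1.972904e-07 and 1.602112e-05) is therefore the bootstrap variability of those standard-error estimates, not a bootstrap estimate of the coefficient standard errors. A valid comparison needs boot.fn() to return the coefficient estimates, after which the "std. error" column can be compared with the glm() values.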
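Since boot.fn() above returns column 2 of the coefficient table (the standard errors), a variant that returns the coefficient estimates is needed for the comparison the exercise asks for. The sketch below is one possible version; the names boot.fn.coef, boot_out, boot_se and glm_se are hypothetical, and no output is shown because it has not been run here.

```r
library(ISLR)
library(boot)

# Variant of boot.fn() that returns the coefficient estimates (not their
# standard errors) for the rows selected by `index`.
boot.fn.coef <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data[index, ], family = binomial)
  coef(fit)[-1]   # drop the intercept, keep income and balance
}

set.seed(1)
boot_out <- boot(data = Default, statistic = boot.fn.coef, R = 100)

# Bootstrap standard errors: standard deviation of the R replicates.
boot_se <- apply(boot_out$t, 2, sd)

# Formula-based standard errors from glm() for comparison.
glm_se <- summary(glm(default ~ income + balance, data = Default,
                      family = binomial))$coefficients[-1, 2]

rbind(bootstrap = boot_se, glm = glm_se)
```

Printing boot_out would report the same bootstrap standard errors in its "std. error" column; with only R = 100 replicates they are themselves fairly noisy.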

