
Quiz 3, Data Science, Senior Seminar in Mathematics

 

1. Given two models, one being a subset of the other, which method is appropriate for selecting the best model?


a. AUC

b. ROC

c. recall

d. Loglikelihood
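For nested models like those in question 1, one common approach is a likelihood-ratio test on the two fitted log-likelihoods. The sketch below is illustrative only: the log-likelihood values are hypothetical, and it assumes the larger model adds exactly one parameter, so the chi-square tail probability with 1 degree of freedom can be computed with `math.erfc`.

```python
import math

def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Likelihood-ratio test for nested models that differ by one parameter.

    Returns the test statistic and its p-value under a chi-square
    distribution with 1 degree of freedom.
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    # Guard against tiny negative values caused by numerical noise.
    statistic = max(statistic, 0.0)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical log-likelihoods, not taken from any dataset in this quiz.
stat, p = likelihood_ratio_test_1df(loglik_reduced=-120.4, loglik_full=-117.9)
print(f"LR statistic = {stat:.2f}, p-value = {p:.4f}")
```

A small p-value suggests the extra parameter improves the fit by more than chance alone would explain. With more than one added parameter, a general chi-square tail function would be needed instead of the `erfc` shortcut.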

2. Which of the following methods can be used to compare two models that are not subsets of each other?

a. AUC

b. AIC

c. Precision

d. Recall
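As background for question 2, the Akaike information criterion is defined as AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. A minimal sketch follows; the model names, log-likelihoods, and parameter counts are hypothetical placeholders.

```python
def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower values are preferred."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical fitted models: (name, maximized log-likelihood, parameter count).
candidates = [
    ("model_1", -210.3, 4),
    ("model_2", -208.9, 6),
]

scores = {name: aic(loglik, k) for name, loglik, k in candidates}
best = min(scores, key=scores.get)
print(scores, "preferred:", best)
```

The criterion trades goodness of fit against complexity, so a model with a slightly higher likelihood can still lose if it uses more parameters.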

3. Given that the recall of a model is 0.91 and the precision is 0.94, what will the F-score of this model be to 3 decimal places?

a. 0.462

b. 0.925

c. 0.791

d. 0.943
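The F-score in question 3 is usually taken to be the harmonic mean of precision and recall (the F1 score). The sketch below assumes that balanced variant, since the question does not specify a beta; the `beta` parameter is included only to show the general form.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Values from question 3.
print(round(f_score(precision=0.94, recall=0.91), 3))
```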

4. There are generally two types of models, those that are flexible and those that are not. Which of the following distinguishes between the two types?

a. Flexible models require fewer assumptions and inflexible models require stricter assumptions.

b. Flexible models cannot predict outcomes but inflexible models can.

c. Flexible models tend to use more data whereas inflexible models don’t.

d. Inflexible models require fewer assumptions whereas flexible models don’t.

5. The output of a linear regression is given as follows: estimated_weight(lb)=80+15*height(ft). What is the interpretation of the effect of height on weight?

a. A one-foot increase in height will increase the estimated weight by 80 pounds

b. A one-foot increase in height will decrease the estimated weight by 80 pounds

c. A one-foot increase in height will increase the estimated weight by 15 pounds

d. A one-foot increase in height will decrease the estimated weight by 15 pounds
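One way to check an interpretation of the fitted line in question 5 is to evaluate it at two heights one foot apart and compare the estimates. This sketch uses the equation exactly as given; the heights of 5 and 6 feet are arbitrary illustration values.

```python
def estimated_weight_lb(height_ft):
    """Fitted line from question 5: intercept 80, slope 15 per foot."""
    return 80 + 15 * height_ft

# Difference in estimated weight for a one-foot difference in height.
change = estimated_weight_lb(6) - estimated_weight_lb(5)
print(change)
```

Because the model is linear, the same difference results for any pair of heights one foot apart.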

6. From the contingency table below, what is the conditional probability of belonging to Class A given that you’re in stage Green?

 

Stage   Class A   Class B   Class C   Total
Green   100       40        60        200
Blue    30        60        110       200
Total   130       100       170       400

 

 

a. 50%

b. 40%

c. 30%

d. 20%

7. From the contingency table below, what is the conditional probability that a person chosen at random is in stage Blue given that she is in Class C?

 

Stage   Class A   Class B   Class C   Total
Green   100       40        60        200
Blue    30        60        110       200
Total   130       100       170       400

 

 

a. 35%

b. 45%

c. 55%

d. 65%
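Questions 6 and 7 both condition on one margin of the same contingency table: the first on a row (stage), the second on a column (class). A minimal sketch of both calculations, using the counts from the table; the function names are illustrative.

```python
# Counts from the contingency table in questions 6 and 7.
counts = {
    "Green": {"Class A": 100, "Class B": 40, "Class C": 60},
    "Blue": {"Class A": 30, "Class B": 60, "Class C": 110},
}

def p_class_given_stage(cls, stage):
    """P(class | stage): cell count divided by the row total."""
    row_total = sum(counts[stage].values())
    return counts[stage][cls] / row_total

def p_stage_given_class(stage, cls):
    """P(stage | class): cell count divided by the column total."""
    column_total = sum(row[cls] for row in counts.values())
    return counts[stage][cls] / column_total

print(p_class_given_stage("Class A", "Green"))
print(p_stage_given_class("Blue", "Class C"))
```

The denominator is the only thing that changes between the two questions: the row total when conditioning on stage, the column total when conditioning on class.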

8. The output from a logistic regression is given as . Which of the following statements is correct?

a. The odds increase by 40% for every one-centimeter increase in width.

b. The log odds increase by 150% for every one-centimeter increase in width.

c. The odds decrease by 150% for every one-centimeter increase in width.

d. The log odds decrease by 40% for every one-centimeter increase in width.

9. In data science applications, which of the following shows the preferred approach to using models to help increase profit for a business?

a. Gather data, understand business requirements, train model, test model, evaluate model and use model to predict future outcomes.

b. Understand business requirements, gather data, train model, test model, evaluate model and use model to predict future outcomes.

c. Gather data, train model, test model, evaluate model and use model to predict future outcomes, understand business requirements.

d. Gather data, train model, test model, understand business requirements, evaluate model and use model to predict future outcomes.

10. Given that the predicted outcome for an observation is 22.5kg and the actual value for that observation is 24kg, estimate the model’s residual error based on this observation alone.

a. -1.5kg

b. -2.5kg

c. 2.5kg

d. 1.5kg
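A residual is conventionally defined as the observed value minus the predicted value, so its sign shows whether the model under- or over-predicted. The sketch below assumes that convention (some texts use the opposite sign) and plugs in the values from question 10.

```python
def residual(actual, predicted):
    """Residual under the usual convention: observed minus predicted."""
    return actual - predicted

# Values from question 10, in kilograms.
print(residual(actual=24.0, predicted=22.5))
```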

11. Suppose you’re consulting on a data science project where the interest is in identifying customer behavior within a line of business. The business owner knows there are some naturally occurring groupings but is not sure how many there are or what they are. Which of the following methods is most appropriate in this situation?

a. Conduct unsupervised learning to get a description of the underlying groups and label them.

b. Conduct summary analyses to learn the description of the groups and label them.

c. Conduct supervised learning to get a description of the underlying groups and label them.

d. Conduct mean analyses to learn the description of the groups and label them.

12. The key difference between supervised and unsupervised models is that:

a. An unsupervised model’s error can be measured with accuracy, which is not the case for supervised models.

b. Unsupervised models are inflexible but supervised models are not.

c. Supervised models have an outcome or target but unsupervised models do not.

d. Supervised models need a manager or supervisor but unsupervised models do not.

13. The process of transforming features to improve model performance is called:

a. Model transformation

b. Feature engineering

c. Feature transformation

d. Model engineering

14. The likelihood function of a model is

a. A function of the parameters given the data

b. A function of the data given the parameter

c. A function of both the parameter and data

d. A function of the data only

15. In a logistic regression setting, what is the role of the logit link function?

a. To create a target

b. To ensure that the outcome is an odds ratio

c. To map the full range of real values to a probability in [0, 1].

d. To map all target values to a real valued range
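In logistic regression, the logit link relates a probability to the real-valued linear predictor, and its inverse (the logistic function) maps any real value back into [0, 1]. A minimal sketch of both directions, using only the standard library; the function names are illustrative.

```python
import math

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Logistic function: maps any real x to a probability in [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form that avoids overflow in exp().
    e = math.exp(x)
    return e / (1.0 + e)

for value in (-5.0, 0.0, 5.0):
    print(value, inverse_logit(value))
```

Applying `inverse_logit` to the output of `logit` recovers the original probability, which is a quick way to confirm the two functions are inverses.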

16. Assumptions and requirements for conducting a linear regression analysis using machine learning include (select all of the following that apply):

a. A linear relationship exists between dependent and independent variables

b. Any errors are normally distributed

c. The condition of homoscedasticity exists

d. Observations of variables are independent

e. Any relationship between dependent and independent variables can exist

f. All errors should converge to zero

g. The condition of multicollinearity ensures the best analysis

h. Assumptions and requirements do not matter because there is no way to check them

17. In the Titanic dataset provided, how many missing values are there? Enter the number of missing values in the blank in the following sentence.

There are _______ missing values in the Titanic dataset.

18. Nearly all of the missing values in the Titanic dataset are the ages of the passengers. Only one missing value appears outside the “age” variable, and it is a missing “fare”. There are many ways to deal with missing values in datasets, e.g. replacing them with the average value of the variable. If you decide to eliminate observations with missing values for age, what percentage of the total dataset are you eliminating?

a. 20%

b. 50%

c. 5%

d. 80%
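Questions 17 and 18 come down to counting missing entries per column and dividing by the number of rows. The sketch below shows the idea on a tiny hypothetical sample, not the actual Titanic file; the column names `age` and `fare` and the use of `None` for missing entries are assumptions.

```python
# Hypothetical mini-sample; the real dataset and its column names may differ.
rows = [
    {"age": 22.0, "fare": 7.25},
    {"age": None, "fare": 8.05},
    {"age": 35.0, "fare": None},
    {"age": None, "fare": 13.0},
    {"age": 54.0, "fare": 51.86},
]

def missing_counts(records):
    """Count missing (None) entries in each column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

counts = missing_counts(rows)
total_missing = sum(counts.values())
# Share of rows that would be dropped if rows with a missing age were removed.
share_missing_age = sum(r["age"] is None for r in rows) / len(rows)
print(counts, total_missing, f"{share_missing_age:.0%}")
```

If the data is loaded into a pandas DataFrame instead, `df.isna().sum()` gives the same per-column counts.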

19. Based on the Titanic dataset provided, what is the best method to determine how many passengers survived?

a. Classification

b. Linear Regression

c. k-Nearest Neighbors

d. k-Means Clustering

20. A generalized linear model cannot be developed for ordinary least squares linear regression.

a. True

b. False

21. For generalized linear models, the identity link function is typically used for linear regression, the logit link function is typically used for logistic regression, and the probit link function is used to convert probabilities to z-scores.

a. True

b. False

22. Many studies have examined the incidence of diabetes among the Pima Indians. The next four questions involve the data used in these studies and provided for you. First, complete your own analysis of the PimaIndiansDiabetesData using logistic regression. As part of your analysis, create a correlation matrix and a corresponding heat map of the variables in this dataset, and show those results. Which pair of variables has the highest correlation?

a. Age and Pregnancies

b. Age and Blood Pressure

c. Blood Pressure and Glucose

d. Body Mass Index (BMI) and Insulin

23. Using a 75/25 split for your training and test datasets, verify that the categories associated with age have the same distribution in both the complete dataset and in your training dataset. Once you have verified that is true, proceed with your logistic regression. How many variables are statistically significant with an alpha less than or equal to 0.05?

a. 2

b. 5

c. 7

d. 9

24. With a threshold value of 0.5, build a confusion table based on your logistic regression for the Pima Indian Diabetes dataset. What is the accuracy of your model as a percentage?

a. 25%

b. 78%

c. 52%

d. 96%
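The confusion table in question 24 is built by converting each predicted probability into a class label at the threshold and tallying against the true labels. A minimal sketch on hypothetical labels and probabilities (not the Pima data); it assumes a probability equal to the threshold is classified as positive.

```python
def confusion_counts(y_true, probabilities, threshold=0.5):
    """Tally true/false positives and negatives at a given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, prob in zip(y_true, probabilities):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["tn" if actual == 0 else "fn"] += 1
    return counts

def accuracy(counts):
    """Share of all observations that were classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

# Hypothetical labels and predicted probabilities for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3]

table = confusion_counts(y_true, probs, threshold=0.5)
print(table, accuracy(table))
```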

25. Last, build and include a ROC curve for your logistic model of the Pima Indian Diabetes dataset with your quiz answers. Since the area under the curve is known as the “absolute value of the quality of the prediction,” what is the value of the quality of your model and are you “just guessing” or have you come close to a “perfect prediction”?

a. 0.33 – just guessing

b. 0.56 – still just guessing

c. 0.84 – pretty close to a perfect prediction

d. 0.99 – a perfect prediction
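The area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition on hypothetical data; it is quadratic in the sample size and meant for illustration rather than for large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3]
print(roc_auc(y_true, probs))
```

A value near 0.5 corresponds to random guessing and a value of 1.0 to a perfect ranking of positives above negatives.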
