# week8 homework v1

• Correctness of answer(s) to the questions.
• Correctness of SQL, including adherence to style.

The results should be returns on an Excel spreadsheet with each question having one worksheet. The answers should include:

• The answer to the question
• One or more sentences describing interesting observations about the results (are the values increasing, decreasing, expected, unexpected, and so on).
• The query used to generate the results. This should be formatted using a fixed width-font such as Courier New so it is clearly indented correctly.
• Results in Excel should be formatted using commas, dollar signs, and percent signs as appropriate.

Question 1 (1 point):

The idea in this question is to build a lift chart – but without building a model. The “model score” is the total population in each zip code in ZipCensus.

Create a lift table showing the proportion of the total population by decile. The horizontal axis will split the zip codes into ten equal sized groups. The vertical axis will have the proportion of the total population in the decile.

Extra credit:  Show this as a pretty chart as well.

Question 2 (2 point):
You are working for a solar energy company that wants to investigate solar energy.  You want to find 10 zip codes most similar to existing zip codes with solar households for your marketing efforts.

What 10 zip codes are most similar to the zip codes that have housing units powered by solar energy (hhpctfsolar)?  What is the similarity using Euclidean distance on standardized variables

For the purpose of this question, use the following variables:
• Household median income
• Proportion of families earning over \$100k
• Latitude
• Longitude
• Density (population per land square mile)
• Proportion of single family homes (pctunits1detached)

Use the averages of these values for zip codes that have any housing units powered by solar. Do you see any downsides to this approach?

Question 3 (2 points):

Create a look-up model that predicts the time between the order and the last shipment date based on orders shipped before 2016.  Provide a table that compares the predicted results and the actual results for orders made in 2016 by product group and number of items in the order. Explain the dimensions that you are choosing for the lookup model.

The look-up model should use at least two dimensions.

Question 4 (2 points):

Based on the subscribers data, what is the probability of a customer lasting for one year based on market, channel, rate plan, and month of start (12 months, January – December)?  Use a lookup model based on starts in 2004 and 2005 and eliminate cells that have fewer than 100 starts. Compare the predicted values to the actual values in 2006.

Provide a table ordered from the most probable to the least probable for all combinations of market, channel, rate plan, and month of start, along with the predicted one-year retention rate and the actual for 2006.  The table should have 3*4*3*12 rows.

Question 5 (3 points):

Based on the subscribers data, what is the probability of a customer lasting for one year based on market, channel, rate plan, and month of start?  Use a Naive Bayesian model based on starts in 2004 and 2005. Then create a table with ROC values by model decile and calculate the AUC for this and for the model in the previous question.

The table should order the groups from the most probable to the least probable and include the estimated retention rate and the actual results for 2006.

Are the results similar or different from the lookup model?

Hint: A fine approximation of AUC is simply the sum of the ROC values. ROC is discussed starting on page 540. Naïve Bayesian Models are explained starting on page 546.

Extra credit:  Compare the two models as a pretty picture.