(a) (15 points) The simplest version of randomized response involves flipping a single fair coin (50% probability of heads and 50% probability of tails). Suppose an individual is asked a potentially incriminating question and flips a coin before answering. If the coin comes up tails, he answers truthfully; otherwise, he answers yes. Is this mechanism differentially private? If so, what epsilon value does it achieve? Carefully justify your answer.

[March 28 Note: This mechanism is different from the mechanism we discussed in class.]
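
To make the mechanism concrete before reasoning about it, here is a minimal simulation sketch. The function names are illustrative and not part of the assignment. It estimates how often a respondent reports "yes" for each possible true answer; comparing these conditional probabilities for every possible output is a natural starting point for the justification the question asks for.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Return the reported answer (True means "yes") for one respondent."""
    heads = rng.random() < 0.5  # fair coin: heads with probability 0.5
    if heads:
        return True   # heads: always answer yes
    return truth      # tails: answer truthfully


def estimate_yes_rate(truth: bool, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate P(report yes | true answer) by simulation."""
    rng = random.Random(seed)
    yes_count = sum(randomized_response(truth, rng) for _ in range(trials))
    return yes_count / trials


if __name__ == "__main__":
    for truth in (True, False):
        rate = estimate_yes_rate(truth)
        print(f"true answer = {truth}: estimated P(report yes) = {rate:.3f}")
```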

Problem 3 (35 points): Privacy-preserving synthetic data

In this problem, you will take on the role of a data owner who owns two sensitive datasets, called hw_compas and hw_fake, and is preparing to release differentially private synthetic versions of these datasets.

The first dataset, hw_compas, is a subset of the dataset released by ProPublica as part of their COMPAS investigation. The hw_compas dataset has attributes age, sex, score, and race, with the following domains of values: age is an integer between 18 and 96, sex is one of Male or Female, score is an integer between -1 and 10, race is one of ‘Other’, ‘Caucasian’, ‘African-American’, ‘Hispanic’, ‘Asian’, or ‘Native American’.

The second dataset, hw_fake, is a synthetically generated dataset. We call this dataset fake rather than synthetic because you will be using it as input to a privacy-preserving data generator. We will use the term synthetic to refer to privacy-preserving datasets that are produced as output of a data generator.

We generated the hw_fake dataset by sampling from the following Bayesian network:

In this Bayesian network, parent_1, parent_2, child_1, and child_2 are random variables. Each of these variables takes on one of three values {0, 1, 2}. Variables parent_1 and parent_2 take on each of the possible values with equal probability. Values are assigned to these random variables independently. Variables child_1 and child_2 take on the value of one of their parents. Which parent's value the child takes on is chosen with equal probability.
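
The sampling process described above can be sketched as follows. This is an illustration rather than the code used to produce hw_fake. It assumes that both parent_1 and parent_2 are parents of each child, and that the two children pick a parent independently of each other; the function names are hypothetical.

```python
import random

VALUES = (0, 1, 2)


def sample_row(rng: random.Random) -> dict:
    """Draw one record from the network described in the text."""
    # Parents are independent and uniform over {0, 1, 2}.
    parent_1 = rng.choice(VALUES)
    parent_2 = rng.choice(VALUES)
    # Assumption: each child copies one of the two parents, chosen with
    # equal probability, independently of the other child.
    child_1 = rng.choice((parent_1, parent_2))
    child_2 = rng.choice((parent_1, parent_2))
    return {
        "parent_1": parent_1,
        "parent_2": parent_2,
        "child_1": child_1,
        "child_2": child_2,
    }


def sample_dataset(n: int, seed: int = 0) -> list:
    """Draw n independent records."""
    rng = random.Random(seed)
    return [sample_row(rng) for _ in range(n)]


if __name__ == "__main__":
    for row in sample_dataset(5):
        print(row)
```

Under these assumptions the parents are independent of each other, while each child is correlated with both parents and with the other child.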

To start, use the Data Synthesizer library to generate 4 synthetic datasets for each sensitive dataset, hw_compas and hw_fake (8 synthetic datasets in total), each of size N=10,000, using the following settings:

A: random mode
B: independent attribute mode with epsilon = 0..
C: correlated attribute mode with epsilon = 0.1, with Bayesian network degree k=1
D: correlated attribute mode with epsilon = 0.1, with Bayesian network degree k=2

For guidance, you can use the HW2_Template here. We have provided the code to generate the 4 synthetic datasets for you. Please make sure to duplicate this file rather than write your code directly here.

(a) (15 points): Execute the following queries on synthetic datasets and compare the results to those on the corresponding real datasets:

• Q1 (hw_compas only): Execute basic statistical queries over synthetic datasets. The hw_compas dataset has numerical attributes age and score. Calculate the median, mean, min, and max of age and score for the synthetic datasets generated with settings A, B, C, and D (described above). Compare to the ground truth values, as computed over hw_compas. Present results in a table. Discuss the accuracy of the different methods in your report. Which methods are accurate and which are less accurate? If there are substantial differences in accuracy between methods, explain these differences.
• Q2 (hw_compas only): Compare how well random mode (A) and independent attribute mode (B) replicate the original distribution. Plot the distributions of values of the age and sex attributes in hw_compas and in synthetic datasets generated under settings A and B. Compare the histograms visually and explain the results in your report. Next, compute cumulative measures that quantify the difference between the probability distributions over age and sex in hw_compas vs. in privacy-preserving synthetic data. To do so, use the two-sample Kolmogorov-Smirnov test (KS test) for the numerical attribute and Kullback-Leibler divergence (KL-divergence) for the categorical attribute, using the provided functions ks_test and kl_test. Discuss the relative difference in performance under A and B in your report.
• Q3 (hw_fake only): Compare the accuracy of correlated attribute mode with k=1 (C) and with k=2 (D). Display the pairwise mutual information matrix as heatmaps, showing mutual information between all pairs of attributes, in hw_fake and in the two synthetic datasets (generated under C and D). Discuss your observations in your report, noting how well or how badly mutual information is preserved in synthetic data.
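
For Q3, it may help to see what the pairwise mutual information matrix contains. The sketch below computes plain (unnormalized) mutual information in nats from empirical counts. The function names are illustrative, and the provided template may compute a different variant (for example, a normalized one), so treat this as a reference for the definition rather than a replacement for the template code.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two equal-length discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y).
        mi += (count / n) * math.log(count * n / (count_x[x] * count_y[y]))
    return mi


def pairwise_mutual_information(columns):
    """Map each unordered pair of column names to its mutual information."""
    return {
        (a, b): mutual_information(columns[a], columns[b])
        for a, b in combinations(list(columns), 2)
    }


if __name__ == "__main__":
    # Toy columns (not assignment data): "copy" duplicates "base".
    toy = {
        "base": [0, 1, 2] * 4,
        "copy": [0, 1, 2] * 4,
        "other": [0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 1, 2],
    }
    for pair, value in pairwise_mutual_information(toy).items():
        print(pair, round(value, 4))
```

The resulting dictionary can be arranged into a square matrix, one row and column per attribute, before plotting it as a heatmap.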

(b) (10 points, hw_compas only): Study the variability in the mean and median of age for synthetic datasets generated under settings A, B, and C.

To do this, fix epsilon = 0.1 and generate 10 synthetic datasets for each setting (by specifying different seeds). Calculate the mean and median of age for each of the 10 datasets. Plot the 10 median values and the 10 mean values using a box-and-whiskers plot. Compare these metrics to the ground truth median and mean from the real data. Carefully explain your observations: which mode gives more accurate results and why? In which cases do we see more or less variability?

Specifically, for the box-and-whiskers plots, we expect to see two subplots: one for the mean and one for the median, with the three settings (A, B, and C) along the X-axis and age on the Y-axis. You should include these plots in your report.
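
As a rough sketch of the bookkeeping behind these plots, the snippet below collects one mean and one median per seed. The toy_ages function is a hypothetical stand-in that draws uniform integer ages; in your solution, each list of ages would instead come from a synthetic dataset produced by the Data Synthesizer under a given setting and seed.

```python
import random
import statistics


def per_seed_stats(age_columns):
    """Return (means, medians), with one value per synthetic dataset."""
    means = [statistics.mean(ages) for ages in age_columns]
    medians = [statistics.median(ages) for ages in age_columns]
    return means, medians


def toy_ages(seed, n=10_000):
    """Hypothetical stand-in for one synthetic age column (uniform on 18..96)."""
    rng = random.Random(seed)
    return [rng.randint(18, 96) for _ in range(n)]


if __name__ == "__main__":
    age_columns = [toy_ages(seed) for seed in range(10)]
    means, medians = per_seed_stats(age_columns)
    print("range of means:  ", min(means), "to", max(means))
    print("range of medians:", min(medians), "to", max(medians))
```

Repeating this for each setting gives three lists of 10 means and three lists of 10 medians, which is the shape of input a box-plot routine such as matplotlib's boxplot accepts.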

(c) (10 points, hw_compas only): Study how well statistical properties of the data are preserved as a function of the privacy budget, epsilon. To see robust results, execute your experiment with 10 different synthetic datasets (with different seeds) for each value of epsilon, for each data generation setting (B, C, and D). Specifically, you should:

• Compute the KL-divergence over the attribute race in hw_compas. For each setting (B, C, and D), vary epsilon from 0.02 to 0.1 in increments of 0.02. Specifically, the epsilons are [0.02, 0.04, 0.06, 0.08, 0.1]. In total, you should generate 3*10*5 synthetic datasets and calculate the KL-divergence for race in each dataset. Create three box-and-whiskers plots, one for each setting (B, C, D). Each plot should have epsilon on the X-axis and KL-divergence on the Y-axis. Discuss your findings in the report and include your plots.
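
For reference, the quantity being plotted can be sketched as below. This is an illustrative implementation of KL-divergence between two categorical columns, not the provided kl_test function, whose smoothing and log base may differ. The small smoothing constant is an assumption added so that categories missing from the synthetic data do not produce a division by zero. The epsilon grid follows the range stated above (0.02 to 0.1 in steps of 0.02).

```python
import math
from collections import Counter


def kl_divergence(real, synthetic, smoothing=1e-9):
    """KL(P_real || P_synthetic) in nats over the union of observed categories."""
    categories = sorted(set(real) | set(synthetic))
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    # Assumption: add a tiny constant to every count, then renormalize.
    p = [real_counts[c] + smoothing for c in categories]
    q = [synth_counts[c] + smoothing for c in categories]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )


# Epsilon grid: 0.02 to 0.1 in increments of 0.02.
EPSILONS = [round(0.02 * i, 2) for i in range(1, 6)]

if __name__ == "__main__":
    # Toy columns (not assignment data).
    real = ["Caucasian"] * 6 + ["Hispanic"] * 3 + ["Asian"]
    synthetic = ["Caucasian"] * 4 + ["Hispanic"] * 4 + ["Asian"] * 2
    print("epsilons:", EPSILONS)
    print("KL-divergence on toy data:", kl_divergence(real, synthetic))
```

In the experiment, one such value would be computed per synthetic dataset, then grouped by epsilon to form the boxes in each plot.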