Commit f8aeaaec authored by Liwen Huang

Upload New File

parent 9dfd7f34
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 3\n",
"A correlation is a statistical relationship between two variables. If we wanted to know if vaccines work, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease [1]. In this question, you are to see if there is a correlation between having had chickenpox and the number of chickenpox (varicella) vaccine doses given.\n",
"\n",
"Some notes on interpreting the answer. The `had_chickenpox_column` is either 1 (for yes) or 2 (for no), and the `num_chickenpox_vaccine_column` is the number of doses of the varicella vaccine a child has been given. A positive correlation (i.e., `corr > 0`) means that higher values of `had_chickenpox_column` (more no's) go together with higher values of `num_chickenpox_vaccine_column` (more doses of vaccine). A negative correlation (i.e., `corr < 0`) indicates that having had chickenpox is associated with a higher number of vaccine doses.\n",
"\n",
"Also, `pval` is the probability of observing a correlation between `had_chickenpox_column` and `num_chickenpox_vaccine_column` at least as strong as the one measured if the two columns were actually uncorrelated. A small `pval` means that the observed correlation is highly unlikely to have occurred by chance. In this case, `pval` should be very small (it will end in e-18).\n",
"\n",
"[1] This isn't really the full picture, since we are not looking at when the dose was given. It's possible that children had chickenpox and then their parents went to get them the vaccine. Does this dataset have the data we would need to investigate the timing of the dose?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def corr_chickenpox():\n",
"    import scipy.stats as stats\n",
"    import numpy as np\n",
"    import pandas as pd\n",
"    \n",
"    # this is just an example dataframe\n",
"    df=pd.DataFrame({\"had_chickenpox_column\":np.random.randint(1,3,size=(100)),\n",
"                   \"num_chickenpox_vaccine_column\":np.random.randint(0,6,size=(100))})\n",
"\n",
"    # here is some stub code to actually run the correlation\n",
"    corr, pval=stats.pearsonr(df[\"had_chickenpox_column\"],df[\"num_chickenpox_vaccine_column\"])\n",
"    \n",
"    # just return the correlation\n",
"    #return corr\n",
"\n",
"    # YOUR CODE HERE\n",
"    raise NotImplementedError()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SEQNUMC</th>\n",
" <th>SEQNUMHH</th>\n",
" <th>PDAT</th>\n",
" <th>PROVWT_D</th>\n",
" <th>RDDWT_D</th>\n",
" <th>STRATUM</th>\n",
" <th>YEAR</th>\n",
" <th>AGECPOXR</th>\n",
" <th>HAD_CPOX</th>\n",
" <th>AGEGRP</th>\n",
" <th>...</th>\n",
" <th>XVRCTY2</th>\n",
" <th>XVRCTY3</th>\n",
" <th>XVRCTY4</th>\n",
" <th>XVRCTY5</th>\n",
" <th>XVRCTY6</th>\n",
" <th>XVRCTY7</th>\n",
" <th>XVRCTY8</th>\n",
" <th>XVRCTY9</th>\n",
" <th>INS_STAT2_I</th>\n",
" <th>INS_BREAK_I</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>128521</td>\n",
" <td>12852</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" <td>235.916956</td>\n",
" <td>1031</td>\n",
" <td>2017</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>10741</td>\n",
" <td>1074</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" <td>957.353840</td>\n",
" <td>1068</td>\n",
" <td>2017</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>220011</td>\n",
" <td>22001</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" <td>189.611299</td>\n",
" <td>1050</td>\n",
" <td>2017</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>86131</td>\n",
" <td>8613</td>\n",
" <td>1</td>\n",
" <td>675.430817</td>\n",
" <td>333.447418</td>\n",
" <td>1040</td>\n",
" <td>2017</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>227141</td>\n",
" <td>22714</td>\n",
" <td>1</td>\n",
" <td>482.617748</td>\n",
" <td>278.768063</td>\n",
" <td>1008</td>\n",
" <td>2017</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 453 columns</p>\n",
"</div>"
],
"text/plain": [
" SEQNUMC SEQNUMHH PDAT PROVWT_D RDDWT_D STRATUM YEAR AGECPOXR \\\n",
"1 128521 12852 2 NaN 235.916956 1031 2017 NaN \n",
"2 10741 1074 2 NaN 957.353840 1068 2017 NaN \n",
"3 220011 22001 2 NaN 189.611299 1050 2017 NaN \n",
"4 86131 8613 1 675.430817 333.447418 1040 2017 NaN \n",
"5 227141 22714 1 482.617748 278.768063 1008 2017 NaN \n",
"\n",
" HAD_CPOX AGEGRP ... XVRCTY2 XVRCTY3 XVRCTY4 XVRCTY5 XVRCTY6 \\\n",
"1 2 1 ... NaN NaN NaN \n",
"2 2 1 ... NaN NaN NaN \n",
"3 2 3 ... NaN NaN NaN \n",
"4 2 1 ... NaN NaN NaN \n",
"5 2 1 ... NaN NaN NaN \n",
"\n",
" XVRCTY7 XVRCTY8 XVRCTY9 INS_STAT2_I INS_BREAK_I \n",
"1 NaN NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN NaN \n",
"4 NaN NaN NaN 1.0 2.0 \n",
"5 NaN NaN NaN 2.0 1.0 \n",
"\n",
"[5 rows x 453 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import scipy.stats as stats\n",
"\n",
"df = pd.read_csv('datasets/NISPUF17.csv', index_col = 0)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>HAD_CPOX</th>\n",
" <th>P_NUMVRC</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28447</th>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28448</th>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28450</th>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28453</th>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28455</th>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>15286 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" HAD_CPOX P_NUMVRC\n",
"4 2 1.0\n",
"5 2 0.0\n",
"7 2 1.0\n",
"8 2 0.0\n",
"9 1 0.0\n",
"... ... ...\n",
"28447 2 1.0\n",
"28448 2 1.0\n",
"28450 2 1.0\n",
"28453 2 1.0\n",
"28455 2 1.0\n",
"\n",
"[15286 rows x 2 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
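"# keep rows where HAD_CPOX is 1 (yes) or 2 (no), select the two columns, and drop rows with missing values\n",
"df = df[df['HAD_CPOX'].lt(3)].loc[:,['HAD_CPOX','P_NUMVRC']].dropna()\n",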
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.07044873460147984"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# give the two columns shorter names before computing Pearson's r\n",
"df.columns = ['had_chicken_col','num_vaccine_col']\n",
"\n",
"corr, pval = stats.pearsonr(df[\"had_chicken_col\"],df[\"num_vaccine_col\"])\n",
"\n",
"corr"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
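
The filtering and correlation steps in the notebook's last cells can be gathered into a single function. The following is a minimal sketch, not the graded solution. It assumes the frame uses the column names `HAD_CPOX` and `P_NUMVRC` seen in the notebook, and that any `HAD_CPOX` code other than 1 or 2 is a non-answer. The `toy` frame is made-up data, not survey records.

```python
import pandas as pd
import scipy.stats as stats


def corr_chickenpox(df):
    """Return Pearson's r between HAD_CPOX (1 = yes, 2 = no) and P_NUMVRC."""
    # Assumption: codes other than 1 and 2 are non-answers and are dropped.
    subset = df.loc[df["HAD_CPOX"].isin([1, 2]), ["HAD_CPOX", "P_NUMVRC"]]
    # pearsonr cannot handle NaN, so remove rows missing either value.
    subset = subset.dropna()
    corr, pval = stats.pearsonr(subset["HAD_CPOX"], subset["P_NUMVRC"])
    return corr


# Hypothetical toy data: one non-answer code (77) and one missing dose count.
toy = pd.DataFrame({
    "HAD_CPOX": [1, 1, 2, 2, 2, 77, 2],
    "P_NUMVRC": [0, 1, 1, 2, 2, 1, None],
})
toy_corr = corr_chickenpox(toy)
print(toy_corr)
```

In the toy frame the children coded 2 (no chickenpox) have more doses on average, so the result is positive. On the real file the function would be called as `corr_chickenpox(pd.read_csv('datasets/NISPUF17.csv', index_col=0))`, which is how the notebook loads the data.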