{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project progress\n",
"\n",
"By this week, you should have a notebook similar to the one presented below. Make sure to carry out whatever operations are needed so that your dataset is as accurate as possible."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H9FzFUrajfM7"
},
"source": [
"# Importing data\n",
"\n",
"By \"importing data\" we mean the way we prepare the data source so that our program can read it.\n",
"\n",
"There are several ways to import the information. For example, we can simply use the same method we used with our `ejemplo-1.txt` file.\n",
"\n",
"Download the file you want to use into the Drive directory where you will store your data.\n",
"\n",
"As an example, I will use the national COVID-19 cases recorded daily during the first half of 2022: https://datos.cdmx.gob.mx/dataset/casos-asociados-a-covid-19/resource/e5f65f40-5904-492a-ae33-1ea98fb73d78?inner_span=True\n",
"\n",
"I download the CSV file to a directory on my computer and then upload it to my data directory in Google Drive.\n",
"\n",
"Back in our Google Colab notebook, I make sure Google Drive is mounted in Colab and locate the directory that contains my file. In my case: `'/content/drive/MyDrive/Colab Notebooks/curso_datos/casos_nacionales_covid-19_2022_semestre1.csv'`\n",
"\n",
"With those steps done, we can perform the import:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "qnXNK7H2kz3M",
"outputId": "f57572b9-4923-481e-d8c2-0fd5f8b172e4"
},
"outputs": [
{
"data": {
"text/plain": [
"['\"\",\"fecha_actualizacion\",\"id_registro\",\"origen\",\"sector\",\"entidad_um\",\"sexo\",\"entidad_nac\",\"entidad_res\",\"municipio_res\",\"tipo_paciente\",\"fecha_ingreso\",\"fecha_sintomas\",\"fecha_def\",\"intubado\",\"neumonia\",\"edad\",\"nacionalidad\",\"embarazo\",\"habla_lengua_indig\",\"indigena\",\"diabetes\",\"epoc\",\"asma\",\"inmusupr\",\"hipertension\",\"otra_com\",\"cardiovascular\",\"obesidad\",\"renal_cronica\",\"tabaquismo\",\"otro_caso\",\"toma_muestra_lab\",\"resultado_lab\",\"toma_muestra_antigeno\",\"resultado_antigeno\",\"clasificacion_final\",\"migrante\",\"pais_nacionalidad\",\"pais_origen\",\"uci\"\\n']"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"datos = '/content/drive/MyDrive/Colab Notebooks/curso_datos/casos_nacionales_covid-19_2022_semestre1.csv'\n",
"\n",
"with open(datos, 'r') as f:\n",
"    # The file is very large: a size hint of 10 makes readlines() stop after the first line.\n",
"    data = f.readlines(10)\n",
"\n",
"data"
]
},
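{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above relies on the size hint of `readlines()`. As a minimal sketch of how that hint behaves, the following example uses an in-memory text buffer in place of the real file (the sample rows are made up for illustration) and compares it with `itertools.islice`, which reads an exact number of lines:\n",
"\n",
"```python\n",
"import io\n",
"from itertools import islice\n",
"\n",
"# Hypothetical sample standing in for the real CSV file.\n",
"texto = '''fecha_ingreso,sexo,edad\n",
"2022-01-01,MUJER,51\n",
"2022-01-02,HOMBRE,30\n",
"'''\n",
"buffer = io.StringIO(texto)\n",
"\n",
"# readlines(hint) stops once the lines read so far exceed the hint in size,\n",
"# so a small hint returns only the first line.\n",
"primera = buffer.readlines(10)\n",
"print(primera)\n",
"\n",
"# islice reads an exact number of lines, regardless of their length.\n",
"buffer.seek(0)\n",
"primeras_dos = list(islice(buffer, 2))\n",
"print(primeras_dos)\n",
"```\n",
"\n",
"When the goal is to peek at a fixed number of rows, `islice` states that intent more directly than a size hint does."
]
},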
{
"cell_type": "markdown",
"metadata": {
"id": "LukLwOCkpa7t"
},
"source": [
"This way we have managed to bring the file into our notebook, but manipulating it as a list of raw text lines would be cumbersome. For that reason, it is preferable to use a library that helps us process the data. In our case, we will use `pandas`.\n",
"\n",
"To make it available in our program, we only need to import the library:\n",
"\n",
"`import pandas as pd`\n",
"\n",
"After that, we can open our file from Python:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 508
},
"id": "b4yv7auIqCt7",
"outputId": "7492d24f-b249-4f1f-80a3-f15935a62551"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:3326: DtypeWarning: Columns (13) have mixed types.Specify dtype option on import or set low_memory=False.\n",
" exec(code_obj, self.user_global_ns, self.user_ns)\n"
]
},
{
"data": {
"text/plain": [
" Unnamed: 0 fecha_actualizacion id_registro origen sector \\\n",
"0 1 2022-06-26 0793b8 FUERA DE USMER SSA \n",
"1 2 2022-06-26 0fef08 USMER SSA \n",
"2 3 2022-06-26 11e31a FUERA DE USMER SSA \n",
"3 4 2022-06-26 0741e4 FUERA DE USMER ISSSTE \n",
"4 5 2022-06-26 13c92b FUERA DE USMER SSA \n",
"\n",
" entidad_um sexo entidad_nac entidad_res municipio_res ... \\\n",
"0 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO NaN NaN ... \n",
"1 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO NaN NaN ... \n",
"2 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO NaN NaN ... \n",
"3 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO NaN NaN ... \n",
"4 CIUDAD DE MÉXICO MUJER CIUDAD DE MÉXICO NaN NaN ... \n",
"\n",
" otro_caso toma_muestra_lab resultado_lab \\\n",
"0 NO NO NO APLICA (CASO SIN MUESTRA) \n",
"1 NO SI POSITIVO A SARS-COV-2 \n",
"2 NO NO NO APLICA (CASO SIN MUESTRA) \n",
"3 NO SI RESULTADO NO ADECUADO \n",
"4 SI NO NO APLICA (CASO SIN MUESTRA) \n",
"\n",
" toma_muestra_antigeno resultado_antigeno \\\n",
"0 SI NEGATIVO A SARS-COV-2 \n",
"1 NO NO APLICA (CASO SIN MUESTRA) \n",
"2 SI NEGATIVO A SARS-COV-2 \n",
"3 NO NO APLICA (CASO SIN MUESTRA) \n",
"4 SI NEGATIVO A SARS-COV-2 \n",
"\n",
" clasificacion_final migrante pais_nacionalidad \\\n",
"0 NEGATIVO A SARS-COV-2 NO ESPECIFICADO MÉXICO \n",
"1 CASO DE SARS-COV-2 CONFIRMADO NO ESPECIFICADO MÉXICO \n",
"2 NEGATIVO A SARS-COV-2 NO ESPECIFICADO MÉXICO \n",
"3 NO REALIZADO POR LABORATORIO NO ESPECIFICADO MÉXICO \n",
"4 NEGATIVO A SARS-COV-2 NO ESPECIFICADO MÉXICO \n",
"\n",
" pais_origen uci \n",
"0 NO APLICA NO APLICA \n",
"1 NO APLICA NO APLICA \n",
"2 NO APLICA NO APLICA \n",
"3 NO APLICA NO \n",
"4 NO APLICA NO APLICA \n",
"\n",
"[5 rows x 41 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"covid_nacional = pd.read_csv(datos)\n",
"covid_nacional.head()"
]
},
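{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `DtypeWarning` in the output above reports that column 13 (`fecha_def`) has mixed types. One possible way to avoid it is to declare the column type when reading. The following is a sketch on a tiny in-memory CSV, not the real file: the column names come from the dataset, while the rows are hypothetical.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Hypothetical rows; fecha_def is empty when there is no date to record.\n",
"csv_text = '''id_registro,edad,fecha_def\n",
"0793b8,75,\n",
"0fef08,32,2022-01-15\n",
"'''\n",
"\n",
"# Declaring dtype up front tells pandas to read the column as text\n",
"# instead of guessing its type from the values it finds.\n",
"df = pd.read_csv(io.StringIO(csv_text), dtype={'fecha_def': str})\n",
"print(df.dtypes)\n",
"\n",
"# Empty cells are still read as missing values.\n",
"print(df['fecha_def'].isna().sum())\n",
"```\n",
"\n",
"The warning itself also suggests `low_memory=False`, which makes pandas infer each column type from the whole file at the cost of more memory."
]
},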
{
"cell_type": "markdown",
"metadata": {
"id": "Nu3Ce4XbqZ2J"
},
"source": [
"With that, our file is ready to be processed :)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zlne2GAtX-4M"
},
"source": [
"# Data structure analysis and preparation\n",
"\n",
"## Describe the data source\n",
"\n",
"A simple description of the shape of the data source is the following:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "blLldK-XYOqQ",
"outputId": "5c2f0bf3-43d5-4109-ccf8-22e7cd55332c"
},
"outputs": [
{
"data": {
"text/plain": [
"1323501"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of rows\n",
"filas = covid_nacional.shape[0]\n",
"filas"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9tB3RFOpcszB"
},
"source": [
"This data source has enough records (more than 1.3 million rows) to justify a distant reading of the information. A person could hardly make sense of its contents just by \"reading\" the rows of these tables."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "9X7l8HuLdAvE",
"outputId": "973ca9e2-040a-4475-923d-0945276ae0bf"
},
"outputs": [
{
"data": {
"text/plain": [
"41"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of columns\n",
"columnas = covid_nacional.shape[1]\n",
"columnas"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Jv469AOXdGvl"
},
"source": [
"We also see that the dataset has a significant number of categories (41 columns). This means that a single source of information is enough to compare columns against each other when analyzing the data."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4PUmMBdvda_-",
"outputId": "9d941421-83a6-4631-a735-a2fd7d9bb1ef"
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Unnamed: 0', 'fecha_actualizacion', 'id_registro', 'origen', 'sector',\n",
" 'entidad_um', 'sexo', 'entidad_nac', 'entidad_res', 'municipio_res',\n",
" 'tipo_paciente', 'fecha_ingreso', 'fecha_sintomas', 'fecha_def',\n",
" 'intubado', 'neumonia', 'edad', 'nacionalidad', 'embarazo',\n",
" 'habla_lengua_indig', 'indigena', 'diabetes', 'epoc', 'asma',\n",
" 'inmusupr', 'hipertension', 'otra_com', 'cardiovascular', 'obesidad',\n",
" 'renal_cronica', 'tabaquismo', 'otro_caso', 'toma_muestra_lab',\n",
" 'resultado_lab', 'toma_muestra_antigeno', 'resultado_antigeno',\n",
" 'clasificacion_final', 'migrante', 'pais_nacionalidad', 'pais_origen',\n",
" 'uci'],\n",
" dtype='object')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# column names\n",
"covid_nacional.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Mbs9hsyUdjlA"
},
"source": [
"The column names help us identify the categories and the kind of data that our data source may contain.\n",
"\n",
"Not every data source names its columns in a meaningful way. In our example, it is fairly easy to tell what kind of information each category or column holds, and even which data type each one should ideally have."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rzqdf8zfd-qH"
},
"source": [
"## Data types with `dtypes`"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "OrisVD9BeE4q",
"outputId": "0c0203fb-cc96-47a2-a57f-cc6423ae668f"
},
"outputs": [
{
"data": {
"text/plain": [
"Unnamed: 0 int64\n",
"fecha_actualizacion object\n",
"id_registro object\n",
"origen object\n",
"sector object\n",
"entidad_um object\n",
"sexo object\n",
"entidad_nac object\n",
"entidad_res object\n",
"municipio_res object\n",
"tipo_paciente object\n",
"fecha_ingreso object\n",
"fecha_sintomas object\n",
"fecha_def object\n",
"intubado object\n",
"neumonia object\n",
"edad int64\n",
"nacionalidad object\n",
"embarazo object\n",
"habla_lengua_indig object\n",
"indigena object\n",
"diabetes object\n",
"epoc object\n",
"asma object\n",
"inmusupr object\n",
"hipertension object\n",
"otra_com object\n",
"cardiovascular object\n",
"obesidad object\n",
"renal_cronica object\n",
"tabaquismo object\n",
"otro_caso object\n",
"toma_muestra_lab object\n",
"resultado_lab object\n",
"toma_muestra_antigeno object\n",
"resultado_antigeno object\n",
"clasificacion_final object\n",
"migrante object\n",
"pais_nacionalidad object\n",
"pais_origen object\n",
"uci object\n",
"dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"covid_nacional.dtypes"
]
},
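{
"cell_type": "markdown",
"metadata": {},
"source": [
"The listing above shows date-like columns such as `fecha_ingreso` and `fecha_sintomas` stored as `object`. A minimal sketch of converting them with `pd.to_datetime`, using a few made-up rows rather than the real DataFrame (the `dias_hasta_ingreso` column is a hypothetical name introduced only for this example):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical rows that mimic the date columns of the dataset.\n",
"muestra = pd.DataFrame({\n",
"    'fecha_ingreso': ['2022-02-21', '2022-01-07', '2022-02-04'],\n",
"    'fecha_sintomas': ['2022-02-16', '2022-01-02', None],\n",
"})\n",
"\n",
"# errors='coerce' turns values that cannot be parsed into NaT instead of raising.\n",
"for columna in ['fecha_ingreso', 'fecha_sintomas']:\n",
"    muestra[columna] = pd.to_datetime(muestra[columna], format='%Y-%m-%d', errors='coerce')\n",
"\n",
"# With datetime columns, date arithmetic becomes possible.\n",
"muestra['dias_hasta_ingreso'] = (muestra['fecha_ingreso'] - muestra['fecha_sintomas']).dt.days\n",
"print(muestra.dtypes)\n",
"print(muestra)\n",
"```\n",
"\n",
"Missing dates stay missing after the conversion, so any later calculation has to decide how to treat them."
]
},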
{
"cell_type": "markdown",
"metadata": {
"id": "mYTidlIaeJeG"
},
"source": [
"Most of the columns are represented as type `object`, which means they hold text, text mixed with numbers, or mixed types.\n",
"\n",
"Some columns could have the `datetime` data type but are represented as `object`. Those columns will have to be converted before we can run date operations and visualizations on them.\n",
"\n",
"## Describing the data with `describe()`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 300
},
"id": "Xgau6A4jfIWY",
"outputId": "59956d08-9c55-4dc1-d4e1-9c648302dc4c"
},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 fecha_actualizacion id_registro origen sector \\\n",
"count 1.323501e+06 1323501 1323501 1323501 1323501 \n",
"unique NaN 1 1323501 2 12 \n",
"top NaN 2022-06-26 0793b8 FUERA DE USMER SSA \n",
"freq NaN 1323501 1 1170267 793606 \n",
"mean 6.617510e+05 NaN NaN NaN NaN \n",
"std 3.820620e+05 NaN NaN NaN NaN \n",
"min 1.000000e+00 NaN NaN NaN NaN \n",
"25% 3.308760e+05 NaN NaN NaN NaN \n",
"50% 6.617510e+05 NaN NaN NaN NaN \n",
"75% 9.926260e+05 NaN NaN NaN NaN \n",
"max 1.323501e+06 NaN NaN NaN NaN \n",
"\n",
" entidad_um sexo entidad_nac entidad_res \\\n",
"count 1323501 1323501 1323501 149707 \n",
"unique 32 2 33 23 \n",
"top CIUDAD DE MÉXICO MUJER CIUDAD DE MÉXICO MÉXICO \n",
"freq 1314661 733991 1052272 133374 \n",
"mean NaN NaN NaN NaN \n",
"std NaN NaN NaN NaN \n",
"min NaN NaN NaN NaN \n",
"25% NaN NaN NaN NaN \n",
"50% NaN NaN NaN NaN \n",
"75% NaN NaN NaN NaN \n",
"max NaN NaN NaN NaN \n",
"\n",
" municipio_res ... otro_caso toma_muestra_lab \\\n",
"count 149707 ... 1323501 1323501 \n",
"unique 1190 ... 3 2 \n",
"top NEZAHUALCÓYOTL ... NO NO \n",
"freq 26282 ... 848434 1152385 \n",
"mean NaN ... NaN NaN \n",
"std NaN ... NaN NaN \n",
"min NaN ... NaN NaN \n",
"25% NaN ... NaN NaN \n",
"50% NaN ... NaN NaN \n",
"75% NaN ... NaN NaN \n",
"max NaN ... NaN NaN \n",
"\n",
" resultado_lab toma_muestra_antigeno \\\n",
"count 1323501 1323501 \n",
"unique 5 2 \n",
"top NO APLICA (CASO SIN MUESTRA) SI \n",
"freq 1152385 1204565 \n",
"mean NaN NaN \n",
"std NaN NaN \n",
"min NaN NaN \n",
"25% NaN NaN \n",
"50% NaN NaN \n",
"75% NaN NaN \n",
"max NaN NaN \n",
"\n",
" resultado_antigeno clasificacion_final migrante \\\n",
"count 1323501 1323501 1323501 \n",
"unique 3 7 3 \n",
"top NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 NO ESPECIFICADO \n",
"freq 771647 792364 1305180 \n",
"mean NaN NaN NaN \n",
"std NaN NaN NaN \n",
"min NaN NaN NaN \n",
"25% NaN NaN NaN \n",
"50% NaN NaN NaN \n",
"75% NaN NaN NaN \n",
"max NaN NaN NaN \n",
"\n",
" pais_nacionalidad pais_origen uci \n",
"count 1323501 1320040 1323501 \n",
"unique 122 1 4 \n",
"top MÉXICO NO APLICA NO APLICA \n",
"freq 1304673 1320040 1297093 \n",
"mean NaN NaN NaN \n",
"std NaN NaN NaN \n",
"min NaN NaN NaN \n",
"25% NaN NaN NaN \n",
"50% NaN NaN NaN \n",
"75% NaN NaN NaN \n",
"max NaN NaN NaN \n",
"\n",
"[11 rows x 41 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"covid_nacional.describe(include='all')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NdoEF5mofwNe"
},
"source": [
"By default, `describe()` only summarizes numeric columns; the `include='all'` parameter forces the operation to run on every column.\n",
"\n",
"This lets us spot columns whose frequencies could be worth analyzing. For example, correlations between chronic diseases and results (positive or negative), or the frequency of cases among migrants, women, or indigenous people in relation to a geographic area.\n",
"\n",
"Because this data source is not georeferenced (we have the names of the municipalities, but not their latitude and longitude), we will need a second data source that lets us add that information."
]
},
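{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of what `include='all'` changes, and of the kind of frequency comparison mentioned above, here is a sketch on a made-up DataFrame. The column names come from the dataset; the values are hypothetical and simplified.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical values, only to show the shape of the results.\n",
"df = pd.DataFrame({\n",
"    'edad': [75, 32, 30, 51],\n",
"    'diabetes': ['SI', 'NO', 'NO', 'SI'],\n",
"    'resultado_antigeno': ['POSITIVO', 'NEGATIVO', 'NEGATIVO', 'POSITIVO'],\n",
"})\n",
"\n",
"# Default: only the numeric column (edad) is summarized.\n",
"resumen_numerico = df.describe()\n",
"print(resumen_numerico)\n",
"\n",
"# include='all' adds count, unique, top and freq for the text columns.\n",
"resumen_completo = df.describe(include='all')\n",
"print(resumen_completo)\n",
"\n",
"# Frequency table relating a chronic condition to the test result.\n",
"tabla = pd.crosstab(df['diabetes'], df['resultado_antigeno'])\n",
"print(tabla)\n",
"```\n",
"\n",
"A frequency table like this only counts co-occurrences; deciding whether the relationship is meaningful would need a proper statistical test."
]
},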
{
"cell_type": "markdown",
"metadata": {
"id": "MtLRzOfYWhTk"
},
"source": [
"# Data processing\n",
"\n",
"## Data manipulation\n",
"\n",
"Using the `.loc` indexer with a boolean condition to select the rows in which `sexo` is `'MUJER'` and `migrante` is `'SI'`:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "FQDBACFAW3-7",
"outputId": "ab3fcd26-cbaf-4146-ff81-07c7fb89bcae"
},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 fecha_actualizacion id_registro origen sector \\\n",
"252 253 2022-06-26 b94888 FUERA DE USMER SSA \n",
"971 972 2022-06-26 d22ed2 USMER SSA \n",
"979 980 2022-06-26 6a5061 USMER SSA \n",
"5877 5878 2022-06-26 ac1990 FUERA DE USMER PRIVADA \n",
"6666 6667 2022-06-26 8d5273 FUERA DE USMER SSA \n",
"... ... ... ... ... ... \n",
"1298223 1298224 2022-06-26 g16c3a9 FUERA DE USMER PRIVADA \n",
"1305240 1305241 2022-06-26 g154063 FUERA DE USMER SSA \n",
"1305279 1305280 2022-06-26 g1683fe FUERA DE USMER SSA \n",
"1316685 1316686 2022-06-26 g0ebf9f FUERA DE USMER SSA \n",
"1319864 1319865 2022-06-26 g093480 FUERA DE USMER SSA \n",
"\n",
" entidad_um sexo entidad_nac entidad_res municipio_res \\\n",
"252 CIUDAD DE MÉXICO MUJER NO ESPECIFICADO NaN NaN \n",
"971 CIUDAD DE MÉXICO MUJER NO ESPECIFICADO NaN NaN \n",
"979 CIUDAD DE MÉXICO MUJER NO ESPECIFICADO NaN NaN \n",
"5877 CIUDAD DE MÉXICO MUJER NO ESPECIFICADO NaN NaN \n",
"6666 CIUDAD DE MÉXICO MUJER NO ESPECIFICADO NaN NaN \n",
"... ... ... ... ... ... \n",
"1298223 CIUDAD DE MÉXICO MUJER NO ESPECIFICADO NaN NaN \n",
"1305240 CIUDAD DE MÉXICO MUJER NO ESPECIFICADO NaN NaN \n",
"1305279 CIUDAD DE MÉXICO MUJER NO ESPECIFICADO NaN NaN \n",
"1316685 CIUDAD DE MÉXICO MUJER NO ESPECIFICADO NaN NaN \n",
"1319864 CIUDAD DE MÉXICO MUJER NO ESPECIFICADO NaN NaN \n",
"\n",
" ... otro_caso toma_muestra_lab resultado_lab \\\n",
"252 ... SI NO NO APLICA (CASO SIN MUESTRA) \n",
"971 ... SI NO NO APLICA (CASO SIN MUESTRA) \n",
"979 ... SI NO NO APLICA (CASO SIN MUESTRA) \n",
"5877 ... NO NO NO APLICA (CASO SIN MUESTRA) \n",
"6666 ... NO NO NO APLICA (CASO SIN MUESTRA) \n",
"... ... ... ... ... \n",
"1298223 ... NO NO NO APLICA (CASO SIN MUESTRA) \n",
"1305240 ... NO NO NO APLICA (CASO SIN MUESTRA) \n",
"1305279 ... NO NO NO APLICA (CASO SIN MUESTRA) \n",
"1316685 ... NO NO NO APLICA (CASO SIN MUESTRA) \n",
"1319864 ... NO NO NO APLICA (CASO SIN MUESTRA) \n",
"\n",
" toma_muestra_antigeno resultado_antigeno clasificacion_final \\\n",
"252 SI NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"971 SI NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"979 SI NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"5877 SI NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"6666 SI NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"... ... ... ... \n",
"1298223 SI NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"1305240 SI NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"1305279 SI NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"1316685 SI NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"1319864 SI NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"\n",
" migrante pais_nacionalidad pais_origen uci \n",
"252 SI VENEZUELA NaN NO APLICA \n",
"971 SI ESTADOS UNIDOS DE AMÉRICA NaN NO APLICA \n",
"979 SI ESTADOS UNIDOS DE AMÉRICA NaN NO APLICA \n",
"5877 SI ESTADOS UNIDOS DE AMÉRICA NaN NO APLICA \n",
"6666 SI CUBA NaN NO APLICA \n",
"... ... ... ... ... \n",
"1298223 SI ITALIA NaN NO APLICA \n",
"1305240 SI EL SALVADOR NaN NO APLICA \n",
"1305279 SI GUATEMALA NaN NO APLICA \n",
"1316685 SI REPÚBLICA DE HONDURAS NaN NO APLICA \n",
"1319864 SI CHILE NaN NO APLICA \n",
"\n",
"[1611 rows x 41 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"seleccion = covid_nacional.loc[(covid_nacional['sexo'] == 'MUJER') & (covid_nacional['migrante'] == 'SI')]\n",
"seleccion"
]
},
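{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the difference between label-based and position-based selection concrete, here is a short sketch on a made-up DataFrame: `.loc` accepts a boolean mask or index labels, while `.iloc` accepts integer positions. The rows and variable names are hypothetical.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical rows with a non-default index, so labels and positions differ.\n",
"df = pd.DataFrame(\n",
"    {'sexo': ['MUJER', 'HOMBRE', 'MUJER'], 'migrante': ['SI', 'SI', 'NO']},\n",
"    index=[252, 971, 979],\n",
")\n",
"\n",
"# Boolean mask: each condition goes in parentheses and they are combined with &.\n",
"mascara = (df['sexo'] == 'MUJER') & (df['migrante'] == 'SI')\n",
"por_condicion = df.loc[mascara]\n",
"print(por_condicion)\n",
"\n",
"# .loc selects by index label, .iloc by integer position.\n",
"por_etiqueta = df.loc[971, 'sexo']\n",
"por_posicion = df.iloc[0, 1]\n",
"print(por_etiqueta, por_posicion)\n",
"```\n",
"\n",
"Note that the filtered result keeps the original index labels, which is why the selection above shows row numbers such as 252 and 971 rather than 0 and 1."
]
},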
{
"cell_type": "markdown",
"metadata": {
"id": "wqwpZgPDYwqG"
},
"source": [
"We rename some columns so that their names match the key used to join the two DataFrames:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 358
},
"id": "mebho60TY5-r",
"outputId": "1dd52513-4deb-4082-83ab-5bc06975bafb"
},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 fecha_actualizacion id_registro origen sector \\\n",
"7 8 2022-06-26 0ba73d FUERA DE USMER ISSSTE \n",
"8 9 2022-06-26 0681f2 FUERA DE USMER SSA \n",
"9 10 2022-06-26 0a98b4 FUERA DE USMER SSA \n",
"\n",
" entidad_um sexo entidad_nacimiento entidad_residencia \\\n",
"7 CIUDAD DE MÉXICO MUJER QUERÉTARO MÉXICO \n",
"8 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO NaN \n",
"9 CIUDAD DE MÉXICO MUJER MICHOACÁN DE OCAMPO NaN \n",
"\n",
" municipio_residencia ... otro_caso toma_muestra_lab \\\n",
"7 NAUCALPAN DE JUÁREZ ... NO NO \n",
"8 NaN ... NO NO \n",
"9 NaN ... NO NO \n",
"\n",
" resultado_lab toma_muestra_antigeno \\\n",
"7 NO APLICA (CASO SIN MUESTRA) NO \n",
"8 NO APLICA (CASO SIN MUESTRA) SI \n",
"9 NO APLICA (CASO SIN MUESTRA) SI \n",
"\n",
" resultado_antigeno clasificacion_final \\\n",
"7 NO APLICA (CASO SIN MUESTRA) CASO SOSPECHOSO \n",
"8 NEGATIVO A SARS-COV-2 NEGATIVO A SARS-COV-2 \n",
"9 POSITIVO A SARS-COV-2 CASO DE SARS-COV-2 CONFIRMADO \n",
"\n",
" migrante pais_nacionalidad pais_origen uci \n",
"7 NO ESPECIFICADO MÉXICO NO APLICA NO APLICA \n",
"8 NO ESPECIFICADO MÉXICO NO APLICA NO APLICA \n",
"9 NO ESPECIFICADO MÉXICO NO APLICA NO APLICA \n",
"\n",
"[3 rows x 41 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"covid_nacional.rename(columns={\n",
" \"entidad_nac\": \"entidad_nacimiento\",\n",
" \"entidad_res\": \"entidad_residencia\",\n",
" \"municipio_res\": \"municipio_residencia\"\n",
"}, inplace=True)\n",
"covid_nacional[7:10]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "epndNHsqZAw5"
},
"source": [
"## Merge\n",
"\n",
"New dataset resulting from combining `covid_nacional` with `areas_inegi_tm` on the shared `municipio_residencia` column:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 392
},
"id": "H8vomfbAZIDU",
"outputId": "9f7cbe5d-3976-4368-a4c5-57282e86bc14"
},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 fecha_actualizacion id_registro origen sector \\\n",
"0 8 2022-06-26 0ba73d FUERA DE USMER ISSSTE \n",
"1 143 2022-06-26 588e9b FUERA DE USMER SSA \n",
"2 154 2022-06-26 51860a USMER SSA \n",
"3 912 2022-06-26 de16a0 USMER SSA \n",
"4 1032 2022-06-26 5f39e3 USMER SSA \n",
"\n",
" entidad_um sexo entidad_nacimiento entidad_residencia \\\n",
"0 CIUDAD DE MÉXICO MUJER QUERÉTARO MÉXICO \n",
"1 CIUDAD DE MÉXICO MUJER CIUDAD DE MÉXICO MÉXICO \n",
"2 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO MÉXICO \n",
"3 CIUDAD DE MÉXICO MUJER CIUDAD DE MÉXICO MÉXICO \n",
"4 CIUDAD DE MÉXICO HOMBRE GUANAJUATO MÉXICO \n",
"\n",
" municipio_residencia ... Latitud Longitud Lat_Decimal \\\n",
"0 naucalpan de juárez ... 19°28´43.690N\" 099°13´59.585W\" 19.478803 \n",
"1 naucalpan de juárez ... 19°28´43.690N\" 099°13´59.585W\" 19.478803 \n",
"2 naucalpan de juárez ... 19°28´43.690N\" 099°13´59.585W\" 19.478803 \n",
"3 naucalpan de juárez ... 19°28´43.690N\" 099°13´59.585W\" 19.478803 \n",
"4 naucalpan de juárez ... 19°28´43.690N\" 099°13´59.585W\" 19.478803 \n",
"\n",
" Lon_Decimal Altitud Cve_Carta Pob_Total Pob_Masculina Pob_Femenina \\\n",
"0 -99.233218 2280 E14A39 776220 373698 402522 \n",
"1 -99.233218 2280 E14A39 776220 373698 402522 \n",
"2 -99.233218 2280 E14A39 776220 373698 402522 \n",
"3 -99.233218 2280 E14A39 776220 373698 402522 \n",
"4 -99.233218 2280 E14A39 776220 373698 402522 \n",
"\n",
" Total De Viviendas Habitadas \n",
"0 225509 \n",
"1 225509 \n",
"2 225509 \n",
"3 225509 \n",
"4 225509 \n",
"\n",
"[5 rows x 59 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conjunto_datos = pd.merge(covid_nacional, areas_inegi_tm, how='inner', on='municipio_residencia')\n",
"print(conjunto_datos.shape)\n",
"conjunto_datos.head()"
]
},
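{
"cell_type": "markdown",
"metadata": {},
"source": [
"An inner merge silently drops rows whose key has no exact match, so it helps to normalize the key and inspect the result. The following is a sketch under assumptions: both tables are tiny made-up stand-ins (the `casos` and `areas` tables and their values are hypothetical, not the real data), and lowercasing the key is just one possible normalization.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-ins for the cases table and the geographic table.\n",
"casos = pd.DataFrame({\n",
"    'id_registro': ['a1', 'b2', 'c3'],\n",
"    'municipio_residencia': ['NAUCALPAN DE JUÁREZ', 'NEZAHUALCÓYOTL', None],\n",
"})\n",
"areas = pd.DataFrame({\n",
"    'municipio_residencia': ['naucalpan de juárez', 'nezahualcóyotl'],\n",
"    'Lat_Decimal': [19.48, 19.40],\n",
"})\n",
"\n",
"# Normalize the key so that both tables spell it the same way.\n",
"casos['municipio_residencia'] = casos['municipio_residencia'].str.lower().str.strip()\n",
"\n",
"# how='left' keeps every case; indicator=True adds a _merge column showing\n",
"# which rows found a match; validate checks that each key is unique in areas.\n",
"combinado = pd.merge(casos, areas, how='left', on='municipio_residencia',\n",
"                     indicator=True, validate='many_to_one')\n",
"print(combinado)\n",
"print(combinado['_merge'].value_counts())\n",
"```\n",
"\n",
"Rows marked `left_only` are the ones an inner merge would have discarded, which makes it easier to see how much data is lost to missing or mismatched municipality names."
]
},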
{
"cell_type": "markdown",
"metadata": {
"id": "I6v6L4v8bgT5"
},
"source": [
"## Data cleaning\n",
"\n",
"### Keeping only the useful columns"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
},
"id": "2eP6bVDfblg0",
"outputId": "890e15ab-aa5d-4d0b-caaf-fb0d7f3b40ae"
},
"outputs": [
{
"data": {
"text/plain": [
" sexo edad entidad_nacimiento municipio_residencia indigena \\\n",
"0 MUJER 75 QUERÉTARO naucalpan de juárez NO \n",
"1 MUJER 32 CIUDAD DE MÉXICO naucalpan de juárez NO ESPECIFICADO \n",
"2 HOMBRE 30 CIUDAD DE MÉXICO naucalpan de juárez NO \n",
"3 MUJER 51 CIUDAD DE MÉXICO naucalpan de juárez NO \n",
"4 HOMBRE 83 GUANAJUATO naucalpan de juárez NO \n",
"\n",
" nacionalidad migrante pais_nacionalidad fecha_ingreso \\\n",
"0 MEXICANA NO ESPECIFICADO MÉXICO 2022-02-21 \n",
"1 MEXICANA NO ESPECIFICADO MÉXICO 2022-01-07 \n",
"2 MEXICANA NO ESPECIFICADO MÉXICO 2022-02-04 \n",
"3 MEXICANA NO ESPECIFICADO MÉXICO 2022-01-01 \n",
"4 MEXICANA NO ESPECIFICADO MÉXICO 2022-01-01 \n",
"\n",
" fecha_sintomas fecha_def municipio_residencia Lat_Decimal Lon_Decimal \n",
"0 2022-02-16 NaN naucalpan de juárez 19.478803 -99.233218 \n",
"1 2022-01-02 NaN naucalpan de juárez 19.478803 -99.233218 \n",
"2 2022-02-03 NaN naucalpan de juárez 19.478803 -99.233218 \n",
"3 2021-12-28 NaN naucalpan de juárez 19.478803 -99.233218 \n",
"4 2021-12-30 NaN naucalpan de juárez 19.478803 -99.233218 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
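],
"source": [
"# Keep only the columns needed for the analysis, listing each one once;\n",
"# .copy() returns an independent DataFrame that is safe to modify later.\n",
"columnas_utiles = ['sexo', 'edad', 'entidad_nacimiento', 'municipio_residencia', 'indigena', 'nacionalidad', 'migrante', 'pais_nacionalidad', 'fecha_ingreso', 'fecha_sintomas', 'fecha_def', 'Lat_Decimal', 'Lon_Decimal']\n",
"muestra_covid = conjunto_datos[columnas_utiles].copy()\n",
"muestra_covid.head()"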
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "y7SLyxm3cJT6"
},
"source": [
"### Dealing with null values"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 835
},
"id": "jyX1zIkocLwU",
"outputId": "e9f70bed-dda9-47b6-b3a3-bf9fc5863fe4"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py:5182: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" downcast=downcast,\n",
"/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py:6392: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" return self._update_inplace(result)\n"
]
},
{
"data": {
"text/html": [
"
amatlán de los reyes
\n",
"
18.847578
\n",
"
-96.915484
\n",
"
\n",
"
\n",
"
158081
\n",
"
MUJER
\n",
"
46
\n",
"
CIUDAD DE MÉXICO
\n",
"
amatlán de los reyes
\n",
"
NO
\n",
"
MEXICANA
\n",
"
NO ESPECIFICADO
\n",
"
MÉXICO
\n",
"
2022-06-22
\n",
"
2022-06-19
\n",
"
NaN
\n",
"
amatlán de los reyes
\n",
"
18.847578
\n",
"
-96.915484
\n",
"
\n",
"
\n",
"
158082
\n",
"
MUJER
\n",
"
59
\n",
"
CIUDAD DE MÉXICO
\n",
"
general simón bolívar
\n",
"
NO
\n",
"
MEXICANA
\n",
"
NO ESPECIFICADO
\n",
"
MÉXICO
\n",
"
2022-06-23
\n",
"
2022-06-22
\n",
"
NaN
\n",
"
general simón bolívar
\n",
"
24.689074
\n",
"
-103.225975
\n",
"
\n",
"
\n",
"
158083
\n",
"
MUJER
\n",
"
27
\n",
"
MÉXICO
\n",
"
temozón
\n",
"
NO
\n",
"
MEXICANA
\n",
"
NO ESPECIFICADO
\n",
"
MÉXICO
\n",
"
2022-06-24
\n",
"
2022-06-22
\n",
"
NaN
\n",
"
temozón
\n",
"
20.803680
\n",
"
-88.201158
\n",
"
\n",
"
\n",
"
158084
\n",
"
MUJER
\n",
"
32
\n",
"
MÉXICO
\n",
"
izamal
\n",
"
NO
\n",
"
MEXICANA
\n",
"
NO ESPECIFICADO
\n",
"
MÉXICO
\n",
"
2022-06-24
\n",
"
2022-06-20
\n",
"
NaN
\n",
"
izamal
\n",
"
20.932998
\n",
"
-89.019715
\n",
"
\n",
" \n",
"
\n",
"
158085 rows × 14 columns
\n",
"
\n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" sexo edad entidad_nacimiento municipio_residencia \\\n",
"0 MUJER 75 QUERÉTARO naucalpan de juárez \n",
"1 MUJER 32 CIUDAD DE MÉXICO naucalpan de juárez \n",
"2 HOMBRE 30 CIUDAD DE MÉXICO naucalpan de juárez \n",
"3 MUJER 51 CIUDAD DE MÉXICO naucalpan de juárez \n",
"4 HOMBRE 83 GUANAJUATO naucalpan de juárez \n",
"... ... ... ... ... \n",
"158080 HOMBRE 12 VERACRUZ DE IGNACIO DE LA LLAVE amatlán de los reyes \n",
"158081 MUJER 46 CIUDAD DE MÉXICO amatlán de los reyes \n",
"158082 MUJER 59 CIUDAD DE MÉXICO general simón bolívar \n",
"158083 MUJER 27 MÉXICO temozón \n",
"158084 MUJER 32 MÉXICO izamal \n",
"\n",
" indigena nacionalidad migrante pais_nacionalidad \\\n",
"0 NO MEXICANA NO ESPECIFICADO MÉXICO \n",
"1 NO ESPECIFICADO MEXICANA NO ESPECIFICADO MÉXICO \n",
"2 NO MEXICANA NO ESPECIFICADO MÉXICO \n",
"3 NO MEXICANA NO ESPECIFICADO MÉXICO \n",
"4 NO MEXICANA NO ESPECIFICADO MÉXICO \n",
"... ... ... ... ... \n",
"158080 NO MEXICANA NO ESPECIFICADO MÉXICO \n",
"158081 NO MEXICANA NO ESPECIFICADO MÉXICO \n",
"158082 NO MEXICANA NO ESPECIFICADO MÉXICO \n",
"158083 NO MEXICANA NO ESPECIFICADO MÉXICO \n",
"158084 NO MEXICANA NO ESPECIFICADO MÉXICO \n",
"\n",
" fecha_ingreso fecha_sintomas fecha_def municipio_residencia \\\n",
"0 2022-02-21 2022-02-16 NaN naucalpan de juárez \n",
"1 2022-01-07 2022-01-02 NaN naucalpan de juárez \n",
"2 2022-02-04 2022-02-03 NaN naucalpan de juárez \n",
"3 2022-01-01 2021-12-28 NaN naucalpan de juárez \n",
"4 2022-01-01 2021-12-30 NaN naucalpan de juárez \n",
"... ... ... ... ... \n",
"158080 2022-06-23 2022-06-23 NaN amatlán de los reyes \n",
"158081 2022-06-22 2022-06-19 NaN amatlán de los reyes \n",
"158082 2022-06-23 2022-06-22 NaN general simón bolívar \n",
"158083 2022-06-24 2022-06-22 NaN temozón \n",
"158084 2022-06-24 2022-06-20 NaN izamal \n",
"\n",
" Lat_Decimal Lon_Decimal \n",
"0 19.478803 -99.233218 \n",
"1 19.478803 -99.233218 \n",
"2 19.478803 -99.233218 \n",
"3 19.478803 -99.233218 \n",
"4 19.478803 -99.233218 \n",
"... ... ... \n",
"158080 18.847578 -96.915484 \n",
"158081 18.847578 -96.915484 \n",
"158082 24.689074 -103.225975 \n",
"158083 20.803680 -88.201158 \n",
"158084 20.932998 -89.019715 \n",
"\n",
"[158085 rows x 14 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"muestra_covid.fillna({'municipio_residencia': 'NO APLICA', 'pais_nacionalidad': 'NO APLICA'}, inplace=True)\n",
"muestra_covid"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xhoZAZ12ca4D"
},
"source": [
"### Transformar datos"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "kevd6_hXcdwH",
"outputId": "eceaa809-2185-4adf-a301-b46e870b0fbf"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" This is separate from the ipykernel package so we can avoid doing imports until\n"
]
},
{
"data": {
"text/plain": [
"sexo object\n",
"edad int64\n",
"entidad_nacimiento object\n",
"municipio_residencia object\n",
"indigena object\n",
"nacionalidad object\n",
"migrante object\n",
"pais_nacionalidad object\n",
"fecha_ingreso datetime64[ns]\n",
"fecha_sintomas datetime64[ns]\n",
"fecha_def datetime64[ns]\n",
"municipio_residencia object\n",
"Lat_Decimal float64\n",
"Lon_Decimal float64\n",
"dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"columnas = ['fecha_ingreso', 'fecha_sintomas', 'fecha_def']\n",
"for columna in columnas:\n",
" muestra_covid[columna] = pd.to_datetime(muestra_covid.loc[:, columna])\n",
"\n",
"muestra_covid.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6YSKtk_p18es"
},
"source": [
"# Georeferenciar los datos de `pais_nacionalidad`"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "RuPkl6XG3Xpe",
"outputId": "eabebed2-03a1-4744-dfc2-b4a24f9f8f50"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
"Collecting pycountry\n",
" Downloading pycountry-22.3.5.tar.gz (10.1 MB)\n",
"\u001b[K |████████████████████████████████| 10.1 MB 25.6 MB/s \n",
"\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
" Preparing wheel metadata ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from pycountry) (57.4.0)\n",
"Building wheels for collected packages: pycountry\n",
" Building wheel for pycountry (PEP 517) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for pycountry: filename=pycountry-22.3.5-py2.py3-none-any.whl size=10681845 sha256=b33752378c0cefcf260b7f88dcdb7027d6691af7862ba5b8fcbb59b9832d4c66\n",
" Stored in directory: /root/.cache/pip/wheels/0e/06/e8/7ee176e95ea9a8a8c3b3afcb1869f20adbd42413d4611c6eb4\n",
"Successfully built pycountry\n",
"Installing collected packages: pycountry\n",
"Successfully installed pycountry-22.3.5\n"
]
}
],
"source": [
"# obtener librería pycountry\n",
"!pip install pycountry\n",
"import pycountry"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"id": "-5KKVYb92Ac9"
},
"outputs": [],
"source": [
"# Función para convertir los datos a iso en español\n",
"\n",
"import gettext\n",
"\n",
"def map_country_code(country_name, language, iso):\n",
" '''\n",
" country_name: str. El nombre del país en español.\n",
" language: str. El idioma en el que se desea obtener el código (p. ej: 'es').\n",
" iso: str. Opciones posibles: 'alpha_2' o 'alpha_3'.\n",
" '''\n",
" try:\n",
" if country_name is None:\n",
" return None\n",
" elif country_name == 'MÉXICO': # esta condición sintetiza el caso de México (reduce de 5 minutos a 6 segundos el tiempo de ejecución)\n",
" if iso == 'alpha_2':\n",
" return 'MX'\n",
" elif iso == 'alpha_3':\n",
" return 'MEX'\n",
" spanish = gettext.translation('iso3166', pycountry.LOCALES_DIR, languages=[language])\n",
" spanish.install()\n",
" _ = spanish.gettext\n",
" for english_country in pycountry.countries:\n",
" country_name = country_name.lower()\n",
" spanish_country = _(english_country.name).lower()\n",
" if spanish_country == country_name:\n",
" if iso == 'alpha_3':\n",
" return english_country.alpha_3\n",
" elif iso == 'alpha_2':\n",
" return english_country.alpha_2\n",
" except Exception as e:\n",
" raise"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P8kDOx2_2KBW"
},
"source": [
"Conversión de los nombres a códigos alpha"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 565
},
"id": "VmlTv8n32OYj",
"outputId": "199f7879-e7d3-4ddc-eafe-edc884769cfb"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" \"\"\"Entry point for launching an IPython kernel.\n",
"/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" \n"
]
},
{
"data": {
"text/html": [
"\n",
"