{
"cells": [
{
"cell_type": "markdown",
"id": "b69db3b2",
"metadata": {},
"source": [
"# Combinar fuentes de datos\n",
"\n",
"En las bases de datos relacionales, combinar tablas es una de las tareas más comunes y necesarias. Por lo general, la información se divide para hacerla más fácil de gestionar y por ello al recuperarla necesitamos combinarlas para obtener un resultado coherente en una sola tabla.\n",
"\n",
"Para este ejercicio necesitaremos dos fuentes de datos, así que aprovecharé el \"Catálogo Único de Claves de Áreas Geoestadísticas Estatales, Municipales y Localidades\" que proveé el [INEGI](https://www.inegi.org.mx/app/ageeml/) para obtener las coordenadas geográficas de los municipios que se anotan en el conjunto de datos por Covid."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "21666a7e",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/jairoantonio/opt/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3444: DtypeWarning: Columns (13) have mixed types.Specify dtype option on import or set low_memory=False.\n",
" exec(code_obj, self.user_global_ns, self.user_ns)\n"
]
}
],
"source": [
"import pandas as pd\n",
"try:\n",
" covid_nacional = pd.read_csv('../data/casos_nacionales_covid-19_2022_semestre1.csv')\n",
"except FileNotFoundError:\n",
" covid_nacional = pd.read_csv('../datos/casos_nacionales_covid-19_2022_semestre1.csv')\n",
"covid_nacional.rename(columns={\n",
" \"entidad_nac\": \"entidad_nacimiento\",\n",
" \"entidad_res\": \"entidad_residencia\",\n",
" \"municipio_res\": \"municipio_residencia\"\n",
"}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1ffb465c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(300083, 19)\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Mapa | \n",
" Cve_Ent | \n",
" Nom_Ent | \n",
" Nom_Abr | \n",
" Cve_Mun | \n",
" Nom_Mun | \n",
" Cve_Loc | \n",
" Nom_Loc | \n",
" Ámbito | \n",
" Latitud | \n",
" Longitud | \n",
" Lat_Decimal | \n",
" Lon_Decimal | \n",
" Altitud | \n",
" Cve_Carta | \n",
" Pob_Total | \n",
" Pob_Masculina | \n",
" Pob_Femenina | \n",
" Total De Viviendas Habitadas | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10010001 | \n",
" 1 | \n",
" Aguascalientes | \n",
" Ags. | \n",
" 1 | \n",
" Aguascalientes | \n",
" 1 | \n",
" Aguascalientes | \n",
" U | \n",
" 21°52´47.362N\" | \n",
" 102°17´45.768W\" | \n",
" 21.879823 | \n",
" -102.296047 | \n",
" 1878 | \n",
" F13D19 | \n",
" 863893 | \n",
" 419168 | \n",
" 444725 | \n",
" 246259 | \n",
"
\n",
" \n",
" 1 | \n",
" 10010094 | \n",
" 1 | \n",
" Aguascalientes | \n",
" Ags. | \n",
" 1 | \n",
" Aguascalientes | \n",
" 94 | \n",
" Granja Adelita | \n",
" R | \n",
" 21°52´18.749N\" | \n",
" 102°22´24.710W\" | \n",
" 21.871875 | \n",
" -102.373531 | \n",
" 1901 | \n",
" F13D18 | \n",
" 5 | \n",
" * | \n",
" * | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 10010096 | \n",
" 1 | \n",
" Aguascalientes | \n",
" Ags. | \n",
" 1 | \n",
" Aguascalientes | \n",
" 96 | \n",
" Agua Azul | \n",
" R | \n",
" 21°53´01.522N\" | \n",
" 102°21´25.639W\" | \n",
" 21.883756 | \n",
" -102.357122 | \n",
" 1861 | \n",
" F13D18 | \n",
" 41 | \n",
" 24 | \n",
" 17 | \n",
" 12 | \n",
"
\n",
" \n",
" 3 | \n",
" 10010100 | \n",
" 1 | \n",
" Aguascalientes | \n",
" Ags. | \n",
" 1 | \n",
" Aguascalientes | \n",
" 100 | \n",
" Rancho Alegre | \n",
" R | \n",
" 21°51´16.556N\" | \n",
" 102°22´21.884W\" | \n",
" 21.854599 | \n",
" -102.372746 | \n",
" 1879 | \n",
" F13D18 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 10010102 | \n",
" 1 | \n",
" Aguascalientes | \n",
" Ags. | \n",
" 1 | \n",
" Aguascalientes | \n",
" 102 | \n",
" Los Arbolitos [Rancho] | \n",
" R | \n",
" 21°46´48.650N\" | \n",
" 102°21´26.261W\" | \n",
" 21.780181 | \n",
" -102.357295 | \n",
" 1861 | \n",
" F13D18 | \n",
" 8 | \n",
" * | \n",
" * | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Mapa Cve_Ent Nom_Ent Nom_Abr Cve_Mun Nom_Mun \\\n",
"0 10010001 1 Aguascalientes Ags. 1 Aguascalientes \n",
"1 10010094 1 Aguascalientes Ags. 1 Aguascalientes \n",
"2 10010096 1 Aguascalientes Ags. 1 Aguascalientes \n",
"3 10010100 1 Aguascalientes Ags. 1 Aguascalientes \n",
"4 10010102 1 Aguascalientes Ags. 1 Aguascalientes \n",
"\n",
" Cve_Loc Nom_Loc Ámbito Latitud Longitud \\\n",
"0 1 Aguascalientes U 21°52´47.362N\" 102°17´45.768W\" \n",
"1 94 Granja Adelita R 21°52´18.749N\" 102°22´24.710W\" \n",
"2 96 Agua Azul R 21°53´01.522N\" 102°21´25.639W\" \n",
"3 100 Rancho Alegre R 21°51´16.556N\" 102°22´21.884W\" \n",
"4 102 Los Arbolitos [Rancho] R 21°46´48.650N\" 102°21´26.261W\" \n",
"\n",
" Lat_Decimal Lon_Decimal Altitud Cve_Carta Pob_Total Pob_Masculina \\\n",
"0 21.879823 -102.296047 1878 F13D19 863893 419168 \n",
"1 21.871875 -102.373531 1901 F13D18 5 * \n",
"2 21.883756 -102.357122 1861 F13D18 41 24 \n",
"3 21.854599 -102.372746 1879 F13D18 0 0 \n",
"4 21.780181 -102.357295 1861 F13D18 8 * \n",
"\n",
" Pob_Femenina Total De Viviendas Habitadas \n",
"0 444725 246259 \n",
"1 * 2 \n",
"2 17 12 \n",
"3 0 0 \n",
"4 * 2 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"ruta_areas_inegi = '../data/AGEEML_2022842026272.csv'\n",
"areas_inegi = pd.read_csv(ruta_areas_inegi)\n",
"print(areas_inegi.shape)\n",
"areas_inegi.head()"
]
},
{
"cell_type": "markdown",
"id": "c0e05629",
"metadata": {},
"source": [
"Verás que nuestro nuevo conjunto de datos contiene 300,083 filas y 19 columnas. Además de los datos correspondientes a las coordenadas geográficas tenemos otros datos como la altitud, el total de población (total y por sexos), el total de viviendas habitadas, entre otros datos relacionados con claves de entidades.\n",
"\n",
"Tenemos como dato común los municipios de residencia de los casos por Covid y el nombre del municipio (`Nom_Mun`) de los datos del INEGI. Ya que aprendimos a renombrar las columnas, aprovechemos para renombrar las columnas de las coordenadas de tal manera que podamos crear una columna común sobre la cual hacer la unión."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b214b160",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Mapa | \n",
" Cve_Ent | \n",
" Nom_Ent | \n",
" Nom_Abr | \n",
" Cve_Mun | \n",
" municipio_residencia | \n",
" Cve_Loc | \n",
" Nom_Loc | \n",
" Ámbito | \n",
" Latitud | \n",
" Longitud | \n",
" Lat_Decimal | \n",
" Lon_Decimal | \n",
" Altitud | \n",
" Cve_Carta | \n",
" Pob_Total | \n",
" Pob_Masculina | \n",
" Pob_Femenina | \n",
" Total De Viviendas Habitadas | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10010001 | \n",
" 1 | \n",
" Aguascalientes | \n",
" Ags. | \n",
" 1 | \n",
" Aguascalientes | \n",
" 1 | \n",
" Aguascalientes | \n",
" U | \n",
" 21°52´47.362N\" | \n",
" 102°17´45.768W\" | \n",
" 21.879823 | \n",
" -102.296047 | \n",
" 1878 | \n",
" F13D19 | \n",
" 863893 | \n",
" 419168 | \n",
" 444725 | \n",
" 246259 | \n",
"
\n",
" \n",
" 1 | \n",
" 10010094 | \n",
" 1 | \n",
" Aguascalientes | \n",
" Ags. | \n",
" 1 | \n",
" Aguascalientes | \n",
" 94 | \n",
" Granja Adelita | \n",
" R | \n",
" 21°52´18.749N\" | \n",
" 102°22´24.710W\" | \n",
" 21.871875 | \n",
" -102.373531 | \n",
" 1901 | \n",
" F13D18 | \n",
" 5 | \n",
" * | \n",
" * | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 10010096 | \n",
" 1 | \n",
" Aguascalientes | \n",
" Ags. | \n",
" 1 | \n",
" Aguascalientes | \n",
" 96 | \n",
" Agua Azul | \n",
" R | \n",
" 21°53´01.522N\" | \n",
" 102°21´25.639W\" | \n",
" 21.883756 | \n",
" -102.357122 | \n",
" 1861 | \n",
" F13D18 | \n",
" 41 | \n",
" 24 | \n",
" 17 | \n",
" 12 | \n",
"
\n",
" \n",
" 3 | \n",
" 10010100 | \n",
" 1 | \n",
" Aguascalientes | \n",
" Ags. | \n",
" 1 | \n",
" Aguascalientes | \n",
" 100 | \n",
" Rancho Alegre | \n",
" R | \n",
" 21°51´16.556N\" | \n",
" 102°22´21.884W\" | \n",
" 21.854599 | \n",
" -102.372746 | \n",
" 1879 | \n",
" F13D18 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 10010102 | \n",
" 1 | \n",
" Aguascalientes | \n",
" Ags. | \n",
" 1 | \n",
" Aguascalientes | \n",
" 102 | \n",
" Los Arbolitos [Rancho] | \n",
" R | \n",
" 21°46´48.650N\" | \n",
" 102°21´26.261W\" | \n",
" 21.780181 | \n",
" -102.357295 | \n",
" 1861 | \n",
" F13D18 | \n",
" 8 | \n",
" * | \n",
" * | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Mapa Cve_Ent Nom_Ent Nom_Abr Cve_Mun municipio_residencia \\\n",
"0 10010001 1 Aguascalientes Ags. 1 Aguascalientes \n",
"1 10010094 1 Aguascalientes Ags. 1 Aguascalientes \n",
"2 10010096 1 Aguascalientes Ags. 1 Aguascalientes \n",
"3 10010100 1 Aguascalientes Ags. 1 Aguascalientes \n",
"4 10010102 1 Aguascalientes Ags. 1 Aguascalientes \n",
"\n",
" Cve_Loc Nom_Loc Ámbito Latitud Longitud \\\n",
"0 1 Aguascalientes U 21°52´47.362N\" 102°17´45.768W\" \n",
"1 94 Granja Adelita R 21°52´18.749N\" 102°22´24.710W\" \n",
"2 96 Agua Azul R 21°53´01.522N\" 102°21´25.639W\" \n",
"3 100 Rancho Alegre R 21°51´16.556N\" 102°22´21.884W\" \n",
"4 102 Los Arbolitos [Rancho] R 21°46´48.650N\" 102°21´26.261W\" \n",
"\n",
" Lat_Decimal Lon_Decimal Altitud Cve_Carta Pob_Total Pob_Masculina \\\n",
"0 21.879823 -102.296047 1878 F13D19 863893 419168 \n",
"1 21.871875 -102.373531 1901 F13D18 5 * \n",
"2 21.883756 -102.357122 1861 F13D18 41 24 \n",
"3 21.854599 -102.372746 1879 F13D18 0 0 \n",
"4 21.780181 -102.357295 1861 F13D18 8 * \n",
"\n",
" Pob_Femenina Total De Viviendas Habitadas \n",
"0 444725 246259 \n",
"1 * 2 \n",
"2 17 12 \n",
"3 0 0 \n",
"4 * 2 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"areas_inegi.rename(\n",
" columns={'Nom_Mun':'municipio_residencia'}, # recordemos que cambiamos el nombre de la columna en el ejercicio anterior\n",
" inplace=True)\n",
"areas_inegi.head()"
]
},
{
"cell_type": "markdown",
"id": "3c08766a",
"metadata": {},
"source": [
"Ahora sí podremos empezar a \"jugar\" con nuestros datos para lograr un conjunto de datos con la información de geolocalización.\n",
"\n",
"## \"Inner\" merge\n",
"\n",
"De manera predeterminada, `pandas` combina nuestros datos con una unión \"inner\", es decir, que encuentra la información común de ambos conjuntos de datos y excluye la que no está en ambos. Podemos representar lo anterior a través de este diagrama de Venn:\n",
"\n",
"```{image} ../_static/imgs/merges/inner.jpeg\n",
":align: center\n",
"```\n",
"\n",
"Para hacer la unión de nuestros datos usaremos la siguiente sintaxis:\n",
"\n",
"```python\n",
"pd.merge(left, right, how='inner')\n",
"```\n",
"\n",
"Lo cual se traduce en:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c9b8efa5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" fecha_actualizacion | \n",
" id_registro | \n",
" origen | \n",
" sector | \n",
" entidad_um | \n",
" sexo | \n",
" entidad_nacimiento | \n",
" entidad_residencia | \n",
" municipio_residencia | \n",
" ... | \n",
" Latitud | \n",
" Longitud | \n",
" Lat_Decimal | \n",
" Lon_Decimal | \n",
" Altitud | \n",
" Cve_Carta | \n",
" Pob_Total | \n",
" Pob_Masculina | \n",
" Pob_Femenina | \n",
" Total De Viviendas Habitadas | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
0 rows × 59 columns
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [Unnamed: 0, fecha_actualizacion, id_registro, origen, sector, entidad_um, sexo, entidad_nacimiento, entidad_residencia, municipio_residencia, tipo_paciente, fecha_ingreso, fecha_sintomas, fecha_def, intubado, neumonia, edad, nacionalidad, embarazo, habla_lengua_indig, indigena, diabetes, epoc, asma, inmusupr, hipertension, otra_com, cardiovascular, obesidad, renal_cronica, tabaquismo, otro_caso, toma_muestra_lab, resultado_lab, toma_muestra_antigeno, resultado_antigeno, clasificacion_final, migrante, pais_nacionalidad, pais_origen, uci, Mapa, Cve_Ent, Nom_Ent, Nom_Abr, Cve_Mun, Cve_Loc, Nom_Loc, Ámbito, Latitud, Longitud, Lat_Decimal, Lon_Decimal, Altitud, Cve_Carta, Pob_Total, Pob_Masculina, Pob_Femenina, Total De Viviendas Habitadas]\n",
"Index: []\n",
"\n",
"[0 rows x 59 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inner_merge = pd.merge(covid_nacional, areas_inegi, how='inner', on='municipio_residencia')\n",
"inner_merge.head()"
]
},
{
"cell_type": "markdown",
"id": "a65b480f",
"metadata": {},
"source": [
"Pero ¡este no es el resultado que esperábamos! ¿Por qué nos regresa una tabla vacía pero con 59 columnas?\n",
"\n",
"Esto es fácil de identificar: `pandas` no encuentra datos comunes porque la información del conjunto de datos de `covid_nacional` está escrito en mayúsculas sostenidas y la información del conjunto de datos de `areas_inegi` está escrito en mayúsculas y minúsculas. Más adelante veremos con mayor detalle como \"normalizar\" nuestros datos para evitar estas inconsistencias.\n",
"\n",
"Por lo pronto, podemos transformar nuestros datos a minúsculas en la columna `municipio_residencia` tanto en `covid_nacional` como en `areas_inegi` para que podamos hacer la unión. Afortunadamente, ambos conjuntos de datos conservan la acentuación gráfica de los datos, lo que nos facilita su unificación."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "76e4c85d",
"metadata": {},
"outputs": [],
"source": [
"covid_nacional['municipio_residencia'] = covid_nacional['municipio_residencia'].str.lower()\n",
"areas_inegi['municipio_residencia'] = areas_inegi['municipio_residencia'].str.lower()"
]
},
{
"cell_type": "markdown",
"id": "b39be36c",
"metadata": {},
"source": [
"Ahora, apliquemos nuevamente la unión a nuestro conjunto de datos. Pero primero, vamos a reducir nuestro conjunto de datos para hacerlo más manejable.\n",
"\n",
"```{admonition} ¿Por qué reducimos los datos?\n",
":class: tip\n",
"Al experimentar con grandes conjuntos de datos es altamente recomendable ser amable con los recursos computacionales. Si hacemos estos ejercicios directamente con todo el conjunto de datos nos tomará más tiempo y estaremos expuestos a errores de memoria.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f5afb9cb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(624, 59)\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" fecha_actualizacion | \n",
" id_registro | \n",
" origen | \n",
" sector | \n",
" entidad_um | \n",
" sexo | \n",
" entidad_nacimiento | \n",
" entidad_residencia | \n",
" municipio_residencia | \n",
" ... | \n",
" Latitud | \n",
" Longitud | \n",
" Lat_Decimal | \n",
" Lon_Decimal | \n",
" Altitud | \n",
" Cve_Carta | \n",
" Pob_Total | \n",
" Pob_Masculina | \n",
" Pob_Femenina | \n",
" Total De Viviendas Habitadas | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" 2022-06-26 | \n",
" 0ba73d | \n",
" FUERA DE USMER | \n",
" ISSSTE | \n",
" CIUDAD DE MÉXICO | \n",
" MUJER | \n",
" QUERÉTARO | \n",
" MÉXICO | \n",
" naucalpan de juárez | \n",
" ... | \n",
" 19°28´43.690N\" | \n",
" 099°13´59.585W\" | \n",
" 19.478803 | \n",
" -99.233218 | \n",
" 2280 | \n",
" E14A39 | \n",
" 776220 | \n",
" 373698 | \n",
" 402522 | \n",
" 225509 | \n",
"
\n",
" \n",
" 1 | \n",
" 8 | \n",
" 2022-06-26 | \n",
" 0ba73d | \n",
" FUERA DE USMER | \n",
" ISSSTE | \n",
" CIUDAD DE MÉXICO | \n",
" MUJER | \n",
" QUERÉTARO | \n",
" MÉXICO | \n",
" naucalpan de juárez | \n",
" ... | \n",
" 19°26´28.806N\" | \n",
" 099°20´21.556W\" | \n",
" 19.441335 | \n",
" -99.339321 | \n",
" 2846 | \n",
" E14A38 | \n",
" 9920 | \n",
" 4802 | \n",
" 5118 | \n",
" 2404 | \n",
"
\n",
" \n",
" 2 | \n",
" 8 | \n",
" 2022-06-26 | \n",
" 0ba73d | \n",
" FUERA DE USMER | \n",
" ISSSTE | \n",
" CIUDAD DE MÉXICO | \n",
" MUJER | \n",
" QUERÉTARO | \n",
" MÉXICO | \n",
" naucalpan de juárez | \n",
" ... | \n",
" 19°28´32.024N\" | \n",
" 099°20´39.833W\" | \n",
" 19.475562 | \n",
" -99.344398 | \n",
" 2738 | \n",
" E14A38 | \n",
" 3595 | \n",
" 1773 | \n",
" 1822 | \n",
" 942 | \n",
"
\n",
" \n",
" 3 | \n",
" 8 | \n",
" 2022-06-26 | \n",
" 0ba73d | \n",
" FUERA DE USMER | \n",
" ISSSTE | \n",
" CIUDAD DE MÉXICO | \n",
" MUJER | \n",
" QUERÉTARO | \n",
" MÉXICO | \n",
" naucalpan de juárez | \n",
" ... | \n",
" 19°26´21.138N\" | \n",
" 099°22´38.200W\" | \n",
" 19.439205 | \n",
" -99.377278 | \n",
" 3298 | \n",
" E14A38 | \n",
" 50 | \n",
" 28 | \n",
" 22 | \n",
" 21 | \n",
"
\n",
" \n",
" 4 | \n",
" 8 | \n",
" 2022-06-26 | \n",
" 0ba73d | \n",
" FUERA DE USMER | \n",
" ISSSTE | \n",
" CIUDAD DE MÉXICO | \n",
" MUJER | \n",
" QUERÉTARO | \n",
" MÉXICO | \n",
" naucalpan de juárez | \n",
" ... | \n",
" 19°29´44.683N\" | \n",
" 099°18´19.417W\" | \n",
" 19.495745 | \n",
" -99.305394 | \n",
" 2501 | \n",
" E14A39 | \n",
" 1013 | \n",
" 490 | \n",
" 523 | \n",
" 255 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 59 columns
\n",
"
"
],
"text/plain": [
" Unnamed: 0 fecha_actualizacion id_registro origen sector \\\n",
"0 8 2022-06-26 0ba73d FUERA DE USMER ISSSTE \n",
"1 8 2022-06-26 0ba73d FUERA DE USMER ISSSTE \n",
"2 8 2022-06-26 0ba73d FUERA DE USMER ISSSTE \n",
"3 8 2022-06-26 0ba73d FUERA DE USMER ISSSTE \n",
"4 8 2022-06-26 0ba73d FUERA DE USMER ISSSTE \n",
"\n",
" entidad_um sexo entidad_nacimiento entidad_residencia \\\n",
"0 CIUDAD DE MÉXICO MUJER QUERÉTARO MÉXICO \n",
"1 CIUDAD DE MÉXICO MUJER QUERÉTARO MÉXICO \n",
"2 CIUDAD DE MÉXICO MUJER QUERÉTARO MÉXICO \n",
"3 CIUDAD DE MÉXICO MUJER QUERÉTARO MÉXICO \n",
"4 CIUDAD DE MÉXICO MUJER QUERÉTARO MÉXICO \n",
"\n",
" municipio_residencia ... Latitud Longitud Lat_Decimal \\\n",
"0 naucalpan de juárez ... 19°28´43.690N\" 099°13´59.585W\" 19.478803 \n",
"1 naucalpan de juárez ... 19°26´28.806N\" 099°20´21.556W\" 19.441335 \n",
"2 naucalpan de juárez ... 19°28´32.024N\" 099°20´39.833W\" 19.475562 \n",
"3 naucalpan de juárez ... 19°26´21.138N\" 099°22´38.200W\" 19.439205 \n",
"4 naucalpan de juárez ... 19°29´44.683N\" 099°18´19.417W\" 19.495745 \n",
"\n",
" Lon_Decimal Altitud Cve_Carta Pob_Total Pob_Masculina Pob_Femenina \\\n",
"0 -99.233218 2280 E14A39 776220 373698 402522 \n",
"1 -99.339321 2846 E14A38 9920 4802 5118 \n",
"2 -99.344398 2738 E14A38 3595 1773 1822 \n",
"3 -99.377278 3298 E14A38 50 28 22 \n",
"4 -99.305394 2501 E14A39 1013 490 523 \n",
"\n",
" Total De Viviendas Habitadas \n",
"0 225509 \n",
"1 2404 \n",
"2 942 \n",
"3 21 \n",
"4 255 \n",
"\n",
"[5 rows x 59 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"covid_nacional_tm = covid_nacional[:100] # reducimos el conjunto de datos a 100 filas\n",
"\n",
"inner_merge = pd.merge(covid_nacional_tm, areas_inegi, how='inner', on='municipio_residencia')\n",
"print(inner_merge.shape)\n",
"inner_merge.head()"
]
},
{
"cell_type": "markdown",
"id": "c926bc17",
"metadata": {},
"source": [
"Ahora estamos enfrentándonos a otro problema: ¡el resultado de la unión son 624 filas! ¿Por qué? Esto se debe a que los datos de `municipio_residencia` no están asociados con un campo único. Esto lo podemos ver aquí:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a70931b8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Mapa | \n",
" Cve_Ent | \n",
" Nom_Ent | \n",
" Nom_Abr | \n",
" Cve_Mun | \n",
" municipio_residencia | \n",
" Cve_Loc | \n",
" Nom_Loc | \n",
" Ámbito | \n",
" Latitud | \n",
" Longitud | \n",
" Lat_Decimal | \n",
" Lon_Decimal | \n",
" Altitud | \n",
" Cve_Carta | \n",
" Pob_Total | \n",
" Pob_Masculina | \n",
" Pob_Femenina | \n",
" Total De Viviendas Habitadas | \n",
"
\n",
" \n",
" \n",
" \n",
" 140805 | \n",
" 150250001 | \n",
" 15 | \n",
" México | \n",
" Mex. | \n",
" 25 | \n",
" chalco | \n",
" 1 | \n",
" Chalco de Díaz Covarrubias | \n",
" U | \n",
" 19°15´40.074N\" | \n",
" 098°53´44.357W\" | \n",
" 19.261132 | \n",
" -98.895655 | \n",
" 2238 | \n",
" E14B31 | \n",
" 174704 | \n",
" 83207 | \n",
" 91497 | \n",
" 45578 | \n",
"
\n",
" \n",
" 140806 | \n",
" 150250002 | \n",
" 15 | \n",
" México | \n",
" Mex. | \n",
" 25 | \n",
" chalco | \n",
" 2 | \n",
" La Candelaria Tlapala | \n",
" U | \n",
" 19°14´26.710N\" | \n",
" 098°50´49.712W\" | \n",
" 19.240753 | \n",
" -98.847142 | \n",
" 2254 | \n",
" E14B41 | \n",
" 9446 | \n",
" 4620 | \n",
" 4826 | \n",
" 2428 | \n",
"
\n",
" \n",
" 140807 | \n",
" 150250005 | \n",
" 15 | \n",
" México | \n",
" Mex. | \n",
" 25 | \n",
" chalco | \n",
" 5 | \n",
" San Gregorio Cuautzingo | \n",
" U | \n",
" 19°15´25.528N\" | \n",
" 098°51´29.893W\" | \n",
" 19.257091 | \n",
" -98.858304 | \n",
" 2244 | \n",
" E14B31 | \n",
" 8485 | \n",
" 4109 | \n",
" 4376 | \n",
" 2134 | \n",
"
\n",
" \n",
" 140808 | \n",
" 150250009 | \n",
" 15 | \n",
" México | \n",
" Mex. | \n",
" 25 | \n",
" chalco | \n",
" 9 | \n",
" Instituto Damián (Ex-Hacienda San Juan de Dios) | \n",
" R | \n",
" 19°14´16.312N\" | \n",
" 098°54´32.585W\" | \n",
" 19.237864 | \n",
" -98.909051 | \n",
" 2241 | \n",
" E14B41 | \n",
" 1 | \n",
" * | \n",
" * | \n",
" 1 | \n",
"
\n",
" \n",
" 140809 | \n",
" 150250010 | \n",
" 15 | \n",
" México | \n",
" Mex. | \n",
" 25 | \n",
" chalco | \n",
" 10 | \n",
" San Juan Tezompa | \n",
" U | \n",
" 19°12´25.717N\" | \n",
" 098°57´37.832W\" | \n",
" 19.207144 | \n",
" -98.960509 | \n",
" 2239 | \n",
" E14B41 | \n",
" 13127 | \n",
" 6422 | \n",
" 6705 | \n",
" 3453 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 140872 | \n",
" 150250147 | \n",
" 15 | \n",
" México | \n",
" Mex. | \n",
" 25 | \n",
" chalco | \n",
" 147 | \n",
" Hacienda San Juan [Conjunto Urbano] | \n",
" U | \n",
" 19°16´08.352N\" | \n",
" 098°50´25.142W\" | \n",
" 19.268987 | \n",
" -98.840317 | \n",
" 2257 | \n",
" E14B31 | \n",
" 1623 | \n",
" 774 | \n",
" 849 | \n",
" 458 | \n",
"
\n",
" \n",
" 140873 | \n",
" 150250148 | \n",
" 15 | \n",
" México | \n",
" Mex. | \n",
" 25 | \n",
" chalco | \n",
" 148 | \n",
" Villas de San Martín [Conjunto Urbano] | \n",
" U | \n",
" 19°16´03.967N\" | \n",
" 098°50´09.801W\" | \n",
" 19.267769 | \n",
" -98.836056 | \n",
" 2258 | \n",
" E14B31 | \n",
" 5714 | \n",
" 2786 | \n",
" 2928 | \n",
" 1698 | \n",
"
\n",
" \n",
" 140874 | \n",
" 150250149 | \n",
" 15 | \n",
" México | \n",
" Mex. | \n",
" 25 | \n",
" chalco | \n",
" 149 | \n",
" Los Héroes Chalco [Conjunto Urbano] | \n",
" U | \n",
" 19°15´37.533N\" | \n",
" 098°50´29.037W\" | \n",
" 19.260426 | \n",
" -98.841399 | \n",
" 2253 | \n",
" E14B31 | \n",
" 34277 | \n",
" 16552 | \n",
" 17725 | \n",
" 9965 | \n",
"
\n",
" \n",
" 140875 | \n",
" 150250151 | \n",
" 15 | \n",
" México | \n",
" Mex. | \n",
" 25 | \n",
" chalco | \n",
" 151 | \n",
" Valle Plateado [Fraccionamiento] | \n",
" R | \n",
" 19°13´06.312N\" | \n",
" 098°47´15.789W\" | \n",
" 19.218420 | \n",
" -98.787719 | \n",
" 2449 | \n",
" E14B41 | \n",
" 67 | \n",
" 36 | \n",
" 31 | \n",
" 20 | \n",
"
\n",
" \n",
" 140876 | \n",
" 150250152 | \n",
" 15 | \n",
" México | \n",
" Mex. | \n",
" 25 | \n",
" chalco | \n",
" 152 | \n",
" La Pulida | \n",
" R | \n",
" 19°13´30.678N\" | \n",
" 098°49´25.945W\" | \n",
" 19.225188 | \n",
" -98.823874 | \n",
" 2292 | \n",
" E14B41 | \n",
" 154 | \n",
" 68 | \n",
" 86 | \n",
" 39 | \n",
"
\n",
" \n",
"
\n",
"
72 rows × 19 columns
\n",
"
"
],
"text/plain": [
" Mapa Cve_Ent Nom_Ent Nom_Abr Cve_Mun municipio_residencia \\\n",
"140805 150250001 15 México Mex. 25 chalco \n",
"140806 150250002 15 México Mex. 25 chalco \n",
"140807 150250005 15 México Mex. 25 chalco \n",
"140808 150250009 15 México Mex. 25 chalco \n",
"140809 150250010 15 México Mex. 25 chalco \n",
"... ... ... ... ... ... ... \n",
"140872 150250147 15 México Mex. 25 chalco \n",
"140873 150250148 15 México Mex. 25 chalco \n",
"140874 150250149 15 México Mex. 25 chalco \n",
"140875 150250151 15 México Mex. 25 chalco \n",
"140876 150250152 15 México Mex. 25 chalco \n",
"\n",
" Cve_Loc Nom_Loc Ámbito \\\n",
"140805 1 Chalco de Díaz Covarrubias U \n",
"140806 2 La Candelaria Tlapala U \n",
"140807 5 San Gregorio Cuautzingo U \n",
"140808 9 Instituto Damián (Ex-Hacienda San Juan de Dios) R \n",
"140809 10 San Juan Tezompa U \n",
"... ... ... ... \n",
"140872 147 Hacienda San Juan [Conjunto Urbano] U \n",
"140873 148 Villas de San Martín [Conjunto Urbano] U \n",
"140874 149 Los Héroes Chalco [Conjunto Urbano] U \n",
"140875 151 Valle Plateado [Fraccionamiento] R \n",
"140876 152 La Pulida R \n",
"\n",
" Latitud Longitud Lat_Decimal Lon_Decimal Altitud \\\n",
"140805 19°15´40.074N\" 098°53´44.357W\" 19.261132 -98.895655 2238 \n",
"140806 19°14´26.710N\" 098°50´49.712W\" 19.240753 -98.847142 2254 \n",
"140807 19°15´25.528N\" 098°51´29.893W\" 19.257091 -98.858304 2244 \n",
"140808 19°14´16.312N\" 098°54´32.585W\" 19.237864 -98.909051 2241 \n",
"140809 19°12´25.717N\" 098°57´37.832W\" 19.207144 -98.960509 2239 \n",
"... ... ... ... ... ... \n",
"140872 19°16´08.352N\" 098°50´25.142W\" 19.268987 -98.840317 2257 \n",
"140873 19°16´03.967N\" 098°50´09.801W\" 19.267769 -98.836056 2258 \n",
"140874 19°15´37.533N\" 098°50´29.037W\" 19.260426 -98.841399 2253 \n",
"140875 19°13´06.312N\" 098°47´15.789W\" 19.218420 -98.787719 2449 \n",
"140876 19°13´30.678N\" 098°49´25.945W\" 19.225188 -98.823874 2292 \n",
"\n",
" Cve_Carta Pob_Total Pob_Masculina Pob_Femenina \\\n",
"140805 E14B31 174704 83207 91497 \n",
"140806 E14B41 9446 4620 4826 \n",
"140807 E14B31 8485 4109 4376 \n",
"140808 E14B41 1 * * \n",
"140809 E14B41 13127 6422 6705 \n",
"... ... ... ... ... \n",
"140872 E14B31 1623 774 849 \n",
"140873 E14B31 5714 2786 2928 \n",
"140874 E14B31 34277 16552 17725 \n",
"140875 E14B41 67 36 31 \n",
"140876 E14B41 154 68 86 \n",
"\n",
" Total De Viviendas Habitadas \n",
"140805 45578 \n",
"140806 2428 \n",
"140807 2134 \n",
"140808 1 \n",
"140809 3453 \n",
"... ... \n",
"140872 458 \n",
"140873 1698 \n",
"140874 9965 \n",
"140875 20 \n",
"140876 39 \n",
"\n",
"[72 rows x 19 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"areas_inegi.loc[areas_inegi['municipio_residencia'] == 'chalco']"
]
},
{
"cell_type": "markdown",
"id": "45f7db7d",
"metadata": {},
"source": [
"Al hacer el merge, el conjunto de datos se multiplica por las coincidencias, encontrándonos así con una tabla que nos da información inconsistente.\n",
"\n",
"```{admonition} ¡importante!\n",
":class: tip\n",
"Un error común al hacer un merge es terminar con un conjunto de datos mucho mayor del esperado. Esto se debe a que provienen de fuentes diferentes y no se puede predecir con exactitud cuántos datos se obtendrán.\n",
"Por esta razón es **fundamental** asegurarse de que el conjunto de datos resultante sea el esperado. Una manera práctica de hacerlo es valiéndose del módulo shape. Si el resultado es un número inconsistente (por ejemplo, que 100 casos de Covid se convirtieran en 624) debemos revisar dónde se encuentra el error. \n",
"\n",
"Lastimosamente no existe una fórmula que abarque todos los casos, así que este es un paso que depende en buena medida de nuestra capacidad para detectar errores.\n",
"```\n",
"\n",
"Tendremos finalmente que hacer una depuración de los datos de `areas_inegi_tm` para que coincida con nuestros datos. Encontramos que el registro de datos de `covid_nacional` usa como `municipio_residencia` el nombre principal del municipio, en este caso, coincide con la clave `1` del campo `Cve_Loc` en `areas_inegi`. Segmentaremos nuestro conjunto de datos valiéndonos del método `.loc` y podremos hacer nuevamente nuestro 'inner merge'"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "57368622",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(15, 59)\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" fecha_actualizacion | \n",
" id_registro | \n",
" origen | \n",
" sector | \n",
" entidad_um | \n",
" sexo | \n",
" entidad_nacimiento | \n",
" entidad_residencia | \n",
" municipio_residencia | \n",
" ... | \n",
" Latitud | \n",
" Longitud | \n",
" Lat_Decimal | \n",
" Lon_Decimal | \n",
" Altitud | \n",
" Cve_Carta | \n",
" Pob_Total | \n",
" Pob_Masculina | \n",
" Pob_Femenina | \n",
" Total De Viviendas Habitadas | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" 2022-06-26 | \n",
" 0ba73d | \n",
" FUERA DE USMER | \n",
" ISSSTE | \n",
" CIUDAD DE MÉXICO | \n",
" MUJER | \n",
" QUERÉTARO | \n",
" MÉXICO | \n",
" naucalpan de juárez | \n",
" ... | \n",
" 19°28´43.690N\" | \n",
" 099°13´59.585W\" | \n",
" 19.478803 | \n",
" -99.233218 | \n",
" 2280 | \n",
" E14A39 | \n",
" 776220 | \n",
" 373698 | \n",
" 402522 | \n",
" 225509 | \n",
"
\n",
" \n",
" 1 | \n",
" 11 | \n",
" 2022-06-26 | \n",
" 04fc18 | \n",
" FUERA DE USMER | \n",
" SEDENA | \n",
" CIUDAD DE MÉXICO | \n",
" HOMBRE | \n",
" CIUDAD DE MÉXICO | \n",
" MÉXICO | \n",
" atizapán | \n",
" ... | \n",
" 19°10´36.328N\" | \n",
" 099°29´16.591W\" | \n",
" 19.176758 | \n",
" -99.487942 | \n",
" 2590 | \n",
" E14A48 | \n",
" 10873 | \n",
" 5250 | \n",
" 5623 | \n",
" 2342 | \n",
"
\n",
" \n",
" 2 | \n",
" 24 | \n",
" 2022-06-26 | \n",
" 3a5f83 | \n",
" FUERA DE USMER | \n",
" SSA | \n",
" CIUDAD DE MÉXICO | \n",
" MUJER | \n",
" CIUDAD DE MÉXICO | \n",
" MÉXICO | \n",
" zumpango | \n",
" ... | \n",
" 19°47´49.643N\" | \n",
" 099°06´02.282W\" | \n",
" 19.797123 | \n",
" -99.100634 | \n",
" 2261 | \n",
" E14A19 | \n",
" 53362 | \n",
" 26058 | \n",
" 27304 | \n",
" 13659 | \n",
"
\n",
" \n",
" 3 | \n",
" 25 | \n",
" 2022-06-26 | \n",
" 3d59ea | \n",
" FUERA DE USMER | \n",
" SSA | \n",
" CIUDAD DE MÉXICO | \n",
" MUJER | \n",
" GUERRERO | \n",
" MÉXICO | \n",
" zumpango | \n",
" ... | \n",
" 19°47´49.643N\" | \n",
" 099°06´02.282W\" | \n",
" 19.797123 | \n",
" -99.100634 | \n",
" 2261 | \n",
" E14A19 | \n",
" 53362 | \n",
" 26058 | \n",
" 27304 | \n",
" 13659 | \n",
"
\n",
" \n",
" 4 | \n",
" 26 | \n",
" 2022-06-26 | \n",
" 284743 | \n",
" USMER | \n",
" ISSSTE | \n",
" CIUDAD DE MÉXICO | \n",
" MUJER | \n",
" CIUDAD DE MÉXICO | \n",
" MÉXICO | \n",
" nezahualcóyotl | \n",
" ... | \n",
" 19°24´31.548N\" | \n",
" 099°01´05.520W\" | \n",
" 19.408763 | \n",
" -99.018200 | \n",
" 2262 | \n",
" E14A39 | \n",
" 1072676 | \n",
" 515678 | \n",
" 556998 | \n",
" 297975 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 59 columns
\n",
"
"
],
"text/plain": [
" Unnamed: 0 fecha_actualizacion id_registro origen sector \\\n",
"0 8 2022-06-26 0ba73d FUERA DE USMER ISSSTE \n",
"1 11 2022-06-26 04fc18 FUERA DE USMER SEDENA \n",
"2 24 2022-06-26 3a5f83 FUERA DE USMER SSA \n",
"3 25 2022-06-26 3d59ea FUERA DE USMER SSA \n",
"4 26 2022-06-26 284743 USMER ISSSTE \n",
"\n",
" entidad_um sexo entidad_nacimiento entidad_residencia \\\n",
"0 CIUDAD DE MÉXICO MUJER QUERÉTARO MÉXICO \n",
"1 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO MÉXICO \n",
"2 CIUDAD DE MÉXICO MUJER CIUDAD DE MÉXICO MÉXICO \n",
"3 CIUDAD DE MÉXICO MUJER GUERRERO MÉXICO \n",
"4 CIUDAD DE MÉXICO MUJER CIUDAD DE MÉXICO MÉXICO \n",
"\n",
" municipio_residencia ... Latitud Longitud Lat_Decimal \\\n",
"0 naucalpan de juárez ... 19°28´43.690N\" 099°13´59.585W\" 19.478803 \n",
"1 atizapán ... 19°10´36.328N\" 099°29´16.591W\" 19.176758 \n",
"2 zumpango ... 19°47´49.643N\" 099°06´02.282W\" 19.797123 \n",
"3 zumpango ... 19°47´49.643N\" 099°06´02.282W\" 19.797123 \n",
"4 nezahualcóyotl ... 19°24´31.548N\" 099°01´05.520W\" 19.408763 \n",
"\n",
" Lon_Decimal Altitud Cve_Carta Pob_Total Pob_Masculina Pob_Femenina \\\n",
"0 -99.233218 2280 E14A39 776220 373698 402522 \n",
"1 -99.487942 2590 E14A48 10873 5250 5623 \n",
"2 -99.100634 2261 E14A19 53362 26058 27304 \n",
"3 -99.100634 2261 E14A19 53362 26058 27304 \n",
"4 -99.018200 2262 E14A39 1072676 515678 556998 \n",
"\n",
" Total De Viviendas Habitadas \n",
"0 225509 \n",
"1 2342 \n",
"2 13659 \n",
"3 13659 \n",
"4 297975 \n",
"\n",
"[5 rows x 59 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# segmentamos nuestro conjunto de datos de areas_inegi\n",
"areas_inegi_tm = areas_inegi.loc[areas_inegi['Cve_Loc'] == 1]\n",
"\n",
"# realizamos nuestro merge\n",
"inner_merge = pd.merge(covid_nacional_tm, areas_inegi_tm, how='inner', on='municipio_residencia')\n",
"print(inner_merge.shape)\n",
"inner_merge.head()"
]
},
{
"cell_type": "markdown",
"id": "f888a1a8",
"metadata": {},
"source": [
"¡Eureka! Tenemos una tabla con el resultado esperado. ¿Cómo lo sabemos? (15 filas parece un número sospechoso) Porque coincide con el número de valores no nulos que nos regresa la columna `municipio_residencia` en `covid_nacional_tm`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "70a6771c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nezahualcóyotl 5\n",
"zumpango 2\n",
"chalco 2\n",
"naucalpan de juárez 1\n",
"atizapán 1\n",
"atlautla 1\n",
"ciudad fernández 1\n",
"ecatepec de morelos 1\n",
"huixquilucan 1\n",
"Name: municipio_residencia, dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"covid_nacional_tm['municipio_residencia'].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "59a7b30c",
"metadata": {},
"source": [
"## Outer Merge\n",
"\n",
"La unión \"hacia afuera\" es todo lo contrario a un \"inner merge\". En este caso, combinamos todos los campos de `covid_nacional` con todos los campos de `areas_inegi`, incluyendo aquellos que no coincidan. Así lo vemos en un diagrama de Venn:\n",
"\n",
"```{image} ../_static/imgs/merges/outer.jpeg\n",
":align: center\n",
"```\n",
"\n",
"Apliquemos un \"outer merge\" a nuestro ejemplo anterior para ver los resultados:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "14854150",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2564, 59)\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" fecha_actualizacion | \n",
" id_registro | \n",
" origen | \n",
" sector | \n",
" entidad_um | \n",
" sexo | \n",
" entidad_nacimiento | \n",
" entidad_residencia | \n",
" municipio_residencia | \n",
" ... | \n",
" Latitud | \n",
" Longitud | \n",
" Lat_Decimal | \n",
" Lon_Decimal | \n",
" Altitud | \n",
" Cve_Carta | \n",
" Pob_Total | \n",
" Pob_Masculina | \n",
" Pob_Femenina | \n",
" Total De Viviendas Habitadas | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 2022-06-26 | \n",
" 0793b8 | \n",
" FUERA DE USMER | \n",
" SSA | \n",
" CIUDAD DE MÉXICO | \n",
" HOMBRE | \n",
" CIUDAD DE MÉXICO | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 2.0 | \n",
" 2022-06-26 | \n",
" 0fef08 | \n",
" USMER | \n",
" SSA | \n",
" CIUDAD DE MÉXICO | \n",
" HOMBRE | \n",
" CIUDAD DE MÉXICO | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 3.0 | \n",
" 2022-06-26 | \n",
" 11e31a | \n",
" FUERA DE USMER | \n",
" SSA | \n",
" CIUDAD DE MÉXICO | \n",
" HOMBRE | \n",
" CIUDAD DE MÉXICO | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 4.0 | \n",
" 2022-06-26 | \n",
" 0741e4 | \n",
" FUERA DE USMER | \n",
" ISSSTE | \n",
" CIUDAD DE MÉXICO | \n",
" HOMBRE | \n",
" CIUDAD DE MÉXICO | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 2022-06-26 | \n",
" 13c92b | \n",
" FUERA DE USMER | \n",
" SSA | \n",
" CIUDAD DE MÉXICO | \n",
" MUJER | \n",
" CIUDAD DE MÉXICO | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 59 columns
\n",
"
"
],
"text/plain": [
" Unnamed: 0 fecha_actualizacion id_registro origen sector \\\n",
"0 1.0 2022-06-26 0793b8 FUERA DE USMER SSA \n",
"1 2.0 2022-06-26 0fef08 USMER SSA \n",
"2 3.0 2022-06-26 11e31a FUERA DE USMER SSA \n",
"3 4.0 2022-06-26 0741e4 FUERA DE USMER ISSSTE \n",
"4 5.0 2022-06-26 13c92b FUERA DE USMER SSA \n",
"\n",
" entidad_um sexo entidad_nacimiento entidad_residencia \\\n",
"0 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO NaN \n",
"1 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO NaN \n",
"2 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO NaN \n",
"3 CIUDAD DE MÉXICO HOMBRE CIUDAD DE MÉXICO NaN \n",
"4 CIUDAD DE MÉXICO MUJER CIUDAD DE MÉXICO NaN \n",
"\n",
" municipio_residencia ... Latitud Longitud Lat_Decimal Lon_Decimal Altitud \\\n",
"0 NaN ... NaN NaN NaN NaN NaN \n",
"1 NaN ... NaN NaN NaN NaN NaN \n",
"2 NaN ... NaN NaN NaN NaN NaN \n",
"3 NaN ... NaN NaN NaN NaN NaN \n",
"4 NaN ... NaN NaN NaN NaN NaN \n",
"\n",
" Cve_Carta Pob_Total Pob_Masculina Pob_Femenina Total De Viviendas Habitadas \n",
"0 NaN NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN NaN \n",
"\n",
"[5 rows x 59 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"outer_merge = pd.merge(covid_nacional_tm, areas_inegi_tm, how='outer', on='municipio_residencia')\n",
"print(outer_merge.shape)\n",
"outer_merge.head()"
]
},
{
"cell_type": "markdown",
"id": "6a978fd3",
"metadata": {},
"source": [
"Vemos que el resultado son 2,564 filas. Es una cantidad bastante grande, pero no es un error. Lo que sucede en este caso, es que se combinan los datos que coincide con los valores de `municipio_residencia` y se combinan los datos que no coinciden. Donde no hay coincidencias, las celdas se llenan con valores nulos.\n",
"\n",
"Para nuestro ejemplo, una unión hacia afuera no es útil. Pero supongamos que tenemos un conjunto de datos de estudiantes con las calificaciones del primer trimestre:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "cad4acde",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Nombre | \n",
" Calificaciones_T1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Andrea | \n",
" 10 | \n",
"
\n",
" \n",
" 1 | \n",
" Berenice | \n",
" 8 | \n",
"
\n",
" \n",
" 2 | \n",
" Carlos | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Nombre Calificaciones_T1\n",
"0 Andrea 10\n",
"1 Berenice 8\n",
"2 Carlos 9"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"estudiantes_trimestre1 = pd.DataFrame(\n",
" {'Nombre': [\"Andrea\", \"Berenice\", \"Carlos\"],\n",
" 'Calificaciones_T1': [10, 8, 9]})\n",
"estudiantes_trimestre1"
]
},
{
"cell_type": "markdown",
"id": "b8514800",
"metadata": {},
"source": [
"Ahora, queremos agregar las calificaciones de otro conjunto de datos correspondiente al segundo trimestre:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "502ee17f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Nombre | \n",
" Calificaciones_T2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Andrea | \n",
" 9 | \n",
"
\n",
" \n",
" 1 | \n",
" Berenice | \n",
" 7 | \n",
"
\n",
" \n",
" 2 | \n",
" Carlos | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Nombre Calificaciones_T2\n",
"0 Andrea 9\n",
"1 Berenice 7\n",
"2 Carlos 8"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"estudiantes_trimestre2 = pd.DataFrame(\n",
" {'Nombre': [\"Andrea\", \"Berenice\", \"Carlos\"],\n",
" 'Calificaciones_T2': [9, 7, 8]})\n",
"estudiantes_trimestre2"
]
},
{
"cell_type": "markdown",
"id": "2cb1fc9b",
"metadata": {},
"source": [
"En este caso, un 'outer merge' es indicado:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f3c14823",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Nombre | \n",
" Calificaciones_T1 | \n",
" Calificaciones_T2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Andrea | \n",
" 10 | \n",
" 9 | \n",
"
\n",
" \n",
" 1 | \n",
" Berenice | \n",
" 8 | \n",
" 7 | \n",
"
\n",
" \n",
" 2 | \n",
" Carlos | \n",
" 9 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Nombre Calificaciones_T1 Calificaciones_T2\n",
"0 Andrea 10 9\n",
"1 Berenice 8 7\n",
"2 Carlos 9 8"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"outer_merge = pd.merge(estudiantes_trimestre1, estudiantes_trimestre2, how='outer', on='Nombre')\n",
"outer_merge"
]
},
{
"cell_type": "markdown",
"id": "35f1cf76",
"metadata": {},
"source": [
"## A modo de cierre\n",
"\n",
"Es necesario que selecciones tu método de unión de acuerdo a la lógica de tus datos. Revisa los resultados de tus combinaciones y ajusta los datos para que lleguen al resultado esperado. En ocasiones esto no es un proceso simple, tienes que reducir los datos y hacerlos manejables para que sea mucho más sencillo llegar a encontrar el error. Todas estas son prácticas que debes tener en cuenta y que no dependen del lenguaje de programación que estes utilizando.\n",
"\n",
"## Para saber más\n",
"\n",
"Existen otros métodos de unión de conjuntos de datos como el 'left merge' y el 'right merge', el método 'join' o la concatenación ('concat'). Para no extendernos más, quisiera sugerirles el capítulo 2 del libro de {cite}`stepanek_thinking_2020` donde se explica de manera sintética cómo utilizar estos métodos."
]
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"formats": "md:myst",
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.13,
"jupytext_version": "1.14.0"
}
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"source_map": [
14,
22,
36,
42,
48,
53,
73,
76,
84,
87,
96,
103,
107,
109,
123,
131,
135,
137,
149,
153,
159,
164,
168,
173,
177,
180
]
},
"nbformat": 4,
"nbformat_minor": 5
}