If you're viewing this on your phone, I'd recommend enabling desktop mode. In Google Chrome, tap the three dots at the top right of your screen and check the box next to Desktop Site. You can then zoom in and out of any section with your fingers. My apologies for the hassle and thanks for reading!
Here, I go through an exploratory analysis of Baltimore city employee salaries. I obtained all data through the City of Baltimore's Open Baltimore Data platform (OBD). I wanted to describe and illustrate a process for taking publicly available data and using it to conduct basic data analysis and visualization. All tools used here are free and open source (Google Cloud is a slight exception, but the free tier will get you pretty far).
I created a small Python function that batches calls to the OBD REST API. I connected the GitHub repo where I store the Python code to a gcloud source repo. Each time I push a change to the master branch, GitHub notifies gcloud, and Cloud Build uses the updated cloud source repo to deploy a new version to Cloud Functions. Since OBD appears to only update the data once per year or so, I run the function manually. However, Cloud Scheduler could run my function at any interval I want using cron jobs.
Once the function finishes querying OBD, it stores the data in a csv in Cloud Storage. From there, I connect the csv to a BigQuery table. I considered updating the function to write directly to the BigQuery table, but I want to see how regularly OBD updates this data set going forward and if they change the schema. They've changed the schema for this data set at least once that I've seen. I'll add some other details about this data set and a flow chart in the technical notes below.
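For anyone curious, here's a minimal sketch of what that function could look like. It assumes an ArcGIS-style REST endpoint with resultOffset/resultRecordCount paging, and the endpoint URL, bucket name, and file name are placeholders rather than the real values:

# hypothetical sketch of the batching function; endpoint URL, bucket, and
# file names are placeholders, and the paging params assume an ArcGIS-style API
import csv
import io
import requests
from google.cloud import storage

ENDPOINT = "https://example.com/arcgis/rest/services/salaries/query"  # placeholder
BATCH_SIZE = 1000

def fetch_all_records():
    # page through the API until a batch comes back empty
    records, offset = [], 0
    while True:
        resp = requests.get(ENDPOINT, params={
            "where": "1=1",
            "outFields": "*",
            "f": "json",
            "resultOffset": offset,
            "resultRecordCount": BATCH_SIZE,
        })
        resp.raise_for_status()
        features = resp.json().get("features", [])
        if not features:
            break
        records.extend(f["attributes"] for f in features)
        offset += BATCH_SIZE
    return records

def store_salaries(request):
    # Cloud Functions HTTP entry point: fetch everything, write one csv to GCS
    records = fetch_all_records()
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
    bucket = storage.Client().bucket("my-salary-bucket")  # placeholder bucket
    bucket.blob("city_employee_salaries.csv").upload_from_string(
        buffer.getvalue(), content_type="text/csv")
    return f"stored {len(records)} records"

Cloud Scheduler could hit that same HTTP entry point on a cron schedule if OBD ever starts updating more often.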
Some technical notes on the data set before diving in:

- I combined firstName, lastName, and middleInitial fields with AgencyID to create an employeeSlug that I use to track workers across organizations. With just the Name and AgencyID fields to work with in this release, tracking individual workers is difficult, though not impossible. It'll definitely be the work of a future analysis. This initial analysis focuses on agency-level salary measures.
- I excluded records missing the HireDate field. I wanted to keep the initial analysis as robust as possible, so I elected to clean the data in this way. I also excluded records that were missing data in the annualSalary field, since that field was essential to my measures for this initial analysis.
- I consolidated AgencyName values because the data set duplicated some agency names, made distinctions inside of an agency I didn't want to keep, etc. For example, OBD broke out "Rec & Parks" into "Rec & Parks - Parks", "Rec & Parks - part-time", "Rec & Parks - Admin", etc.

Gotta get the data in! As I noted in the beginning, I stored this data in a BigQuery table. So the first step is to query the table and bring in the salary and employee name data I want. After I created a Google Cloud account and project, I installed the Google Cloud SDK for Python. Lots of trial and error, but overall a pretty straightforward process. Unfortunately, I don't have a very comprehensive set of links to show y'all how to get started on Google Cloud if you wanted to recreate this setup, but I'll list the basic steps below:
- Create a Google Cloud account and project.
- Install the Google Cloud SDK for Python.
- Set up authentication so you only need a project_id and nothing else to connect to my gcloud instances. Highly recommend this and it was very easy.

Once all that is done, you can run some version of the code below and bring your data in for analysis.
# import necessary libraries
from google.cloud import bigquery
import pandas as pd
import numpy as np
import datetime as dt
from dateutil import relativedelta as rd
import plotly.express as px
# build client to connect to open-baltimore project
project_id = "open-baltimore-data"
client = bigquery.Client(project=project_id)
# import employee name info
employee_salaries_query = client.query(
"""
SELECT
info.employeeSlug,
main.objectId,
TRIM(REGEXP_REPLACE(main.agencyName, r'\(.*?\)', '')) as agencyName,
main.agencyId,
main.annualSalary,
main.fiscalYear,
main.hireDate
FROM city_employee_salaries.main as main
LEFT JOIN city_employee_salaries.employee_info as info
ON main.ObjectId = info.ObjectId
"""
)
employee_salaries = employee_salaries_query.result().to_dataframe()
Here, I created my own dictionary for translating existing city agency names to my own. I also filtered out records with no salary data. Then I used map to populate a cleanAgencyName column with the new names. map loops over each record in the existing agencyId field, checks my dictionary for a match, then returns the matches in the new column. It's a super fast function and the syntax is very simple. Lastly, I converted the hireDate field to datetime format so I can calculate tenure.
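Before the full cleaning code, a quick toy illustration of that behavior: map returns NaN for any ID missing from the dictionary, which makes unmapped agencies easy to spot. The IDs and dictionary here are made up for the example.

# toy example: Series.map returns NaN for keys missing from the dict
toy_ids = pd.Series(["A01", "A04", "ZZZ"])
toy_ids.map({"A01": "Mayors Office", "A04": "Rec & Parks"})
# 0    Mayors Office
# 1      Rec & Parks
# 2              NaN
# dtype: object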
# clean fields for analysis
agency_dict = {
"A01": "Mayors Office",
"A02": "City Council",
"A02": "Mayors OED",
"A04": "Rec & Parks",
"A05": "MONSE",
"A06": "Housing & Community Dev",
"A08": "M-R Human Services",
"A09": "Liquor License Board",
"A10": "Mayors Office of Children & Families",
"A11": "Office of the Inspector General",
"A12": "Finance",
"A14": "Finance",
"A15": "Comptrollers Office",
"A16": "Comptrollers Office",
"A17": "Finance",
"A18": "Finance",
"A19": "City Planning",
"A23": "Finance",
"A24": "Comptroller - Audits",
"A26": "M-R Labor Commissioner",
"A28": "Wage Commissioner",
"A29": "States Attorneys Office",
"A30": "Law Department",
"A31": "Circuit Court",
"A32": "Finance",
"A33": "Legislative Reference",
"A35": "Elections",
"A37": "Orphans Court",
"A38": "Sheriffs Office",
"A39": "311",
"A40": "BCIT",
"A41": "DPW - Admin",
"A44": "M-R Cable & Comms",
"A46": "Environmental Control Board",
"A49": "Transportation",
"A50": "DPW - Waste & Wastewater",
"A51": "Office of Equity & Civil Rights",
"A52": "Employee Retirement System",
"A53": "Finance",
"A54": "Retirement - Fire & Police",
"A57": "City Council Services",
"A64": "Fire Department",
"A65": "Health Department",
"A67": "Rec & Parks",
"A68": "Rec & Parks",
"A70": "DPW - Solid Waste",
"A73": "Municipal Zoning & Appeals",
"A75": "Enoch Pratt Free Library",
"A83": "Human Resources",
"A84": "Transportation",
"A85": "General Services",
"A86": "War Memorial Commission",
"A88": "Comptroller - Comms",
"A90": "Transportation",
"A91": "Convention Center",
"A99": "Police Department",
"A9": "Police Department",
"B49": "Transportation",
"B68": "Rec & Parks",
"B70": "DPW - Solid Waste",
"BPD": "Police Department",
"C90": "Transportation - Crossing Guards",
"P04": "Rec & Parks",
"P65": "Health Department",
"P83": "HR Test Monitor",
"R01": "R01",
"U01": "U01",
"SCS": "Special City Services",
"W02": "Youth Summer Works",
"W03": "Youth Cust",
"W07": "Youth Temp Adult",
"W08": "TANF Cust"
}
employee_salaries_clean = employee_salaries.copy(deep=True)
# keep records with a non-null, non-zero annualSalary (the comparison needs
# its own parentheses, since & binds tighter than !=)
employee_salaries_clean = employee_salaries_clean[employee_salaries_clean["annualSalary"].notnull()
                                                  & (employee_salaries_clean["annualSalary"] != 0)]
employee_salaries_clean['cleanAgencyName'] = employee_salaries_clean['agencyId'].map(agency_dict)
employee_salaries_clean = employee_salaries_clean[employee_salaries_clean["hireDate"].notnull()]
employee_salaries_clean['hireDate'] = pd.to_datetime(employee_salaries_clean["hireDate"], unit="ms")
# may need to implement check and ensure tenure across unique employees is the same
employee_salaries_clean['tenure'] = \
employee_salaries_clean['hireDate'].map(lambda hire_date: rd.relativedelta(dt.datetime.now(), hire_date).years)
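That last comment flags a check I haven't written yet. A minimal sketch of what it could look like, assuming the employeeSlug join populated for most records:

# possible sanity check: each employeeSlug should carry exactly one distinct
# hireDate, and therefore one tenure value
hire_date_counts = employee_salaries_clean.groupby("employeeSlug")["hireDate"].nunique()
conflicting = hire_date_counts[hire_date_counts > 1]
print(f"{len(conflicting)} employee slugs have more than one hire date")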
Once I noticed I had actual employee names, I really wanted to try to track employees across years. I tried to create a slug, but since there's only one Name field, employees with multiple records can have one record with a middle initial and one without. This would look something like Owens,Dyan-S-D01 and Owens,Dyan-NA-D01 labeled as different employees when they're the same person. I don't imagine this is a huge proportion of the records, and there are a couple of cleaning methods I could employ to make the employeeSlug more reliable (one sketched below). Once again, that'll probably be in a future analysis.
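For example, one of those cleaning methods could be as simple as dropping the middle-initial token before matching. The normalize_slug helper below is hypothetical, and it assumes every slug has the Name-Initial-AgencyID shape shown above:

# hypothetical cleanup: collapse "Owens,Dyan-S-D01" and "Owens,Dyan-NA-D01"
# into one "Owens,Dyan-D01" slug by dropping the middle-initial token;
# assumes all three tokens are always present
def normalize_slug(slug: str) -> str:
    name, _initial, agency_id = slug.rsplit("-", 2)
    return f"{name}-{agency_id}"

normalize_slug("Owens,Dyan-S-D01")   # "Owens,Dyan-D01"
normalize_slug("Owens,Dyan-NA-D01")  # "Owens,Dyan-D01"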
Regardless, I wanted to start by counting salary records by agency. This got me to the 500 record cutoff I used to keep my measures robust. Then, I ran measures by employee and agency. I calculated 7 measures to start:

- medSalary
- highestSalary
- lowestSalary
- growth
- salaryRecords
- tenure: this comes from the hireDate field I formatted earlier in the analysis. Each hireDate record for each individual employee should be the same, so I can use the first function to simply take the tenure field I calculated for each record. It's somewhat of a dynamic field since I get it by subtracting the number of years between the current date and the hire date provided. It rests on a lot of assumptions and is not a very straightforward measure to calculate. I definitely want to find a better way to calculate this going forward.
- avgRaise
Once I made it to the breakdown by city agency, I took most of the measures listed above and calculated their median. I chose the median because I figured like many organizations, the city probably pays those at the top very well and those at the bottom...less well. The mean is very sensitive to these outliers and the median is not. I'd love to do a modal breakdown in a future analysis. I'd probably create pay ranges of some kind so we can see what percentage of employees across the city fall into different ranges.
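That pay-range breakdown would probably lean on pd.cut. A rough sketch, with bin edges that are arbitrary placeholders:

# rough sketch of the future pay-range breakdown; bin edges are placeholders
salary_ranges = pd.cut(
    employee_salaries_clean["annualSalary"],
    bins=[0, 30000, 50000, 70000, 90000, float("inf")],
    labels=["under $30k", "$30k-50k", "$50k-70k", "$70k-90k", "$90k+"]
)
salary_ranges.value_counts(normalize=True).sort_index()  # share of records per range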
# count number of records per city agency
agency_record_count = employee_salaries_clean.groupby(["cleanAgencyName"], as_index=False).agg(
salaryRecords=pd.NamedAgg(column="cleanAgencyName", aggfunc="count")
)
agencies_with_500_records = agency_record_count[agency_record_count["salaryRecords"] >= 500]["cleanAgencyName"].values.tolist()
# calculate median salary, max salary, min salary, growth, salary records, tenure, and average raise by employee and agency
employee_salary_quality = employee_salaries_clean.groupby(["employeeSlug", "cleanAgencyName"], as_index=False).agg(
medSalary=pd.NamedAgg(column="annualSalary", aggfunc="median"),
highestSalary=pd.NamedAgg(column="annualSalary", aggfunc="max"),
lowestSalary=pd.NamedAgg(column="annualSalary", aggfunc="min"),
growth=pd.NamedAgg(column="annualSalary", aggfunc=lambda salary: max(salary) - min(salary)),
salaryRecords=pd.NamedAgg(column="annualSalary", aggfunc="nunique"),
tenure=pd.NamedAgg(column="tenure", aggfunc="first"),
avgRaise=pd.NamedAgg(column="annualSalary", aggfunc=lambda salary: (max(salary) - min(salary)) / len(salary))
)
agency_salary_quality = employee_salary_quality.groupby(["cleanAgencyName"], as_index=False).agg(
medSalary=pd.NamedAgg(column="medSalary", aggfunc="median"),
medHighestSalary=pd.NamedAgg(column="highestSalary", aggfunc="median"),
medLowestSalary=pd.NamedAgg(column="lowestSalary", aggfunc="median"),
medGrowth=pd.NamedAgg(column="growth", aggfunc="median"),
medTenure=pd.NamedAgg(column="tenure", aggfunc="median"),
medRaise=pd.NamedAgg(column="avgRaise", aggfunc="median")
).query("cleanAgencyName in @agencies_with_500_records")
agency_salary_quality.style.hide(axis="index")
cleanAgencyName | medSalary | medHighestSalary | medLowestSalary | medGrowth | medTenure | medRaise
---|---|---|---|---|---|---
BCIT | 46319 | 46747 | 45963 | 932 | 17 | 466.00
Circuit Court | 56363 | 60411 | 54049 | 0 | 14 | 0.00
Convention Center | 37752 | 39271 | 36167 | 600 | 14 | 228.00
DPW - Admin | 55511 | 57653 | 53191 | 0 | 10 | 0.00
DPW - Solid Waste | 35615 | 37613 | 34411 | 0 | 16 | 0.00
DPW - Waste & Wastewater | 40646 | 44151 | 38411 | 1636 | 17 | 552.67
Enoch Pratt Free Library | 40570 | 42959 | 37677 | 0 | 13 | 0.00
Finance | 43155 | 45589 | 41515 | 0 | 14 | 0.00
Fire Department | 70393 | 73583 | 62175 | 6194 | 18 | 1446.40
General Services | 43784 | 48637 | 41771 | 1159 | 15.5 | 487.25
Health Department | 42907 | 44153 | 40823 | 0 | 12 | 0.00
Housing & Community Dev | 49871 | 52639 | 46583 | 0 | 14 | 0.00
Police Department | 68165 | 75183 | 61183 | 1470 | 16 | 685.00
Rec & Parks | 35615 | 37571 | 33849 | 0 | 9 | 0.00
Sheriffs Office | 50696 | 54059 | 46495 | 2870 | 18 | 1027.00
States Attorneys Office | 62481 | 64957 | 57699 | 0 | 10 | 0.00
Transportation | 39745 | 42455 | 37553 | 998 | 17 | 392.00
Transportation - Crossing Guards | 10822 | 11875 | 9891 | 0 | 16 | 0.00
I don't use a lot of box plots, but here one works well. The boxes show you the maximum, minimum, and median salary for each agency. You can also see salaries at the 25th (q1) and 75th (q3) percentiles. The numbers here are not grouped by individual employee, so they may look a little different than the table above.
# add salary box plot
ordered_salary_data = employee_salaries_clean.query("cleanAgencyName in @agencies_with_500_records").loc[:, ["cleanAgencyName", "annualSalary"]] \
.groupby(["cleanAgencyName"]) \
.median() \
.sort_values(by="annualSalary", ascending=False)
fig = px.box(employee_salaries_clean.query("cleanAgencyName in @agencies_with_500_records"),
x="cleanAgencyName",
y="annualSalary",
points=False,
category_orders={"cleanAgencyName": ordered_salary_data.index.to_list()},
labels=dict(cleanAgencyName="City Agency", annualSalary="Annual Salary ($)")
)
fig.show()