Tag Archives: datascience

Useful datasets for Howard County election analysis

tl;dr: I release two useful Howard County election datasets in preparation for future posts.

In the coming days and weeks I’ll be posting some analyses of Howard County election results. Unfortunately the data released by the Howard County Board of Elections and the Maryland State Board of Elections is not always in the most useful form for analysis. In particular I was looking for per-precinct turnout statistics for the 2014 general election in Howard County, along with some way to match up precincts with the county council district of which they’re a part. That data is available in the 2014 general election results per precinct/district published by the Howard County Board of Elections, but unfortunately only as a PDF document.

PDF files are great for reading by humans, but lousy for reading by machines. They violate guideline 8 in the Open Data Policy Guidelines published by the Sunlight Foundation:

For maximal access, data must be released in formats that lend themselves to easy and efficient reuse via technology. … This means releasing information in open formats (or “open standards”), in machine-readable formats, that are structured (or machine-processable) appropriately. … While formats such as HTML and PDF are easily opened for most computer users, these formats are difficult to convert the information to new uses.

Since the data I wanted wasn’t in a format I could use, I manually extracted the data from the PDF document and converted it into a useful format (comma-separated values, or CSV) myself. Then, since someone else might find a use for them, I published the files online in a datasets area of my GitHub hocodata repository. The first two files are as follows:

  • hocomd-2014-precinct-council.csv. This dataset maps the 118 Howard County election precincts to the county council districts in which those precincts are included.
  • hocomd-2014-general-election-turnout.csv. This dataset contains turnout statistics for each of the 118 Howard County precincts in the 2014 general election, including the number of registered voters and ballots cast in each precinct on election day.
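As a rough illustration of how the two files fit together, here is a minimal sketch in Python that joins a precinct-to-district mapping with per-precinct turnout figures and totals them by council district. The column names (`precinct`, `council_district`, `registered_voters`, `ballots_cast`), the precinct IDs, and the numbers are all made up for the example; the real CSV files may be laid out differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows standing in for the two CSV files.
# Column names, precinct IDs, and counts are assumptions for illustration.
PRECINCT_COUNCIL_CSV = """precinct,council_district
01-001,1
01-002,1
02-001,2
"""

TURNOUT_CSV = """precinct,registered_voters,ballots_cast
01-001,1000,450
01-002,2000,900
02-001,1500,600
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def turnout_by_district(precinct_rows, turnout_rows):
    """Return {district: ballots cast / registered voters}, summed over precincts."""
    district_of = {r["precinct"]: r["council_district"] for r in precinct_rows}
    totals = defaultdict(lambda: [0, 0])  # district -> [registered, cast]
    for row in turnout_rows:
        district = district_of[row["precinct"]]
        totals[district][0] += int(row["registered_voters"])
        totals[district][1] += int(row["ballots_cast"])
    return {d: cast / registered for d, (registered, cast) in totals.items()}


rates = turnout_by_district(read_rows(PRECINCT_COUNCIL_CSV), read_rows(TURNOUT_CSV))
for district, rate in sorted(rates.items()):
    print(f"District {district}: {rate:.1%} turnout")
```

Summing voters and ballots before dividing gives each district's overall turnout; averaging the per-precinct percentages instead would weight small and large precincts equally.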

Stay tuned for some interesting ways to use this data.

Fun with Howard County building permit data

tl;dr: I have fun creating graphs and maps with building permit data from data.howardcountymd.gov.

I’ve written previously about the cornucopia of interesting data sets that Howard County government has made available at the data.howardcountymd.gov site. I had some spare time over a long weekend and decided to try analyzing some of that data, including making use of the various map files on the site (under the “Spatial Data (GIS)” tab).

The particular data set I decided to start with was for building permits issued for residential and commercial construction—not because I have a burning interest in building permits but because I mentioned this type of data in my last post and thought it would be a relatively easy data set to analyze. The particular question I decided to look at was how many residential building permits were issued in each zip code within Howard County in 2014—basically to get a feel for where the most construction was occurring in the county. (It’s only an approximate measure because some permits cover multiple units.)

bar chart showing Howard County residential building permits per zip code

To do the analysis I used the skills and the tools I learned in the courses that are part of the Johns Hopkins data science specialization series on Coursera. (See my Coursera-related posts for more on my experiences in these classes.) I won’t go over the process here since I’ve separately published full details on my RPubs page, with the source code available in my hocodata GitHub repository.

I first created a simple table of the top zip codes for residential permits issued. This was sort of boring so I won’t reproduce it here; you can find it in the first example analysis I did. More interesting is the bar chart I created as part of the second example. It’s clear from the chart that there’s wide variation among Howard County zip codes in terms of residential construction. The two Ellicott City zip codes combined (21042 and 21043) accounted for the largest fraction of residential building permits in 2014; in contrast there were almost no permits issued for east Columbia (21045).
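For readers who want a feel for the counting step without opening the full write-up, here is a minimal sketch in Python (not the R code used in the published examples). The column names `permit_type`, `issued_date`, and `zip_code`, and the sample rows, are assumptions for illustration; the county's actual CSV export may use different names and date formats.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real permit export may be structured differently.
PERMITS_CSV = """permit_type,issued_date,zip_code
Residential,2014-03-02,21042
Residential,2014-05-17,21043
Commercial,2014-06-01,21045
Residential,2013-11-30,21042
Residential,2014-08-09,21042
"""


def residential_permits_per_zip(csv_text, year):
    """Count residential permits issued in the given year, keyed by zip code."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        is_residential = row["permit_type"] == "Residential"
        issued_in_year = row["issued_date"].startswith(f"{year}-")
        if is_residential and issued_in_year:
            counts[row["zip_code"]] += 1
    return counts


counts = residential_permits_per_zip(PERMITS_CSV, 2014)

# Print a crude text bar chart, busiest zip code first.
for zip_code, n in counts.most_common():
    print(f"{zip_code} {'#' * n} {n}")
```

As with the real analysis, this counts permits rather than housing units, so it is only an approximate measure of construction activity.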

Howard County map showing residential building permits per zip code

However what I really wanted to create was a map showing exactly where permits were being issued across the county. The Howard County GIS division provides on data.howardcountymd.gov a set of map data for zip codes within Howard County. After doing a bit of research and experimentation, in my third example I was able to use this in conjunction with the building permit data to produce a map that is a nice alternative to the bar chart.

I have to stop here and ask the unspoken question: What’s the point of all this? I’d answer as follows:

First, this shows that releasing government data empowers people to do interesting things with it, especially when combined with free software and easily available online information and training. Maybe not everybody is interested in building permit data or any other individual government data set, but I suspect that there are a fair number of people out there who are, including small businesses, nonprofit organizations, or just individual activists and interested citizens.

Second, I did all this in a way that is completely reproducible by anyone else. How often have you seen a graph or map in a newspaper or government report and wondered, where exactly did that data come from? Wonder no longer: In my examples I start with the raw data as released by Howard County and show all my work in analyzing the data and creating the tables, charts, and maps.

Finally, this is all reusable and adaptable. For example, suppose you have a better source of data on construction activity, perhaps one that gives the actual numbers of residential units, commercial square footage, and so on. You can easily plug that modified data into the analysis steps I’ve documented, and create better versions of the charts and maps in my examples.

You can also reuse the overall technical approach for any type of data tied to a geographic area within Howard County. For example, in addition to zip code areas, the data.howardcountymd.gov site contains map data for Howard County school districts, election precincts, census tracts, and many other subdivisions of the county. If you have data sets that are based on those subdivisions (for example, vote totals or turnout percentages for precincts) then you can adapt the code I wrote (all of which is in the public domain) to create your own maps showing how that data varies across the county.
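The core of that approach is a join between map areas and per-area values, followed by sorting the values into shading classes. Here is a minimal sketch in Python (again, not the R code from my examples) using a GeoJSON-style structure. The `PRECINCT` property name, the precinct IDs, the turnout figures, and the class breaks are all assumptions made up for the example.

```python
import bisect

# Hypothetical GeoJSON-style data. The "PRECINCT" property and the IDs are
# made up; geometry is left as None (an unlocated feature) to keep this short.
precincts = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"PRECINCT": "01-001"}, "geometry": None},
        {"type": "Feature", "properties": {"PRECINCT": "02-001"}, "geometry": None},
        {"type": "Feature", "properties": {"PRECINCT": "03-001"}, "geometry": None},
    ],
}

# Hypothetical turnout fractions; precinct 03-001 deliberately has no value.
turnout = {"01-001": 0.45, "02-001": 0.62}


def attach_values(feature_collection, values, key_property, value_property):
    """Return a copy whose features carry values[key] in their properties."""
    features = []
    for feature in feature_collection["features"]:
        props = dict(feature["properties"])  # copy, so the input is not modified
        props[value_property] = values.get(props[key_property])  # None if missing
        features.append({**feature, "properties": props})
    return {**feature_collection, "features": features}


def shade_class(value, breaks):
    """Return the index of the shading class for value, or None if no data."""
    if value is None:
        return None
    return bisect.bisect_right(breaks, value)


BREAKS = [0.40, 0.50, 0.60]  # assumed class boundaries, giving four classes

merged = attach_values(precincts, turnout, "PRECINCT", "turnout")
for feature in merged["features"]:
    props = feature["properties"]
    print(props["PRECINCT"], props["turnout"], shade_class(props["turnout"], BREAKS))
```

A mapping library would then color each area by its class index. Areas with no matching data are kept with a value of `None` rather than dropped, so gaps stay visible on the map.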

The bottom line is that the data is out there for the picking, as are the tools to make sense of it. You just need to spend some time learning how to use them or (if you don’t feel up to the task yourself) finding someone who can. Have fun!

Howard County government by the numbers

tl;dr: As we wait to hear more about Allan Kittleman’s HoCoStat proposal, you don’t have to wait to download lots of useful county-related data at data.howardcountymd.gov.

During his (ultimately successful) campaign for Howard County Executive, one of Allan Kittleman’s key proposals was to establish HoCoStat, a program to (in Kittleman’s words), “measure … response and process times for various government functions” to help “increase responsiveness, improve efficiency and heighten accountability”. Kittleman’s administration is in its early days, and nothing much has been heard yet about how and when HoCoStat might be implemented. (Even the original HoCoStat proposal has disappeared from Kittleman’s web site as it’s being redesigned, although the Internet Archive has a copy.)

But don’t despair! While we’re waiting for HoCoStat to make an appearance there are other Howard County data-related resources we can explore. In particular, the data.howardcountymd.gov site has a good and growing collection of county-related datasets, many of them tied to county maps (no surprise, since the site is maintained by the county’s Geographic Information System (GIS) Division). Part of what makes the site great is that it does not just present predefined maps and PDF documents, but also provides the raw data used to create those maps.

For example, suppose you’re interested in building permits issued in Howard County. At the simplest level you can view an interactive map showing the locations for all such permits; you can click on the icons corresponding to the issued permits and see the exact address, date when the permit was issued, and other information.

But let’s suppose you want to do more in-depth analysis of permits issued: For example, which areas are seeing the most residential or commercial permits issued? Or, what is the trend for permits issued over time? The data.howardcountymd.gov site also lets you download the raw data behind the map in a variety of formats, for example in CSV format for use with Excel spreadsheets or statistical software like R, KML format for use with Google Maps and Google Earth, and several others. Armed with the relevant data files you can create your own maps and do your own analysis, including combining the Howard County data with data from other sources like U.S. Census data.

All in all the site—which is still evolving—is a model for how Howard County government can make useful data available to the Howard County individual and corporate taxpayers who are ultimately paying for county services. It would be great to see this strategy extended to HoCoStat as well. For example, when promoting the HoCoStat proposal Allan Kittleman pointed to (among others) Montgomery County’s CountyStat site as a model to emulate. While CountyStat is very nice, it has the disadvantage that you can’t see the raw data behind the performance indicators.

For example, CountyStat has some summary statistics relating to issuance of building permits: average number of days to issue a residential permit, commercial permits for new construction, or other commercial permits. But there’s a lot more one might want to know: For example, what’s the variability in the time to issue permits? Are there some permits that for whatever reason took a really long time to issue? How does the time to issue permits vary across the county? Are there particular areas that (for whatever reason) are experiencing greater or lesser delays in getting permits issued? Having the raw data behind the indicators would permit (no pun intended) interested parties to answer these questions, from commercial developers doing large-scale projects down to a small contractor building a single home.
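To make the point concrete, here is a minimal sketch in Python of what raw per-permit data would allow beyond a single published average. The records, field names (`zip_code`, `applied`, `issued`), and the 60-day threshold are entirely hypothetical; neither CountyStat nor Howard County is known to publish data in this form.

```python
from datetime import date
from statistics import mean, median, pstdev

# Hypothetical permit records; all fields and dates are invented for illustration.
PERMITS = [
    {"zip_code": "21042", "applied": date(2014, 1, 6), "issued": date(2014, 1, 20)},
    {"zip_code": "21043", "applied": date(2014, 2, 3), "issued": date(2014, 2, 13)},
    {"zip_code": "21045", "applied": date(2014, 3, 3), "issued": date(2014, 3, 15)},
    {"zip_code": "21042", "applied": date(2014, 4, 1), "issued": date(2014, 6, 30)},
]


def days_to_issue(permit):
    """Number of calendar days between application and issuance."""
    return (permit["issued"] - permit["applied"]).days


def summarize(permits, slow_threshold_days):
    """Summarize waiting times and list permits slower than the threshold."""
    waits = [days_to_issue(p) for p in permits]
    return {
        "mean": mean(waits),
        "median": median(waits),
        "stdev": pstdev(waits),  # population standard deviation
        "slow": [p for p in permits if days_to_issue(p) > slow_threshold_days],
    }


summary = summarize(PERMITS, slow_threshold_days=60)
print(f"mean {summary['mean']:.1f} days, median {summary['median']:.1f} days, "
      f"stdev {summary['stdev']:.1f} days")
for permit in summary["slow"]:
    print(f"slow permit in {permit['zip_code']}: {days_to_issue(permit)} days")
```

In this made-up sample a single slow permit pulls the mean far above the median, which is exactly the kind of detail an average alone hides.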

As I wrote in my previous post on Howard County government data initiatives, providing unfettered access to raw data (subject to reasonable concerns relating to individual privacy and corporate confidentiality) is key to making government data useful: It allows the private and civic sectors to exercise their own creativity in using that data, rather than trying to have government anticipate every possible use for it, and also lets the private and civic sectors hold government accountable by enabling them to do their own independent analyses of government data. It’s great to see what Howard County government (and the GIS Division in particular) has been and is doing to make useful data generally available. I hope that as the Kittleman administration gets down to work and the HoCoStat program is implemented, that spirit of openness and commitment to serving citizens through government data continues.

Online competency-based education

Following up from my previous post on my experience with Coursera, here are a few links of interest (mostly) relating to online education, with a focus on “competency-based education”, i.e., education directed specifically at teaching people to become competent at one or more tasks or disciplines:

“Hire Education: Mastery, Modularization, and the Workforce Revolution” (Michelle Weise and Clayton Christensen). Clayton Christensen is famous for his theory of “disruptive innovation”, which I think is useful not so much as a proven theory but rather as a way to structure plausible narratives about business success or failure. When Christensen fails in his predictions it’s usually because he doesn’t pay attention to things that don’t fit neatly into his preferred narratives. For example, he and co-author Michael Horn previously hyped for-profit education companies and failed to see that for many of them actually educating students was not the point. Rather those companies identified a “heads I win, tails you lose” business proposition in “chasing Title IV money [i.e., government-subsidized student loans] in a federal financial aid system ripe for gaming”. This represents a second try by Christensen and his associates to forecast the future of post-secondary education.

“The MOOC Misstep and the Open Education Infrastructure” (David Wiley). One of Clayton Christensen’s blind spots is that he tends to overlook what’s going on in the area of not-for-profit endeavors. In his blog “Iterating toward Openness” David Wiley covers the general area of open educational resources (or OER); this post is a good introduction to his thinking.

Web Literacy Map (Mozilla project). A real-world example of the sort of competency-based open education initiative that Wiley’s promoting. See also the Open Badges project, a Mozilla-sponsored initiative to create an open infrastructure for granting and publishing credentials.

“A Smart Way to Skip College in Pursuit of a Job” (Eduardo Porter for the New York Times). “Nanodegrees” are online education provider Udacity’s own take on competency-based education, created in cooperation with major employers.

“Missing Links: How Coding Bootcamps Are Doing What Higher Ed and Recruiting Can’t” (Robert McGuire for SkilledUp). You may be beginning to see a trend here: A lot of the action in competency-based training is around software development, data science, and related fields. That’s because there’s high demand for skilled employees in certain fields and a lack of truly focused traditional educational offerings to meet that demand. A related trend: Sites like SkilledUp that are trying to become trusted guides to these new-style offerings.

Last but not least, here are some other people’s reviews of the Johns Hopkins Data Science Specialization courses on Coursera that I’m currently taking:

From a local point of view these changes (if indeed they continue and are amplified) are not likely to affect high-end universities like Johns Hopkins; they’ll survive based on their ability to select the most talented applicants and plug them into a set of networks that will maximize their chances of success.[1] The question is rather how they’ll affect institutions like Howard Community College that serve a broader student population that’s looking to acquire job-relevant skills.

[1] Note that from this point of view online offerings like the Johns Hopkins Data Science Specialization help to promote the institution and identify potential applicants. In fact, just this week I received an email from the Bloomberg School of Public Health inviting me to attend one of their “virtual info sessions” for people considering applying.

Adventures in online education

The last three months or so I’ve been in school (which is why I haven’t been posting as much lately). Not a real bricks-and-mortar school—I’ve been participating in the “Data Science Specialization” series of online courses created by faculty at the Johns Hopkins Bloomberg School of Public Health and offered by Coursera, a startup in the online education space. It’s been an interesting experience, and well worth a blog post.

The obvious first question is, why am I doing this? Mainly because I thought it would be fun. I was an applied mathematics (and physics) major in college, enjoyed the courses I had in probability, statistics, stochastic processes, etc., and wanted to revisit what I had learned and (for the most part) forgotten. It’s one of my hobbies, a (bit) more active one than watching TV or reading. Also, I’ve done some minor fiddling about with statistics on the blog (for example, looking at Howard County election data), am thinking about doing some more in the future, and wanted to have a better grounding in how best to do this. Finally, “data scientist” is one of the most hyped job categories in the last few years, and even though I probably won’t have much occasion to use this stuff in my current job it certainly can’t hurt to learn new skills in anticipation of future jobs.

The next question is, why an online course? Because I didn’t have the time (or the money) to commit to attending an in-person class, but I wanted the structure that a formal class provides. I’ve been (re)learning linear algebra out of a textbook for over four years now, and I still haven’t gotten past chapter 3. Part of the reason is that I’m doing every exercise and blogging about it, but mainly it’s that I don’t have an actual deadline to finish my studies. In the Coursera series there are nine courses, each lasting a month, with quizzes every week and course projects every 2-4 weeks depending on the course. I’ve been doing pretty well in the courses thus far and don’t want to spoil my record. For example, the first project in the current class was due Sunday but I was concerned about missing the deadline and so finished it last Friday night.

I like the way the series of courses is structured as well, not just as a class in statistics (only) but covering the whole range of skills needed to wrangle with data in its various forms, not least including the problems of getting datasets and cleaning them up. Each class thus far has only been a month long, so the time commitment is not that great and I know any work I do today will pay off in a completed course not too far down the road. It is a fairly serious commitment of time though, especially since the course video lectures cover only a fraction of what you need to know in order to do the course projects and correctly answer the more difficult quiz questions. I’ve probably spent almost 10 hours each week working on various aspects of the classes, including doing a copious amount of Internet searching to find out the additional information I need. But it’s been time well-spent: I feel like I’m getting a good understanding of how to do “data science” tasks—not that I know everything, but I have a much better picture of what I need to know, and what it would take to finish learning it.

The course I’m currently taking (“Exploratory Data Analysis”), like the others in the series, is what’s been referred to as a MOOC, or “massive open online course”, open at no charge to anyone in the world who wants to participate over the Internet. The instructors provide video lectures and create the quizzes and class projects but are not otherwise directly involved; the students provide help to each other in online discussion forums, assisted by “community TAs”, i.e., former students who volunteer as teaching assistants. MOOCs have recently been the subject of both hype and caution; now that I’ve been involved in them day-to-day I can provide a personal perspective on the controversy.

First, I think MOOCs are good for the sort of people who invented them in the first place: Internet-savvy folks with a technological bent who are motivated to learn something and have the necessary free time and background experience and knowledge to do so effectively. I’ve certainly appreciated having convenient no-charge access to a wide variety of classes, many of which (like the courses I’m taking now) have been put together by people who are leaders and innovators within their fields. I’d even consider paying for at least some of these courses (at $49 each) in order to get a more formal “verified certificate” (as opposed to a “statement of accomplishment”), and may do so for later courses within this series. That’s potentially good news for Coursera, which in the end is a profit-making enterprise.

However, for people who are not Internet-savvy, not all that motivated, and don’t have the necessary background, MOOCs aren’t a good choice. In fact, they’re about the worst choice there is. The dropout rates in MOOCs are extremely high (well above 90% in many cases), and the first serious test of MOOCs as a replacement for in-person college courses (at San Jose State University) was not a raging success. Which is not to say that online learning in general is doomed; in its more traditional forms (for example, University of Maryland University College) it’s doing quite fine.

MOOCs are simply the latest in a long line of attempts to move away from the traditional classroom model and “disrupt” the existing educational establishment. They’ll eventually find a place in the overall educational picture, most likely serving a variety of needs, from “learning as hobby” (what I’m doing) to high-end vocational education (what Coursera competitor Udacity seems to be morphing into) to a supplement to traditional classes. But that’s for the future, and no real concern of mine; in the meantime I’m just trying to learn how to plot in R.