Of course, the title is an overstatement. This weekend I took part, for the first time, in a hackathon in Prague, Czech Republic, called "Hack the State" (Hackuj stát). It was about using the government's open data to build interesting projects and demonstrate that publishing government data is worthwhile. The Czech Republic already has an open data initiative with many datasets from many government institutions. This was the third year of the hackathon. The rules are simple: there is a 24-hour limit to create something meaningful using open data provided by various institutions. The hackathon started on Friday the 13th at 16:00 and lasted until Saturday the 14th at 18:00. The data sources were available several weeks in advance, so it was possible to start working on projects earlier. First there was an overview of the data sources, then the planned projects were presented, teams were assembled (a maximum of five people per team), and people dove into work. At the end, the creators had the opportunity to present their work to the jury within 3 minutes. The jury picked the top three projects, which received awards.
Unfortunately for me, I had been busy working on projects for the last few weeks. On top of that, the beginning of September is challenging: the school year starts, it takes some time to adjust to the new rhythm, and my younger son started first grade. So I didn't really have time to prepare or to think about what I could create there. I arrived 30 minutes late because I had driven my kids and wife to my in-laws (outside Prague) and back. Lastly, I hadn't slept well the night before, so I was just tired on Friday afternoon.
The room was already packed with participants (my rough estimate is between 50 and 100), and mentors were presenting the data sources. When I arrived, I had no idea what to do or which sources to use. It looked like the majority of people were already in teams anyway and knew what they were going to create. My backup plan was to work on a data pipeline I had started some time ago, which loads data from various Czech open data sources into BigQuery, but I didn't feel it would be worth presenting: it wasn't a real-life use case of open data, just an intermediate step. I was hoping to build a real-life use case and potentially try out some new things.
I needed to pull myself together, so I went into a separate (dining) room where I was more or less alone, and tried to brainstorm and think about my next steps. I started to draw a mind map, and this is what came out of it.
In the end nothing concrete, I guess, but it kicked off my thinking in a few directions. I knew that doing some data analysis (stats, graphs) could be nice and useful. After all, not all stats are boring when you find previously unknown information and interesting connections. But I felt I could miss the target if I went that way. At the opposite end of the spectrum was machine learning, which would be cool, but I felt it could be too big a bite for me, and I wanted to complete the project within 24 hours. On the other hand, I also wanted to do something not so obvious and usual (whatever that means).
During the dataset presentations, the datasets from the Technology Agency of the Czech Republic (TACR) caught my attention, I guess because I like science and technology. TACR is a government organization whose role is to support research and innovation. The datasets contained information about companies and institutions that applied for grants in TACR's projects and were either accepted or rejected.
I talked with the "mentor" for these datasets, Mr. Martin Vita, trying to understand the datasets better and what could be done with them. Several ideas popped up during the conversation. Ultimately he came up with an idea that he considered useful, something that he and his colleagues could use, and it went something like this:
Basically, when the agency opens new projects, companies and institutions can apply for grants for those projects. However, there are many companies that don't apply to TACR but do apply to other funding institutions for projects in similar areas. So the idea is to find, for a given area, companies that participated in projects of other funding institutions but not yet in TACR's, or that applied to TACR but were rejected, and to inform them explicitly about a possibly relevant project. This looked like a good use case: doable and potentially useful, although not super exciting. We both agreed that this wasn't going to dazzle the jury, but I was motivated enough to devote time to developing it. I called it the "Company search for cooperation app". I knew the direction, and it was a good starting point for my work.
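To make the selection rule concrete, here is a minimal sketch in plain Python. It is not the query I actually wrote for the app. The record layout, the organization names, the `OTHER_AGENCY` label and the function name `candidates_for_area` are all made up for illustration.

```python
# Hypothetical records, invented for illustration only:
# (organization, funder, area, outcome)
APPLICATIONS = [
    ("Alpha s.r.o.", "TACR", "energy", "accepted"),
    ("Beta a.s.", "TACR", "energy", "rejected"),
    ("Beta a.s.", "OTHER_AGENCY", "energy", "accepted"),
    ("Gamma z.s.", "OTHER_AGENCY", "energy", "accepted"),
    ("Delta s.r.o.", "OTHER_AGENCY", "transport", "accepted"),
]


def candidates_for_area(applications, area, agency="TACR"):
    """Return organizations worth contacting about a new call in `area`.

    An organization qualifies if it was active in the area with another
    funder but never applied to `agency`, or if it applied to `agency`
    and was only ever rejected.
    """
    funded = set()            # accepted by the agency in this area
    rejected = set()          # rejected by the agency in this area
    active_elsewhere = set()  # took part in projects of other funders

    for org, funder, app_area, outcome in applications:
        if app_area != area:
            continue
        if funder == agency:
            (funded if outcome == "accepted" else rejected).add(org)
        else:
            active_elsewhere.add(org)

    never_applied = active_elsewhere - (funded | rejected)
    only_rejected = rejected - funded
    return sorted(never_applied | only_rejected)


print(candidates_for_area(APPLICATIONS, "energy"))
```

In this toy data, "Alpha s.r.o." is left out because TACR already funded it, while the other two energy organizations are kept. In the real app the same set logic would live in SQL, but the rule is easier to read this way.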
One of the other ideas was to look at connections between companies and institutions participating in the same projects, build something like a graph of those connections, and see if anything interesting could be found there.
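I didn't get to this one, but a rough sketch of how such a graph could be built might look like the following. The project ids, the organization names and the function name `co_participation_edges` are hypothetical. An edge weight is simply the number of projects two organizations share.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data, invented for illustration only:
# project id -> participating organizations
PROJECTS = {
    "P1": ["Alpha s.r.o.", "Beta a.s.", "Gamma z.s."],
    "P2": ["Alpha s.r.o.", "Beta a.s."],
    "P3": ["Gamma z.s.", "Delta s.r.o."],
}


def co_participation_edges(projects):
    """Count how many projects each pair of organizations shares."""
    edges = Counter()
    for members in projects.values():
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(members)), 2):
            edges[(a, b)] += 1
    return edges


edges = co_participation_edges(PROJECTS)
strongest = edges.most_common(1)[0]
print(strongest)
```

The resulting weighted edge list could then be handed to any graph library or visualization tool to look for clusters of organizations that repeatedly work together.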
The last idea was to do some simple machine learning and build something like an application predictor, i.e. you type in the name of a project and it tells you the probability that a project with that name gets funding. Of course, that was more of a fun project than something serious. Anyway, I published my ideas/project, hoping someone would join, especially someone with frontend/design skills, which are not my strengths, but nobody did :(.
I started the real work on Friday around 20:00. I forgot to mention that refreshments (food & drinks) were available the whole time, so physical needs were met. I focused on the "Company search app", and if there was time left, I would try to work on something else from the list.
The architecture of the search app was straightforward. I needed a storage solution and hosting for the backend. Since my working life revolves around Google Cloud, it was a no-brainer. I chose BigQuery and App Engine, tools I'm very familiar with. The advantage of these products is that they are managed, i.e. there is no need for setup and administration, and they have a nice free tier, which for my use case makes them free of charge.
The data were in good shape; I didn't have to do any extra cleaning. I also used data from the "Central system of R&D", which provided all funded projects and the institutions involved across all funding agencies in the Czech Republic. I created three tables in BigQuery and started writing an SQL query to get the required data. I'm not sure why, maybe because it was Friday evening and I was mentally and physically exhausted, but writing this query took me far more time than I expected, which made me more and more frustrated and tired, and I wanted to give up. Ultimately I went to sleep around 3:00 so I could rest for at least a few hours and continue in the morning. That helped, of course, and although it didn't go smoothly in the morning either, I managed to tune the query and complete it. In the end it looks something like this (it doesn't fit on one screen).
The next step was to wrap it in a backend and build a simple UI for users to select options and see the returned data. This didn't go smoothly either, but it was much less painful than writing the SQL query. I ended up with a working prototype of a web application deployed on App Engine, using Python and Flask, with Bootstrap to make the frontend at least somewhat civilized.
This is how it looks. I published the code for the web application on GitHub: https://github.com/zdenulo/hackujstat3
I sent a link to the web app to Mr. Vita from TACR, who replied with feedback and some requests, which I managed to incorporate, and I was done around 15:00. The last thing was to prepare a presentation, but other problems started to pop up. Since May I had had tickets to a theatre play for this Saturday. I thought the play started at 20:00, but my wife reminded me that it started at 19:00. This was a problem, of course, since the project presentations were supposed to start at 18:00 and I needed to catch public transport at 18:15, which meant I had to leave the building by 18:10 at the latest. Fortunately, the organizers let me present first, and since there was a 3-minute cap on presentations, I managed to make it to the train and the play. The second thing was that while working on the presentation, I realized how bad my written Czech is, since I'm used to writing in English, so I asked one person to review my slides and fix the errors. I also felt a little insecure presenting, since I was first and under time pressure, and I feel I could have explained my project more clearly. On the other hand, I was alone and responsible for everything, so in the end I'm satisfied.
I'm glad that I could participate. The organization was good, as was the atmosphere. I guess I was so focused on my work that I didn't have much opportunity to chat with other participants, but ultimately I'm glad that I completed a working application (officially, 17 out of 20 projects were completed), with the hope that it will not turn to dust but will be useful... for a few people on this planet :)
Anyone interested in data science or analysis can (and should) dig into the datasets and look at the other projects for inspiration on the official website. I hope that next year I will be better prepared... and that I will write an article in Czech.