Transitive Closure

I’d planned to write this piece as normal on Friday. But nothing is normal about life just now, although mine seems to have descended into a new kind of normal, both on a physical and a mental level.

Let’s deal with the physical first as that’s a lot easier to explain away.

I stopped going out on the roads three weeks ago because I knew there was going to come a point when I could no longer justify, in Joe Public’s eyes, heading out to the coast for three hours every day. As it now turns out, three hours is way in excess of what the government has decided is sufficient to maintain a good fighting weight, so there’s no way I’d have got away with all the finger pointing and whispering that goes on in a small town. So I went in the shed and that’s where I now bag my miles, on a turbo trainer. For the record, I decided to re-create the Ride2Cure across Australia by riding from Brisbane to Adelaide following a Bluetooth enabled map on what looks like SatNav on my phone. When the connection stays up it’s fine, but I’ve discovered to my cost that it drops out if I so much as answer a phone call or click on a video link in Twitter while I’m cycling.

I’m halfway, although it won’t actually feel like it until I get to Coolamon tomorrow. For Coolamon, read Wagga, because that’s where Paul and I took a rest day on the original Ride2Cure. The clock is currently showing 773 miles (1,244km) so there is justification for raising a glass to a job half done this evening: the original Ride2Cure was 2222km if you remember right.

But more important than being halfway is the rate at which I’ve managed to do it. I did think, for a millisecond or two before I virtually left Brisbane, that I should try to re-enact every stage in its entirety. Notice I said a millisecond or two, because it was clear from the moment I got on the bike that this gig was going to take its toll in a different kind of way from what I’ve been used to. When you’re just focussed on a screen (and whatever happens to be blowing about the garden) it’s far too easy to push a (big) gear that you think is okay, and it’s only after about 75 minutes, or when you wake up in the night with a dull ache at the top of your hamstring, that you realise that this is different.

Having said that, I could have taken a 300 mile week this week if I’d wanted it. 293 miles was the most in a single week since I got back on the bike last July, although the three week total of 808 is still some way short of the heady days of last August. With lockdown set to continue well into May, I suspect, the second coming of Ride2Cure is looking odds on to rack up 300 stages on the spin – at an average that’s closing in on 34 miles a day – and at the back of my mind is a full calendar year without losing a day. I once managed that as a runner back in the Cumbernauld days, but I got ill so many times during that year (8 colds I counted) that it was suggested that I might have some kind of immune deficiency. I was tested for the coxsackie B virus at the time but it came back negative. Looking back at that period, I think a better medical term would have been burning the candle at both ends and in the middle.

Yes, the physical side of being on the bike continues to push my body to the limit, but it’s something I can deal with.

I wish I could have said the same about the mental side these last few days.

A few weeks ago, I mentioned on my Facebook timeline that I planned to do some clinical research into COVID-19 using SNOMED-CT. Two years on from qualifying as a SNOMED developer and building a virtual GP Practice on my laptop (using entirely fictitious patients) I reckoned it would be good to document all of the clinical interfaces between COVID-19 and each of the underlying health conditions that it’s known to be associated with. Hearing about this stuff on the news is one thing: seeing it documented in a database with each of the associated attributes of the disease is something else altogether. Forget fake news: this stuff represents the leading edge of everything that mankind knows about this virus today. It’s all in the database.

I thought that the February release of the SNOMED-CT database from NHS Digital had all of the COVID-19 stuff in it, but after spending the best part of a day loading and indexing 14 million rows of data, it wasn’t there. So I had to sit twiddling my thumbs, researching virtual dog racing instead, waiting on an update. That duly came on Wednesday, April Fools’ Day if you like irony.

The data’s released in two parts: the full International release, which forms the bulk of the whole thing, and the GB extension, which adds about 25% on top, thereby localising the International Edition.

That was Wednesday taken care of, but fortunately it didn’t take quite as long as the last time because I already had a lot of the data structures prepared, albeit that I’d emptied them of data again.

The COVID-19 stuff makes up a small fraction of the whole thing, but in order to do the research effectively – so that you don’t miss something by mistake – you need the lot: 14 million rows.

But on top of that humungous amount of data sits a reference table that you have to build by running a script (that’s a special program that I got my hands on by virtue of being on the implementor’s course) that builds a Parent-Child table of all of the relationships in the entire database. It’s an absolute beast of a table, yet it only contains two columns of data: a parent id and a child id. In SNOMED, a parent can have multiple children, and a child can have multiple parents. Let that sink in for a moment.

That table is called the Transitive Closure, and it has 6.4 million rows of data.
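
For anyone who wants to see the idea in miniature, here is a rough Python sketch. It is not the script from the implementor’s course, just an illustration of what a transitive closure is in the usual sense: starting from the direct parent-child links, it records every pair where one concept sits anywhere above another, however many steps apart. The concept names are made up for the example and are not real SNOMED identifiers.

```python
from collections import defaultdict


def transitive_closure(is_a_pairs):
    """Return every (ancestor, descendant) pair implied by direct (parent, child) pairs."""
    # Index the direct links: parent -> set of immediate children.
    children = defaultdict(set)
    for parent, child in is_a_pairs:
        children[parent].add(child)

    closure = set()
    for ancestor in list(children):
        # Walk everything reachable below this ancestor.
        stack = list(children[ancestor])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # a child with several parents is reached more than once
            seen.add(node)
            closure.add((ancestor, node))
            stack.extend(children.get(node, ()))
    return closure


# Toy hierarchy (hypothetical names): one child with two parents.
toy_pairs = [
    ("disease", "viral_infection"),
    ("disease", "respiratory_disease"),
    ("viral_infection", "covid19"),
    ("respiratory_disease", "covid19"),
]

for ancestor, descendant in sorted(transitive_closure(toy_pairs)):
    print(ancestor, descendant)
```

Four direct links turn into five rows here, because "disease" also gets paired with "covid19" even though the two are never linked directly. Scale that up to a whole terminology and the row count climbs quickly.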

Creating it is always the last piece in the jigsaw. Previously, whenever I built a SNOMED database, I just went with the International Edition because it had everything that I needed at that time. But this time I needed the GB extensions for the NHS stuff.

My initial plan was to create the Transitive Closure table for the data coming from the International Release, then do the same for the GB data before merging the two result sets.

The GB file failed. Every time. And that was the little one.

The International file worked fine, just like it had always done in the past.

That was on Wednesday night.

So then I thought “I know what I’ll do: I’ll load both data sets into the database, then create a single comma delimited text file with a SQL query.”

Piece of piss.

Then I ran the special script on the output file. It ran for hours. It seemed to be doing something, but it just went on, and on, and on. After 24 hours, I killed it. There appeared to be no rhyme nor reason why it should take so long. The same script running on the International data took five minutes: so why would adding another 25% still be running after 24 hours?

These are things that drive programmers round the bend.

Anyway, while I was virtually cycling the 40 miles from Quandialla to Temora in New South Wales today, I had an idea. Quite why I never thought of this before is beyond me, but maybe I can get away with being 67 and such ideas being consigned to my more productive past…

Why don’t I join the two files in Windows and throw that at the script?

When I got off the bike, that’s exactly what I did.

And it worked.
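
The same join can be sketched in Python for anyone not on Windows. This is only an illustration under some assumptions: the file names are hypothetical, both files are assumed to be plain text in the same layout, and the second file is assumed to carry a header row that should not be repeated in the middle of the output (switch that off if your files have no header).

```python
def join_files(first, second, out, skip_second_header=True):
    """Concatenate two text files into one, optionally dropping the second file's header row."""
    with open(out, "w", encoding="utf-8", newline="") as dst:
        for index, src_path in enumerate((first, second)):
            with open(src_path, encoding="utf-8", newline="") as src:
                if index == 1 and skip_second_header:
                    next(src, None)  # discard the header line of the second file
                last = "\n"
                for line in src:
                    dst.write(line)
                    last = line
                # Make sure the two files do not run together on one line.
                if not last.endswith("\n"):
                    dst.write("\n")


# Hypothetical usage:
# join_files("international.txt", "gb_extension.txt", "combined.txt")
```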

So with that (finally) in the bag, I thought “Right, I’d better get that blog written”, which is what brings us to where we are.

Tomorrow I’ll start the task that I’d originally scheduled for Thursday morning: search through those 6.4 million rows of the Transitive Closure table in order to create the family tree of COVID-19. I plan on documenting it in Excel, with hyperlinks all over the place to glue everything together. In a strange kind of way, it kind of feels like my 49 years in programming have come down to this…
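
As a rough idea of what that search might look like, here is a small Python sketch using an in-memory SQLite database. Everything in it is an assumption made for illustration: the table name, the two column names and the concept names are invented, not real SNOMED identifiers or the real schema. The point is simply that with a two-column table, a family tree comes down to two lookups: everything above a concept, and everything below it.

```python
import sqlite3


def family_tree(conn, concept_id):
    """Return (ancestors, descendants) of a concept from a two-column closure table."""
    ancestors = [row[0] for row in conn.execute(
        "SELECT parent_id FROM transitive_closure "
        "WHERE child_id = ? ORDER BY parent_id",
        (concept_id,),
    )]
    descendants = [row[0] for row in conn.execute(
        "SELECT child_id FROM transitive_closure "
        "WHERE parent_id = ? ORDER BY child_id",
        (concept_id,),
    )]
    return ancestors, descendants


# Toy closure table with hypothetical names standing in for concept ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transitive_closure (parent_id TEXT, child_id TEXT)")
conn.executemany(
    "INSERT INTO transitive_closure VALUES (?, ?)",
    [
        ("disease", "viral_infection"),
        ("disease", "covid19"),
        ("disease", "covid19_pneumonia"),
        ("viral_infection", "covid19"),
        ("viral_infection", "covid19_pneumonia"),
        ("covid19", "covid19_pneumonia"),
    ],
)

ancestors, descendants = family_tree(conn, "covid19")
print("above:", ancestors)
print("below:", descendants)
```

On a table with millions of rows, an index on each of the two columns would probably be worth having before running lookups like these.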

Transitive Closure.