Showing posts from March, 2021

The importance of "small sanitisation"

I've been working on a Python script that takes the contents of Te Ara and puts them into a compressed archive. I was inspired by how the entire text of the English Wikipedia and of the Simple English Wikipedia can be downloaded as 18GB and 201MB archives respectively (as at March 2021), and I wanted my very own copy of Aotearoa's history. There were a variety of bugs along the way: my file path sanitisation function was missing calls to .replace() for some Unicode character conversions, and it didn't help that my browser and IDE displayed an em dash and a minus sign identically. I had implemented multiprocessing to save each article as a PDF once the sitemap had been scraped into a list, and all was running well at the time - it was 4x faster and blazing its way through them. However, a seemingly small bug came up at the tail end of execution. It was giving a FileNotFou