Status update CW27

Posted on Jul 12, 2021

Getting instant results from your own search index is pretty gratifying!

done

  • experiment with how to export/import our data between the indexer and Sphinx, for speed
  • automate exporting our data to Sphinx
  • configure Sphinx to use our data for searching with our (naive) ranking
  • serve Sphinx results on HubGrep
  • ingest crawler data in bulk
  • add authentication (api-keys) for our crawler endpoints (see the sketch just after this list)
  • make bugs, kill bugs, tests, cleanup, dinner, sleep, shower, brush teeth…
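
For the curious, here's a rough sketch of what an api-key protected bulk ingest endpoint can look like (Flask-style; the header name, payload shape and route are illustrative, not our exact code):

```python
# Rough sketch of an api-key protected bulk ingest endpoint (Flask-style).
# The header name, route and payload shape are illustrative, not our exact code.
from functools import wraps

from flask import Flask, abort, jsonify, request

app = Flask(__name__)
VALID_API_KEYS = {"crawler-key-1", "crawler-key-2"}  # would normally come from config


def require_api_key(view):
    @wraps(view)
    def wrapper(*args, **kwargs):
        # every crawler sends its key in a header; unknown keys get rejected
        key = request.headers.get("X-Api-Key")
        if key not in VALID_API_KEYS:
            abort(401)
        return view(*args, **kwargs)
    return wrapper


@app.route("/api/v1/repositories", methods=["POST"])
@require_api_key
def ingest_repositories():
    # crawlers POST a whole batch of repositories at once, so the indexer
    # can insert them in bulk instead of handling one row per request
    payload = request.get_json(force=True) or {}
    repos = payload.get("repos", [])
    # ... bulk insert `repos` into the database here ...
    return jsonify({"received": len(repos)}), 202
```

The point of the batching is simply that each crawler sends one request per chunk of results instead of one request per repository, which keeps round trips and tiny single-row inserts down.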

challenges and what we’re doing

The past two weeks have been a bit of a stopwatch session, and that will continue, since we are now starting to see concretely how long things take: from crawlers retrieving data from somewhere online, to our indexer bottlenecking on data insertion when too many crawlers report back at once, to exporting our unified data before handing it to Sphinx. On the whole, we want to be able to crawl and index an external repository hoster at least once per day. This is really only a question for GitHub, since all the rest are minuscule in comparison. Right now it looks doable with what we have, but we've yet to actually do it, so it remains to be seen.

RAM is also a concern, since we're currently running our indexer on a Raspberry Pi 4 (4 GB RAM). Let's avoid the "why" of that, and I'll instead say that it looks like it's going to be alright… so far. We have a ballpark number for the total amount of data we will end up with and the structure it will take in our database, and unless something scales horribly non-linearly, that will be alright as well!
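
As an aside on that export step: the idea is to stream the unified data out in chunks instead of loading everything into memory at once. A minimal sketch of what that can look like, assuming a Postgres-backed indexer and an illustrative repositories table (not our exact schema):

```python
# Rough sketch of a chunked export: stream rows from the unified data
# into a TSV file that Sphinx can pick up (e.g. via a tsvpipe source).
# Connection string, table and column names are illustrative.
import csv

import psycopg2


def export_to_tsv(dsn: str, out_path: str, chunk_size: int = 10_000) -> None:
    conn = psycopg2.connect(dsn)
    # a named (server-side) cursor, so the whole table never sits in RAM at once
    cur = conn.cursor(name="sphinx_export")
    cur.itersize = chunk_size
    cur.execute("SELECT id, name, description, stars_count FROM repositories")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for row in cur:
            # a real export would also escape tabs/newlines in text fields
            writer.writerow(row)
    cur.close()
    conn.close()


if __name__ == "__main__":
    export_to_tsv("dbname=indexer user=indexer", "/tmp/repositories.tsv")
```

The chunked streaming matters mostly because of the Raspberry Pi's 4 GB of RAM: the export can't assume the whole dataset fits in memory.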

So that’s the early victory dance out of the way. Coming up next are the ever-recurring reality checks of “things we didn’t consider”, bound to bite us in our collective ass. When that happens, it will now at least be by design.