Status Update CW21 - Integrating stuff!

Posted on May 25, 2021

After working more or less separately on different topics, we finally had a longish talk about how to connect our crawlers to a backend that collects the crawled data, and how we could integrate that data into our HubGrep search as a local search backend.

done

  • generating fake data to test, and learning sphinxsearch (still a long way to go)
  • thinking about the changes we have to make to search, when we want to add our own search index (see motivations/challenges)
  • more work on crawlers, talking about how to interface to an indexer backend for data collection
  • started working on an indexer

doing

  • getting an initial version of the indexer up and connecting the first crawler
  • more crawlers!

motivations and challenges

tl;dr: There is a bigger topic coming up when we add an import from our (upcoming) indexer/crawler to HubGrep.

We are still working on separate crawlers for all kinds of code hosters, collecting the results in an indexer, so that we can publish the complete repository metadata of all hosters separately. Afterwards, we want to add an import function to HubGrep: a HubGrep admin should be able to bootstrap a new, self-updating local search index from our indexing service - so that we don't have to rely on (sometimes really slow) metasearch.

That changes our current workflow a lot: in theory, users no longer need to add hosters to HubGrep instances, but to the indexer. The admin still has to decide which hosters they want to have indexed locally (because an indexed copy of all of GitHub could be huge). But we still want users to be able to add new instances, so we probably forward users to the indexer, and let them add the instances there?
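
To make the flow above a bit more concrete, here is a rough sketch of the idea: crawlers submit repository metadata to an indexer, and a HubGrep admin imports only the hosters they picked. This is not the actual indexer code; the `Repo` and `Indexer` names, their fields and the in-memory storage are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    """Minimal repository metadata (hypothetical fields)."""
    hoster: str            # e.g. "github.com" or a self-hosted instance
    name: str
    description: str = ""


class Indexer:
    """Toy in-memory stand-in for the indexer backend (hypothetical API)."""

    def __init__(self):
        # hoster -> {repo name -> Repo}
        self._repos = defaultdict(dict)

    def submit(self, repos):
        """Called by a crawler; re-submitting a repo overwrites the old entry."""
        for repo in repos:
            self._repos[repo.hoster][repo.name] = repo

    def hosters(self):
        """All hosters the indexer currently has data for."""
        return sorted(self._repos)

    def export(self, wanted_hosters):
        """Return only the repos of the hosters an admin chose to index locally."""
        return [
            repo
            for hoster in sorted(wanted_hosters)
            for repo in self._repos.get(hoster, {}).values()
        ]


# Two crawlers report their results to the indexer ...
indexer = Indexer()
indexer.submit([Repo("github.com", "some/repo")])
indexer.submit([Repo("gitea.example.org", "other/repo", "a small project")])

# ... and an admin bootstraps a local index without the (huge) GitHub data.
local_index = indexer.export({"gitea.example.org"})
```

Keeping the hoster selection on the export side means the indexer can collect everything, while each HubGrep instance only pays for the data its admin actually wants.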