Status Update CW19 2021 - Crawly slimy things

Posted on May 17, 2021

It feels like we’ve been in a lull for the past two weeks, but that’s likely because our current tasks offer less visible feedback than before, when we worked on smaller steps with immediate, often frontend-facing results. I’m telling myself it’s only a perceived lull anyway, until we get to plug things together and see the whole of it.

done

  • more planning than we had planned
  • research search engines (and decided to start with sphinx-search)
  • generate fake data to test search engine with
  • research ways to crawl GitHub, GitLab and Gitea

doing

  • implement crawlers
  • test sphinx-search with fake data

motivations and challenges

When crawling for GitHub repository metadata - and this might not be too surprising - GitHub doesn’t seem very interested in making it easy for someone to extract their entire dataset. Across their various API routes and entry points, the search API limits pagination to the first 1000 results, no matter how you use it. So at 100 items per page you get 10 pages, and then it stops.
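
A minimal sketch of how that cap shows up (the query string and the unauthenticated call are just placeholders):

```python
import requests

# Sketch: the Search API serves at most the first 1000 results, so with
# per_page=100 anything past page 10 is refused.
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": "stars:>0", "per_page": 100, "page": 11},
    headers={"Accept": "application/vnd.github+json"},
)
print(resp.status_code)        # 422
print(resp.json()["message"])  # e.g. "Only the first 1000 search results are available"
```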

Another route exposes repositories directly with 100 items per request and no pagination limit - but the downside is that every interesting data point on these repositories is hidden behind another URL. So if there are 74 million public repositories on GitHub (number from the GraphQL API), you can multiply that by around 20 (counting low) to get the number of requests needed to crawl the data. Combined with rate limiting, this really blows up the time a crawler needs to cycle through the dataset.
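
A rough sketch of walking that listing route (simplified, with no auth or rate-limit handling; the detail fields shown are just examples):

```python
import requests

# Sketch: the listing endpoint returns minimal repository objects; the
# metadata we actually want sits behind a per-repository detail URL, which
# is where the roughly 20x request multiplier comes from.
session = requests.Session()
session.headers["Accept"] = "application/vnd.github+json"

page = session.get(
    "https://api.github.com/repositories", params={"since": 0}
).json()
for summary in page:
    detail = session.get(summary["url"]).json()  # one extra request per repo
    print(detail["full_name"], detail.get("stargazers_count"))
since = page[-1]["id"]  # cursor for the next page of summaries
```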

So using the search API simply does not work. The second alternative, which seems to be an intended way, DOES work, though it’s very inefficient. Another working solution we found was to query for users and fetch the repositories tied to those users. Though not crucially so, this requires additional requests per repository, and there is overlap between users who share the same repositories.
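
For reference, the user-centric route boils down to something like this (again simplified; the overlap shows up when the same repository is listed under several users):

```python
import requests

# Sketch: enumerate users, then list each user's repositories - which costs
# extra requests and can return the same repository more than once.
session = requests.Session()
session.headers["Accept"] = "application/vnd.github+json"

users = session.get(
    "https://api.github.com/users", params={"since": 0}
).json()
for user in users:
    for repo in session.get(user["repos_url"]).json():
        print(repo["full_name"])
```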

Lastly, and what we use right now: going via GitHub’s GraphQL API and requesting repositories directly by ID. This gives us 100 repositories per request, with the downside that we have to guess at these IDs. Old, deleted IDs are not reused, which causes gaps in our incremental guessing, and at least for the first few thousand it seems that around 80% of IDs are missing. So we get around 20 repositories per request instead of 100, hopefully with a higher average over the whole dataset once we are through with a full cycle.
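
A hedged sketch of what that could look like; constructing candidate IDs from the legacy base64 node-ID format ("010:Repository&lt;n&gt;") is one way to do the guessing, shown here for illustration:

```python
import base64
import requests

# Sketch: nodes() accepts up to 100 IDs per request; guesses that don't
# resolve come back as null (with matching entries in the "errors" array).
TOKEN = "ghp_..."  # placeholder personal access token

def candidate_ids(start, count=100):
    return [
        base64.b64encode(f"010:Repository{i}".encode()).decode()
        for i in range(start, start + count)
    ]

QUERY = """
query($ids: [ID!]!) {
  nodes(ids: $ids) {
    ... on Repository {
      databaseId
      nameWithOwner
      primaryLanguage { name }
    }
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": QUERY, "variables": {"ids": candidate_ids(1)}},
    headers={"Authorization": f"bearer {TOKEN}"},
)
nodes = resp.json()["data"]["nodes"]
hits = [n for n in nodes if n is not None]
print(f"{len(hits)} of {len(nodes)} guessed IDs resolved to repositories")
```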

Of course, there might be other ways to do what we want which we simply didn’t find. In fact, we tried to get tips from GitHub support on how best to do this, but were met with a response suggesting they either didn’t understand our question or didn’t want to.

A quick disclaimer - we understand that someone with a lot of data may not want to expose a “download everything” button. There can be many reasons for that, aside from the ones we might be guessing at.