Crawling around on the web

We’ve been putting a certain amount of effort lately into thinking about how to recruit an army of hunter-gatherers, using the recent publication of the Experiences & Outcomes to be expected from the Curriculum for Excellence as a sort of framework for thinking about this.

But it’s easy to forget the other (electronic) sort of hunter-gatherer. I start by imagining this as a sort of super-fast student googler (a bit like my children, only turbo-charged), endlessly running search queries. There is a fair amount of information already available on the web, ready for visiting, labelling and indexing so that it may be found easily by the people whom we want to help. What organisations like Google do is deploy ‘web crawlers’ to undertake this sort of work – could we seek to do anything similar in the much more limited domain of resources potentially applicable to local help for people with Long-Term Conditions?

The preliminary answer is yes. The technology is certainly available, either publicly (open source) or from our friends at NES, who have a substantial licence for the use of ‘FAST’ search tools, and describe their vision for the use of this as an ‘NHS Google’.

What is equally important for ALISS is – I think – how this technology will be managed and governed. One doesn’t just let a web-crawler scuttle out the door and hope it comes back at some indeterminate point with something useful. Wikipedia has a useful primer on the web crawler topic, and it suggests that we’ll need to answer four key questions – and keep the answers under review:

  • which pages to seek out (selection);
  • when/how often to check these for updates (a re-visit policy);
  • how to avoid overloading websites we want to check, with a bombardment of queries (a.k.a. a ‘politeness policy’);
  • and (a more technical one) how to co-ordinate the work of more than one crawler – we are quite likely to set up distributed/devolved searches (parallelization).
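
To make the ‘politeness policy’ a little more concrete, here is a rough sketch in Python of what it might involve. It is only an illustration under assumptions of my own: the class name PoliteScheduler, the five-second gap, the crawler name ‘aliss-crawler’ and the example.org addresses are all made up for the purpose, and a real crawler would need rather more than this. The idea is simply (a) to respect each site’s robots.txt file and (b) to leave a minimum gap between successive requests to the same host.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteScheduler:
    """Hypothetical helper: enforces a minimum gap between visits to one host."""

    def __init__(self, min_delay_seconds=5.0, clock=time.monotonic):
        # min_delay_seconds is an assumed value, not a recommendation.
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.last_hit = {}  # host name -> time of our last request to it

    def wait_needed(self, url):
        """Return how many seconds to wait before fetching this URL."""
        host = urlparse(url).netloc
        last = self.last_hit.get(host)
        if last is None:
            return 0.0  # never visited this host, no need to wait
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_visit(self, url):
        """Note that we have just sent a request to this URL's host."""
        self.last_hit[urlparse(url).netloc] = self.clock()


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example: a site that asks crawlers to stay out of /private/.
robots = "User-agent: *\nDisallow: /private/\n"
can_read_help = allowed_by_robots(
    robots, "aliss-crawler", "http://example.org/help/index.html")
can_read_private = allowed_by_robots(
    robots, "aliss-crawler", "http://example.org/private/notes.html")
```

A crawler built along these lines would call allowed_by_robots before each fetch, sleep for whatever wait_needed returns, and then call record_visit. The length of the gap is exactly the sort of thing we would need to agree on and keep under review.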

So what might be the project implications of needing these policies? A couple spring immediately to mind:

  • We would, for example, need to form a consensus view of where our crawlers should look, and how often, which might be informed by consideration of the results from their searches;
  • we would need to ensure time was available to respond to queries from other web server administrators (who, if they are doing their job properly, will be keeping an eye out for crawlers) about what our intentions are etc.

The second task seems like a largely administrative thing.

Meantime, what about the first? We’re seeking to proceed on a collaborative basis, so I think it will be important to be transparent about this. Could the necessary processes be framed as a series of useful learning exercises, perhaps for older secondary or tertiary education students?
