7 reasons why I don't like content 'aggregators' who scrape blog sites

Today a post on twitter drew my attention to Bioinfo-Bloggers, a site that aggregates content — i.e. the full blog post is reproduced — from 28 different bloggers who write about bioinformatics and genomics.

Outwardly, this might seem like a good idea. The bloggers get more exposure for their material, and readers can visit just one site instead of following 28 separate RSS feeds. However, there are several reasons why I have issues with this type of aggregation. Many of my concerns apply even when individual bloggers have expressly licensed their material for reuse (e.g. by use of a CC0 Creative Commons license).

  1. The site lists the 28 blogs as 'contributors' and lists the blog writers as 'authors'. This strongly suggests that the people in question have consented to their material being used, even when this is not the case.
  2. Links to the original blog posts are included, but only at the end of each reproduced entry. The included text says that 'This is a syndicated post', further suggesting that the original authors agreed to have their content syndicated.
  3. The Bioinfo-Bloggers website asserts copyright over all material (see footer section of website).
  4. The original bloggers lose web traffic. This can matter for minor reasons, such as when you want to include details of how popular your blog is in the outreach sections of research grants. But, depending on how much traffic Bioinfo-Bloggers gets, it potentially deprives you of knowing who is looking at your content, which articles are more popular, etc.
  5. People don't get a chance to comment on your blog (unless they follow the links). You may lose some direct engagement with your readers.
  6. If people start using this site rather than viewing your blog, what happens if Bioinfo-Bloggers stops including your blog site, or shuts down altogether? In the former case, people might just assume you are not posting any more.
  7. What happens if Bioinfo-Bloggers starts including content from other blogs that you don't approve of? Your blog post may appear alongside another which espouses views you find offensive.

The first three points could easily be addressed by removing the claim of copyright over all material, by making it explicit that this site is just scraping other sites and that the original bloggers may not be aware of this, and by placing links to the original blog content at the top (not bottom) of each article.

There are currently some ongoing discussions about this on Twitter.

What's in a name? Better vocabularies = better bioinformatics?

About 7:00 this morning I was somewhat relieved because my scheduled lab talk had been postponed (my boss was not around). But we were still having the lab meeting anyway.

About 8:00 this morning, I stumbled across this blog post by @biomickwatson on twitter. I really enjoyed the post and thought I would mention it in the lab meeting. Suddenly, though, that prompted me to think about some other topics relating to Mick's blog post.

Before I knew it, I had made about 30 slides and ended up speaking for most of the lab meeting. I thought I'd add some notes and post the talk on SlideShare.

I get very frustrated by people who rely heavily on GO term analysis without having a good understanding of what Gene Ontology terms are, or how they get assigned to database objects. There are too many published analyses which see an enrichment of a particular GO term as some reliable indicator that there is a difference in datasets X & Y. Do they ever check to see how these GO terms were assigned? No.
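
One way to check how the annotations behind an enriched GO term were assigned is to look at their evidence codes. Below is a minimal sketch, assuming the annotations are in the tab-delimited GAF format, where column 5 holds the GO ID and column 7 holds the evidence code (IEA means "Inferred from Electronic Annotation", i.e. no curator reviewed it). The function names and the example rows are made up for illustration; this is not taken from any particular tool.

```python
from collections import Counter
from typing import Iterable


def evidence_code_counts(gaf_lines: Iterable[str], go_id: str) -> Counter:
    """Count evidence codes for one GO term in GAF-formatted lines."""
    counts = Counter()
    for line in gaf_lines:
        # Header and comment lines in GAF files start with "!"
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip malformed rows
        # GAF columns (1-based): 5 = GO ID, 7 = evidence code
        if fields[4] == go_id:
            counts[fields[6]] += 1
    return counts


def electronic_fraction(counts: Counter) -> float:
    """Fraction of annotations that are electronic (IEA) only."""
    total = sum(counts.values())
    return counts["IEA"] / total if total else 0.0


# Hypothetical rows: database, object ID, symbol, qualifier, GO ID,
# reference, evidence code (remaining GAF columns omitted for brevity).
example_rows = [
    "!gaf-version: 2.2",
    "ExampleDB\tX0001\tgeneA\t\tGO:0006915\tREF:1\tIEA",
    "ExampleDB\tX0002\tgeneB\t\tGO:0006915\tREF:2\tIEA",
    "ExampleDB\tX0003\tgeneC\t\tGO:0006915\tREF:3\tIDA",
    "ExampleDB\tX0004\tgeneD\t\tGO:0008150\tREF:4\tND",
]

counts = evidence_code_counts(example_rows, "GO:0006915")
print(dict(counts), electronic_fraction(counts))
```

If most of the annotations for an "enriched" term turn out to be IEA, the enrichment may say more about how the annotations were propagated than about the biology of the dataset.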