Why hash tags are broken, and ideas for what to do instead.

I was at Moseley Bar Camp last Sunday and there were some great sessions. Andy Mabbett stood up to lead a discussion entitled Let’s Play Tag: recent developments and emerging issues in the use of tagging for added semantic richness.

Andy was looking for discussion on how to solve the problem of ambiguity in hash tags – a popular technique for categorising community tweets on twitter. His example is classic event tagging, the tag for the event was #mbcamp which works fine for the duration of a Sunday afternoon event, but what if you want tags to be more enduring?

Andy took us step-by-step through the issue of ambiguity of usernames as tags on twitter and flickr and described some of the issues of differing tag normalisation rules.

Andy also asked why we tag?

  • To add semantic richness?
  • To help your friends find stuff?
  • To help machines in 100 years find stuff
  • Don’t know

I tag for all of those reasons, but not on twitter. On twitter I use hash tags to contribute to an in the moment conversation that’s happening at a particular event or on something topical.

Andy’s issue, then, is with the value of these tags longer term and on more enduring stuff like blog posts, photos on flickr and so on. Perhaps 100 years might be pushing it, but it’s worth thinking about.

The problem with hash tags comes from the tension between finding something specific enough for the moment, something short enough to not use up too many of the 140 characters and something easy to remember. That’s two forces pulling one-way (shorter) and only one pulling the other.

The shorter the tag goes the easier it is to remember and to type, and the fewer character it uses up, but it also becomes more likely to clash with others. Perhaps some mainstream trends might get away with very short tags, I thought. #fb for example means facebook, surely, but looking at the use of it apparently the references to facebook are far outweighed by the noise.

So, twitter’s 140 character limit and the profusion of clients means we can only have short, easy to remember text tags, but the need for disambiguation and to be more specific means we need something longer.

We could solve the ambiguity problem by using something like a guid, but that’s not easy to remember or type, and is generally quite long. The length issue could be solved by encoding it using unicode characters. Twitter counts multi-byte UTF8 characters as single characters, which is correct, and this opens up some interesting unique tags for those willing to forego the easy typing.

By long I mean cf629dc3-d425-4707-8119-1f35d35d7687 which is a fairly typical GUID and is 36 character long. That’s too long if you only have 140 characters to play with. The length comes from the need to encode it as ASCII. Twitter, where our length obsession comes from, doesn’t require characters to be ASCII. The 140 character limit is for 140 UTF8 characters, so we can use a much greater range of characters to represent the same degree of uniqueness in a shorter UTF8 string.

UTF8 isn’t ideal as a starting point, though, as the number of bytes per character varies. The unicode definition uses nice simple 2 byte indexes, so we match 4 ASCII characters from the GUID to a unicode character, then use the UTF8 encoding for those to write it down. By using unicode and UTF8 it becomes just a handful of characters, just 8 for this GUID.

cf62 콢, 9dc3 鷃, d425 퐥, 4707 䜇, 8119 脙, 1f35 ἵ, d35d 퍝, 7687 皇

This gives us a tag of #콢鷃퐥䜇脙ἵ퍝皇 which is not easy to type, would be difficult for many to visually identify and could, for all I know, be extremely offensive to those who read CJK, Hangul or Greek. I may have got lucky with that GUID too, there may be GUIDs that don’t produce valid unicode pairs.

But, as it’s a GUID it gives a very high confidence that is unique, it’s only 8 characters long and works as a unicode tag on Flickr and a unicode tag on Hashtags. Just don’t look at the raw URLs in the source of the page…

What we lose with that approach is a good deal of ease-of-use. I certainly wouldn’t try this technique at an event.

If you’re prepared to lose a little usability, maybe giving people an easy place to grab a copy/paste version of the tag then you could produce something more easily readable, if not easy to type: #dɯɐɔqɯ for example. I might be tempted to do that, or add a graphic symbol or something.

There’s something else that nags at me about hash tags, though. They’re really not very webby. You rely on search and on hashtags.org and other specific tools to make sense of them. They can be easily abused, as Habitat showed recently.

So are there other ways to think about tagging? Ways that work with the web rather than just on the web. Examples from those applications where the 140 character limit does not apply? Blog posts, web pages, flickr images and so on?

What if we decided that our requirements for tagging were:

  1. A very high degree of uniqueness
  2. Anyone can get information about the tag easily
  3. Spam and content visible on the tag controlled by the tag owner
  4. That the tag can be enduring
  5. That the tag can be used anywhere on the web easily
  6. That content using the tag can be found with search
  7. That content using the tag can be found without search
  8. That no particular service or piece of software is necessary

In it’s essence, tagging is about saying this comment, blog post or image is about this event, concept, product etc. In the blogging world it’s very common to say this post is about the content in this other post. We do that through trackbacks and through simple links. Many blogs accept trackbacks and look at the referring page information so that they can provide links, alongside comments, to other posts referring to them.

A similar things happens with Google’s PageRank algorithm. Words used in links to a page, as well as the content of the referring page, contribute to the way a page is indexed.

The Semantic Web bases everything on URIs (the difference between URI and URL is not important here). If you want to give something a name you don’t pick a word, you use a URI.

I wonder if we could use URIs as tags? And how that would meet the needs above. Say we were to use http://wxwm.org.uk/moseleybarcamp/2009/June to mean the event that happened last weekend.

It has a very high degree of uniqueness, so it meets our first requirement. It can be put straight into a browser and can provide a page giving details of the event, so it’s easy for anyone to get information about the tag. The page at that address can be as clever, or as dumb, as it likes about showing things that link to it – so tag spam can be removed. The link is under control of the domain owner, so can be as enduring as you want to make it. Almost everywhere on the web allows you to post links, so it’s easy to use. Links to a specific URL can be easily searched for in Google and other search engines, and in Flickr and Twitter. Most browsers will send referring page information when requesting the URL, so content can be tracked without search – this means you can find out about unindexed and intranet sites referencing the tag. The URL can be a static page, or a script, it can monitor referrers and spam filter – or not. There is not centralised service needed nor any specific software.

Oh, and it could easily be made to work as Linked Data, the pattern for publishing data on the semantic web, to provide machine-readable information about the event and the conversation happening around it…

I think that only leaves the issue of URI length. I can’t get close to the 8 characters of the guid, or the 6 of mbcamp, but using bit.ly I can make a memorable short URL such as http://bit.ly/utf8tag that redirects to a much longer one, and as bit.ly don’t re-use URLs the bit.ly link remains as unique and almost as enduring (subject to bit.ly’s survival) as your own.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>