Random Etc. Notes to self. Work, play, and the rest.

Posts Tagged ‘Data’

Dataviz Salon SF #1

Michael Driscoll of Dataspora invited a few people to a Dataviz Salon yesterday evening. Mike and I went along and huddled in a brick-built basement in SoMa to listen to the following:

Thanks to Michael for putting on a great event and getting everything together at such short notice. Hopefully there'll be another one soon!

Airport Scheduling

It turns out Oakland airport has a form to fill in if you want flight schedule information. I haven't tried it yet, so I'm not sure if they'll respond to casual interest, but it's nice to know they're accessible.

They also have an interesting PDF talking about how to interpret the data. Heathrow had nothing of the sort when I worked with their schedules at my last job. It was more a combination of hearsay, logic and rules of thumb to predict gate assignment there. Good stuff.

A quick (less certain) note on using Lucene in Processing

In the spirit of continuing our impromptu database week on Processing Blogs (my post, toxi's post, then Florian Jennet updating the sql library), I thought I'd post another quick example using Lucene.

Last week Ryan and I needed a reliable way to search inside a data set we were working with. I had previously tried and failed to write my own useful search routine for the same data, so I wanted to take a look at Lucene instead. (This week, I might have used SQLite, but I hadn't tried it last week!).

Lucene isn't a relational database like MySQL or SQLite, although it has a few similarities with the way most database engines speed up queries using indexes. That's because Lucene is "just" the indexing part. You tell Lucene about your data, one part at a time, and then construct queries to ask it which parts of your data match the query. The key thing is that you keep hold of your data yourself and structure it any way you like, Lucene keeps its own representation of the data for searching. Because of this it can index much more data than you can store in memory.

Anyway, I put up an simple example Lucene applet here that indexes the text of Time Machine by H.G. Wells from Project Gutenberg and lets you type queries against it. The text itself and the Lucene index it builds are both quite small (tens of KB, compressed), but the applet is around 750KB. This is because Lucene's core jar file is about 500KB so it's more suited to standalone projects and applications.

The code is kind of documented, but Lucene was really too much for me to understand properly in just one day. Nevertheless, again, I hope people find it useful!

A Quick Note on using SQLite in Processing

SQLite is a great standalone SQL database engine - not ideal for every situation (particularly large websites), but more than good enough to have already made its way into desktop projects from Apple, and more recently Adobe and Google.

My colleague Mike recently used it as a way to distribute data from a 511.org transit website scraping project he's been doing. I wanted to see if I could download his data (gathered and processed using python) and access it using Processing.

Web searches for SQLite and Java (java wrapper sqlite, etc.) turn up lots of matches, including this promising tutorial from Tim Anderson, but sadly the most prominent matches didn't work for me. It's a real pain to get up and running and installed in a reliable way, mainly due to the need to compile SQLite natively for each platform and then talk to the native code using Java.

Enter this amazing project, a JDBC library for SQLite written in pure java that uses NestedVM to compile the C code for SQLite into something any Java VM can use. It's frightening to think about the amount of misdirection, abstraction and interpretation going on here, but it worked first time and plenty fast enough for my purposes.

Take a look after the jump for the code I ended up with. I hope people find it useful:

(more...)

Travel Time Tube Map

I've finally had time to get my Travel Time Tube Map applet to a presentable stage.

Here are a couple of screen shots to compare with Oskar and Rod.

There's a list of desired improvements on the applet page, but the next step for me is plotting this information on the Harry Beck style diagram rather than a geographic map. If anyone knows of a vector format tube map I could use to get me started, please let me know.

Maps, the Tube, and RDF headaches

Snowbound and carefree, I'm playing around with different methods of presentation for the ubiquitous London Underground (tube) map.

I found RGB versions of the tube line colours over at Rodcorp, so that saved me some bother.

Being the only machine-readable single-file resource I could find, I'm using Jo Walsh's RDF representation of the station locations and connections, a leftover from the sadly defunct MudLondon. I'm not sure yet if it's up to date, or complete, or internally consistent (reason number 1 in an ongoing series of why the semantic web might not be all that it's cooked up to be). Once I fix my doubtlessly buggy RDF parser and check with Jo for any pointers, I'll see if I can do better than this:

At least it's a start. If anyone has suggestions/links for alternative data sources I'd be very grateful - is there an electronic format schematic for the tube available from an official source? Aha - these CSV files from Wikipedia look more promising!

On with some alternative representations of the data - next stop: time to travel.

Half-baked thoughts on Social Network Visualisation for Flickr

Sites like Flickr (Friendster, Orkut, Tribe, MySpace, etc.) are collecting masses of data about people. Person A knows Person B, B knows C and D, D knows E, who knows A, and so on. If you're into your discrete maths (or your systems thinking), you'll know that the formalisation of these kinds of connections is called Graph Theory. If not, then bear in mind that in graph theory a Graph is composed of Nodes (or points) and Edges (or connections). If A,B,C,D,E are nodes, then "A knows B" is an edge between A and B. (Or if Flickr is a pair of shoes, A,B,C,D,E are the eyelets, and "A knows B" is a shoelace.)

My first example can be drawn in 2D like so:

Plotting these relationships in 2D, or 3D space is known as graph embedding, and finding useful ways to interrogate them is an intriguing problem to think about. Indeed, it's a research field in its own right. Graphs are well understood data structures and many tools are available to manipulate and analyse them (e.g. GraphViz). One reason for this is that they are the cornerstone of many computer science problems (and solutions), but also because they can be used across many different disciplines. The ubiquitous application of graph theoretical methods across many scientific disciplines is the main subject of recent popular science books such as Linked and Ubiquity.

Social Network Visualisation tends to borrow heavily, if not totally, from graph theory. Having looked at several examples of this over the past year, I've spotted some common pitfalls which I'll try and articulate here.

Graph Visualisation Pitfalls

On Flickr, we can easily find out if A knows B by looking to see if A has listed B as a contact. But listing of contacts is tied up with all sorts of other practical considerations. The main reason A adds B as a contact isn't so that we can use that data for social network analysis, unfortunately. On Flickr, it says A knows B, or A likes B's photos, or A chatted to B in the forums and might want to find them again, or B added A (for any of these reasons) so A was being polite and reciprocated, and so on. Not all connections are made equal.

In reality A might have anywhere from 0 to 500 contacts. As the average number of contacts creeps upwards, the naive attempt to draw the graph falls over. My ASCII art attempt would be screwed as soon as there was a group of five mutual acquaintences. Even with the assistance of mature software like GraphViz, the network of X knows Y is too dense to draw clearly. The not-a-tree problem is faced by many network visualisations. There are probably too many connections to graph.

For many of us, it's intuitive to visualise these relationships as a tree. This cluster is connected to that cluster, and people are either in one cluster or another. Clusters probably have sub-clusters. We can handle these kind of relationships easily, but unfortunately they fall down straight away when presented with real data. For example, grouping people by handed-ness is a trivial example of something which generates overlapping sets. Imagine A and B are left-handed, D and E are right-handed, but C is ambidextrous. We aren't dealing with heirarchical clusters, we're dealing with overlapping sets.

A knows B, B knows C, C knows D. What's the connection between A and D? If we're analysing a terrorist network, then we might have found a potential link which is worth investigating further. But if we're trying to recommend photography or music or web-links to A, should we include D's tastes? If D is C's drug dealer, and B is C's little sister, and A is C's elderly next door neighbour? Probably not. Connections aren't transitive.

Show Me The Eye Candy Anyway

So assume we're aware of the above caveats, and we have a densely but ambiguously connected graph of contacts. The main task with this kind of visualisation, once you have the data, is how to display it in a readable format. It's almost certainly too much data to just throw out there (but it's always worth a try), so how do we prune it down to show only the meaningful stuff?

On Flickr, GustavoG has been busy producing graphs of the mutual contacts and testimonials networks. You can see them all in his FlickrLand set. These are interesting to examine for many reasons, not least in how he has pruned the network in order to get manageable visualisations from it. The whole social network on Flickr would be too big to show in detail, so Gustavo doesn't show it all. He rightly spots that testimonials should indicate fairly strong ties, and the network is much sparser than the contacts network. He's also attempts to trim the constacts network down by set the requirement that contacts must be mutual for a graph edge to exist, and he's tried different thresholds for how many mutual contacts a person must have before they are added to the graph.

Gustavo has used yEd, a Java graph layout program, to produce graphs using the "organic layout", and it works pretty well. In particular, in certain graphs there are undeniably meaningful clusters to be found, ones that expert users can spot straight away. At 50 mutual contacts and 100 mutual contacts in particular, the clusters are pronounced. At 10 mutual contacts the network is too dense to be very meaningful - certainly as it's presented at the moment it doesn't say much at all. At 200 mutual contacts, it tells us what many regulars to Flickr already know - there are only a few very well connected folks, and they mostly know each other. Because testimonials indicate stronger ties, the overall network is very fragmented, leaving many loose mini-clusters. Nevertheless, the overall picture is surprisingly well knitted together.

FlickrGraph

Gustavo's method for making the data manageable is to remove nodes which aren't significant in the overall picture. For instance, in mapping out clusters of users on Flickr, it's probably not a big deal to lose people with less than 10 mutual contacts. Marcos Weskamp recently used a different tactic to cut down the same data - remove edges, and only show a subset (window) of nodes at any one time.

Marcos's FlickrGraph is fantastic to play with - the interactivity and clean design help there - but unfortunately it is less meaningful than Gustavo's static graphs.

The FlickGraph suffers from the data source because the current Flickr API will always return contacts sorted by user-id (an essentially arbitrary number). Because of this, and the limit of ten contacts shown, a lot of popular people won't appear to be connected to anyone despite the fact that they are.

Ways To Improve These Visualisations

Wanted: Richer Data.

We already know how to get meaningful visualisations from our data. We have to start with meaningful data. Marcos Weskamp has demonstrated a neat way of graphing mailing list interactions with his Social Circles project. In my opinion, the FlickrGraph lacks some of the insight that Social Circles provides. This is partly due to technical implementation, but mainly because Social Circles is sourced from real interactions and implicit connections, not watered-down explicit declarations of interest. Flickr users like HyperBob are very active in the forums and comments, but don't keep contacts on Flickr. Social Circles-style data would capture that.

On Flickr, there are several implicit contacts networks we could use:

The idea here is that we should be visualising the actual social things people are doing, rather than the social acts they say they are doing. I also touched upon these ideas in a post about Architects, Social Networks and Hypertext. But enough of my ranting, I'm off to try one of these methods.

Plasticbag Visualisation

Last week Tom Coates asked for people to have a go at visualising his 5 years of weblog posts. Looking for any excuse to help with the beta-testing of the Processing environment, I gladly obliged.

Today it turns out that apart from Cal Henderson (a close friend of Tom's, who used the opportunity to take the piss a bit) nobody else contributed. How embarrassing! Nevertheless, it was an interesting diversion for a couple of hours. For an explanation of what I came up with, see Tom's analysis.

Anyway, I'll post the code here once there's a public release of Processing with which it will work. Until then, if you want to know more - or if you think I got it wrong - you can email me tom(at)tom(dash)carden(dot)co(dot)uk.

Metadata for the Masses

Over at Adaptive Path, Peter Merholz is talking about bottom-up classification systems like the tags used to organise data on del.icio.us and Flickr. If you're not familiar with these sites - they deal with internet bookmarks and photos, respectively - one of the main features is the ability to add multiple tags (like keywords), to your data, so that you can find things easily. The tags are entered as free text, so there's very little effort involved in adding them, and they aren't intended to be complete or unique.

In his article, Merholz brings the term ethnoclassification to our attention - defined as "how people classify and categorise the world around them" - and compares the use of free tagging systems to the landscape designer's use of "desire lines" to place paving (see On The Beaten Path for a good look at emergent paths) .

He also speculates about where these bottom-up classifications are headed next,

Use the tags to understand how people consider the content at hand. Then you can “pave” the best paths to ensure findability — say, by explicitly linking “nyc,” “newyork,” and “newyorkcity.” You can also align these tags with more formal schemes, thus enhancing the utility of both.

This raises some interesting issues, not least of which is the fact that Joshua Schachter over at del.icio.us and Stuart Butterfield over at Flickr seem hostile towards anything which might be seen as an attempt to standardise tagging systems. Merholz isn't suggesting standardisation here, but it's easy to get onto a slippery slope. Once we realise "nyc", "newyork", and "newyorkcity" are similar then the temptation is to merge them, but for all we know, the distinction may be important to some users. The solution is to offer browsing of multiple tags as if they were one (a union of tags) as an optional view of the data.

This is why the emergent paths comparison is a good one, especially in the case of del.icio.us where it's easy to see how similar tags could be suggested through usage, because different people will be adding and tagging the same URLs. In the case of Flickr though, tag consensus will be harder to reach unless tagging is opened up to everyone, perhaps to tag their collections of favourites. That way, when people search for a particular tag, Flickr could use the favourites tags to offer related tag suggestions. Because it is an optional query refinement rather than a unification of terms, it then becomes an interface issue and not a complex and unwanted database normalisation task. Over at del.icio.us, Joshua is already experimenting with user/tag similarity suggestions, hopefully Flickr will soon.