Saturday, September 6, 2014

Visualizing OSM data with CartoDB to aid HOTOSM validation

[This post has been moved to: http://blog.spatialbits.de/post/vizualizing-osm-data-w-cartodb/]

Playing around with CartoDB has been on my list for a while now. I also started contributing to HOTOSM lately. During mapping and validation work for the Ebola-related HOTOSM tasks I noticed that in some areas the relevant features are not mapped as expected. Presumably inexperienced mappers map e.g. buildings as single nodes and/or don't apply the highway tag guidelines for Africa correctly. As many enthusiastic mappers may work on a sparse area within a short time, it's difficult to track down mistakes and notify the contributor early.
While JOSM offers flexible filter functionality, I find it hard to get a good overview of a wider area.

Hence I wanted to find out if and how the visualization features of CartoDB could be useful here.

So I signed up for a free test account at CartoDB (50 MB and 5 tables included) and downloaded the OSM data for Sierra Leone from Geofabrik.
While CartoDB can import OSM data directly and extract relevant data using SQL (PostGIS) queries, I imported the data into a local database and ran some queries to create three tables (CSV) containing the following features:

  1. all single nodes that are tagged as buildings
  2. all buildings whose building tag is not 'yes' (centroid points of the polygons)
  3. all start points of highways that are not tagged according to the guidelines 
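The extraction itself was done with SQL queries against the local database, which are not shown in the post. As a standalone stand-in, the first query (single nodes tagged as buildings) can be sketched in Python against a raw .osm XML extract; the sample data below is made up for illustration:

```python
# Sketch: find single nodes carrying a building tag in a raw .osm (XML)
# extract. In OSM XML, way vertices are untagged nodes, so any node with a
# building tag is a candidate "building mapped as a single node".
# The original post used SQL/PostGIS instead; this uses only the stdlib.
import xml.etree.ElementTree as ET

def single_node_buildings(osm_xml):
    """Return (id, lat, lon, building value) for every node with a building tag."""
    root = ET.fromstring(osm_xml)
    results = []
    for node in root.iter("node"):
        tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
        if "building" in tags:
            results.append((node.get("id"), node.get("lat"),
                            node.get("lon"), tags["building"]))
    return results

# Tiny made-up sample: one tagged node, one untagged node.
sample = """<osm>
  <node id="1" lat="7.96" lon="-11.74">
    <tag k="building" v="yes"/>
  </node>
  <node id="2" lat="7.97" lon="-11.75"/>
</osm>"""
print(single_node_buildings(sample))
```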
To make it easy to evaluate single features in JOSM, I wrote a Python script that adds a column to the CSV table containing a link that loads the feature in JOSM via the remote control plugin.
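The script itself is not shown in the post; a minimal sketch of the idea could look like this, assuming JOSM's remote control is listening on its default localhost:8111 port and that each row carries the OSM id (the column names here are hypothetical):

```python
# Sketch: add a JOSM remote-control link column to a CSV of features.
# Assumes the JOSM remote control default endpoint on localhost:8111;
# the id column name and 'kind' prefix (n=node, w=way, r=relation) are
# hypothetical choices for illustration.
import csv, io

def add_josm_links(csv_text, id_col="osm_id", kind="n"):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["josm_link"] = (
            "http://127.0.0.1:8111/load_object?objects="
            f"{kind}{row[id_col]}")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

print(add_josm_links("osm_id,building\n42,house\n"))
```

Clicking such a link in the CartoDB info window then pulls the single feature straight into a running JOSM session.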

Uploading the CSVs to CartoDB and creating visualizations is then straightforward, resulting in the following map:



The first layer ('single nodes buildings') shows some clusters where buildings are mapped as single nodes, e.g. just south of 'Bo', an area that was mapped as part of HOTOSM task #605.

The second layer ('building != yes') shows all building centroids, with the color indicating the tag value. Some clusters, e.g. where buildings are tagged as 'house', become obvious.

The third layer shows the starting nodes of highways, with the color indicating the related tag, revealing e.g. a lot of 'footways'.

CartoDB is easy to use and quickly produces visualizations that reveal areas with (potential) mapping/quality issues for further evaluation.
I set this up just as an experiment. As the OSM map evolves, the picture will change and my map will become outdated. However, CartoDB offers synchronization with data sources, so it would be rather easy to implement a workflow that creates e.g. a daily picture of a given area with a focus on specific validation issues.



Tuesday, June 17, 2014

Text Mining INSPIRE Conference Contributions

[This post has been moved to: http://blog.spatialbits.de/post/inspire2014/]

So the INSPIRE Conference 2014 (#inspire_eu2014) starts tomorrow - after two days of intensive workshops.
For me this poses the challenge of deciding which of the parallel sessions to attend. As I have been experimenting with the R framework lately, I decided to apply some text mining techniques instead of reading through all the abstracts, to get an idea about hot topics, trends and potentially interesting sessions.

Here are some of my 'results'. More on the methodology below.


To get a first impression I take a look at terms that appear frequently (15+) in the contributions' titles:


And the same for terms in the abstracts (150+):
Also from the abstracts, a nicer-looking word cloud (100+):
Now I'd like to identify contributions that deal with topics of interest (e.g. "benefits" (2+), "health" (1+) or "metadata" (5+)):


Taking the 'contribution ID' (just the number) I can access the full abstract:

http://inspire.ec.europa.eu/events/conferences/inspire_2014/schedule/submissions/<ID>.html

Besides that, the tm package offers a lot of functionality to analyse the datasets further. For example, I can identify terms that are correlated with a specific term. For instance, terms that are correlated (0.5+) with "wfs" (considering all abstracts) are:
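The underlying computation (tm's findAssocs()) is essentially a Pearson correlation between the per-document counts of the target term and of every other term. A rough Python equivalent on a toy corpus, purely for illustration:

```python
# Sketch of what tm's findAssocs() computes: Pearson correlation between a
# target term's per-document counts and every other term's counts.
# The three-document corpus below is made up; the real analysis ran over
# all INSPIRE 2014 abstracts.
from math import sqrt

def correlated_terms(docs, target, threshold=0.5):
    vocab = sorted({w for d in docs for w in d.split()})
    counts = {w: [d.split().count(w) for d in docs] for w in vocab}

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy) if sx and sy else 0.0

    t = counts[target]
    return {w: round(pearson(t, counts[w]), 2)
            for w in vocab if w != target and pearson(t, counts[w]) >= threshold}

docs = ["wfs download service", "wfs service endpoint", "metadata catalogue"]
print(correlated_terms(docs, "wfs"))
```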



So a few words on what I did.

Getting ready:
  • download the abstracts (wget)
  • remove headlines, HTML, blank lines, line breaks (sed, tr)
  • extract abstracts and titles (sed)
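The cleanup step might look roughly like this; the actual sed/tr invocations are not shown in the post, so the patterns below are stand-ins:

```shell
# Stand-in sketch for the cleanup step: strip HTML tags with sed, then
# squeeze repeated spaces and blank lines with tr (the post's actual
# patterns are not shown, and real pages would need more careful handling).
printf '<h1>Title</h1>\n\n<p>An   abstract.</p>\n' \
  | sed -e 's/<[^>]*>//g' \
  | tr -s ' \n'
```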
Preprocessing with the R tm package:
  • convert all characters to lower-case
  • remove numbers, punctuation, whitespaces
  • remove URLs 
  • remove stopwords
  • apply word stemming 
  • apply stemcompletion
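For readers without an R setup, the preprocessing steps above can be sketched in Python (the post used R's tm package; stemming and stem completion are omitted here, as they would need an extra library such as NLTK, and the stopword list is a tiny placeholder):

```python
# Rough Python stand-in for the tm preprocessing pipeline above.
# Stemming / stem completion are skipped; STOPWORDS is a placeholder
# for a real stopword list.
import re

STOPWORDS = {"the", "a", "and", "of", "to", "in", "for"}

def preprocess(text):
    text = text.lower()                          # lower-case
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"[0-9]", " ", text)           # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)         # remove punctuation
    words = text.split()                         # collapses whitespace
    return [w for w in words if w not in STOPWORDS]

print(preprocess("The INSPIRE 2014 abstracts: see http://inspire.ec.europa.eu!"))
```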
Now the set of documents is ready to run the analyses.