Thursday, June 3. 2010
This time I needed some geographical data evaluation for one of my current projects. The data used to be opensource and as possible qualitative and complete. After spending some time in front of the famous search engine window there was a solution:
That's the first sentence from the site:
The GeoNames geographical database covers all countries and contains over eight million placenames that are available for download free of charge.
Geonames is used by number of sites. Besides data it has also web services and code libraries in many programming languages to access them.
Direct web services are such a great idea, but an objection I've met for them was a question: "what happens when the service is unavaliable?". So I did just weight up pros and cons which could affect my current project.
Using Geonames web services
- no stress for the local system to perform db requests
- no need for data synchronization
- if the service gets unavailable or , it could make the rest of the app unusable
- we've got only data delivered by service
Using Geonames imported data
- possibility of the data availability is definitely high
- some data could be corrected locally or missing data could be added locally (I don't say that's nice)
- data amount is relatively high (about 1gb+) on average ext3 partition
- additional effort for data and deltas import
That's it, at the end of the day I've decided for much application stability and for local data assess all the more we have enough space on the server.
First lets start with database design, because the data we will get from the Geonames is a simple CSV format. Some approaches for MySQL can be found on geonames forums but in my case it has to be Postgres. Therefore doing it from scratch took a while to define resonable field definitions.
DROP TABLE IF EXISTS iso_languagecodes CASCADE; DROP TABLE IF EXISTS geoname CASCADE; DROP TABLE IF EXISTS countryinfo CASCADE; DROP TABLE IF EXISTS geoname_geonameid_key; DROP TABLE IF EXISTS iso_languagecodes_iso_639_1_key; DROP TABLE IF EXISTS countryinfo_key; CREATE TABLE geoname ( geonameid INTEGER UNIQUE, name VARCHAR(200), asciiname VARCHAR(200), alternatenames VARCHAR(5000), latitude DECIMAL NOT NULL, longtitude DECIMAL NOT NULL, feature_class CHAR(1), feature_code VARCHAR(10), country_code CHAR(2), cc2 VARCHAR(60), admin1_code VARCHAR(20), admin2_code VARCHAR(80), admin3_code VARCHAR(20), admin4_code VARCHAR(20), population BIGINT, elevation INTEGER, gtopo30 INTEGER, timezone VARCHAR(100), modification_date DATE ); CREATE TABLE iso_languagecodes ( iso_639_3 CHAR(3), iso_639_2 VARCHAR(10), iso_639_1 CHAR(2) UNIQUE, language_name VARCHAR(200) ); CREATE TABLE countryinfo ( iso CHAR(2), iso3 CHAR(3), iso_numeric INTEGER, fips VARCHAR(3), name VARCHAR(256), capital VARCHAR(256), area NUMERIC(32,8), population INTEGER, continent CHAR(2), tld CHAR(8), currencyCode CHAR(3), currencyName CHAR(20), Phone CHAR(32), postalCodeFormat CHAR(128), postalCodeRegex CHAR(256), languages VARCHAR(256), geonameId INTEGER UNIQUE, neighbours CHAR(256), equivalentFipsCode CHAR(16) );
Now is time for the script which will do clean import of the current Geonames data. I would only remark here, that the way psql commands are working in this script needs local assess from trusted user to be configured in Postgres.
#!/bin/sh # base url we get files from BASE_URL='http://download.geonames.org/export/dump/' # define files we wanna to get FILE_LIST="allCountries.zip readme.txt countryInfo.txt iso-languagecodes.txt" # pg database definitions DB_NAME=siteapp DB_CMD="psql -h localhost -d $DB_NAME" DB_IMPORT_STATEMENT="COPY %s FROM STDIN WITH DELIMITER AS E'\t' NULL AS ''" # getting current date TODAY_DATE=`date +%F` #create download dir based on the current date mkdir -p "$TODAY_DATE" && cd "$TODAY_DATE" # download files for FILE in $FILE_LIST do wget -c "$BASE_URL/$FILE" done # drop and create the table structure $DB_CMD < ../pg_struct.sql # unpack and import iso language codes into db DATAFILE=iso-languagecodes.txt LINE_COUNT=`cat $DATAFILE | wc -l`; let "LINE_COUNT=$LINE_COUNT-1" tail -$LINE_COUNT $DATAFILE | tr -d '\r' | $DB_CMD \ -c"`printf \"$DB_IMPORT_STATEMENT\" iso_languagecodes`" # unpack and import all data into db DATAFILE=allCountries.txt rm -fr $DATAFILE unzip allCountries.zip $DATAFILE cat $DATAFILE | tr -d '\r' | $DB_CMD -c"`printf \"$DB_IMPORT_STATEMENT\" geoname`" rm $DATAFILE # unpack and import country info into db DATAFILE=countryInfo.txt tail -248 $DATAFILE | tr -d '\r' | $DB_CMD -c"`printf \"$DB_IMPORT_STATEMENT\" countryinfo`" cd .. exit 0
The next script is needed to perform daily updates because Geonames issue updates daily. In fact this script should be run as a cronjob.
#!/bin/sh # make yesterday date YESTERDAY=`date --date='1 day ago' +%F` # define delta filenames MODS_FILE="modifications-$YESTERDAY.txt" DELS_FILE="deletes-$YESTERDAY.txt" FILE_LIST="$MODS_FILE $DELS_FILE" # the base url we get the data files from BASE_URL='http://download.geonames.org/export/dump/' # database options DB_NAME=siteapp DB_CMD="psql -h localhost -d $DB_NAME" DB_IMPORT_STATEMENT="COPY %s FROM STDIN WITH DELIMITER AS E'\t' NULL AS ''" # generate today date TODAY_DATE=`date +%F` # make dir for data download mkdir -p "$TODAY_DATE" && cd "$TODAY_DATE" # download required files for FILE in $FILE_LIST do wget -c "$BASE_URL/$FILE" done # first delete deletes cat $DELS_FILE | cut -f1 | while read ID do $DB_CMD -c"DELETE FROM geoname WHERE geonameid = $ID" done # first delete the changed records cat $MODS_FILE | cut -f1 | while read ID do $DB_CMD -c"DELETE FROM geoname WHERE geonameid = $ID" done # then import all the changes into db cat $MODS_FILE | tr -d '\r' | $DB_CMD -c"`printf \"$DB_IMPORT_STATEMENT\" geoname`" cd .. exit 0
Note that I'm not importing all the data files provided but the full database. If you need more data such timezones or alternative names available from Geonames just extend SQL and bash scripts.
Now enjoy and start using Geonames data in your project
Kind of Regards
Display comments as (Linear | Threaded)
I need help, i want to obtain de addres of a given geoname object. I'v downloaded de data base and i have it in my laptop, but i need the addres of each one of those points.
Can you help me?
thank you for this helpful "how-to".
So I need for have only a list of continent -> state -> region -> province -> city
All others datas (such as latitude, population, etc...) are needless.
Please, can you help me?
I think you can import the country specific database or AllCountries database. The second one is much bigger than individual countries database. If you want to use only some countries then importing individually will be enough. For Regions notice that it's coded under featuredCode as RGN, for cities and towns usually it's as PPL. More info here: http://weblatam.com/wp/base-de-datos-de-localidades-del-mundo/