What is France?

Stupid question, isn’t it? You can point to it on a map, it’s where croissants come from, and you can even go on vacation there to verify it’s a real place.

But in the context of evaluating data that refers to France (and quite a few other countries) there are a few things to consider.

Language vs. Country

Depending on your organizational setup, you might group units and departments by language, which makes a lot of sense from a human-resources perspective. So the team dedicated to managing France might also be responsible for the French-speaking parts of Belgium and Switzerland.

Now I can hear the slapping of foreheads, and you, dear readers, saying: “Why don’t you just use fr-BE, fr-CH and fr-FR?” Let me answer like this:

  1. I do whenever I can.
  2. Even large players like the Google Play Store and the Apple App Store do not abide by this convention: one reports reviews by country, the other by language.

This makes it very hard to map efforts and gains between geographical locations and the organizational endeavors associated with them.
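To make the mismatch concrete, here is a minimal Python sketch that splits a tag like fr-BE into its language and country parts and sums the same numbers both ways. The review counts and the function names are invented for illustration, and the sketch assumes simple two-part tags (no script subtags such as zh-Hans-CN).

```python
from collections import defaultdict

def split_locale(tag):
    """Split a tag like 'fr-BE' into (language, country).

    Assumes simple 'language-country' tags; country is None if absent.
    """
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    country = parts[1].upper() if len(parts) > 1 else None
    return language, country

# Hypothetical review counts per locale, made up for this example.
reviews = {"fr-FR": 120, "fr-BE": 30, "fr-CH": 15, "de-CH": 40, "nl-BE": 25}

def totals_by(counts, index):
    """Sum counts by language (index=0) or by country (index=1)."""
    totals = defaultdict(int)
    for tag, count in counts.items():
        totals[split_locale(tag)[index]] += count
    return dict(totals)

by_language = totals_by(reviews, 0)  # what a language-based team sees
by_country = totals_by(reviews, 1)   # what a country-based report sees
```

With these made-up numbers the "French" team owns 165 reviews, while the country France only accounts for 120 of them. A source that reports only one of the two views cannot be converted into the other without the full tag.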

If it has a French Flag, is it France?

There are some French overseas territories, such as Martinique and Guadeloupe, which are somewhat France. They do not, however, carry the country code of France: they have their own ISO codes (MQ and GP), so sorting them under France requires prior knowledge of this relationship.
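One way to encode that prior knowledge is a small roll-up table that maps a territory's own code to the market it should be reported under. The sketch below is an assumption about how this could look in Python; the table name, the function name and the choice of entries are illustrative, and which territories you roll up is a business decision.

```python
# Assumed roll-up table: ISO 3166-1 alpha-2 code of a territory -> code of the
# market it is reported under. The entries are examples, not a complete list.
TERRITORY_TO_MARKET = {
    "GP": "FR",  # Guadeloupe
    "MQ": "FR",  # Martinique
    "RE": "FR",  # Réunion
}

def market_for(country_code):
    """Return the reporting market for a code; unknown codes map to themselves."""
    code = country_code.strip().upper()
    return TERRITORY_TO_MARKET.get(code, code)
```

Keeping this mapping in one place means the decision is documented and can be changed without touching every report.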

Changing country names

Countries change their names. Do not rely on them staying the same. Why does it even matter? Some sources (heavy side-eye at Firebase Analytics) use country names instead of ISO codes, meaning you have to match country names to country codes. Your data sources might also adopt a new name at different times. And even when a country has not actually changed its name for political or PR reasons, a source might still willy-nilly change how it writes the same name, using “United States of America” one day and “America, United States of” the next. On top of that, a source that uses names instead of ISO codes might deliver them in a language other than English.

How to alleviate this issue?

A partial solution to these issues is a translation table in which one column contains a variation of a country’s name and another column contains the two-letter ISO code (ISO 3166-1 alpha-2) for that country. Then simply list all the different variations of all country names in the table. Take extra care that each variation appears only once.
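A minimal sketch of such a translation table in Python could look like this. The rows, the normalization rules and the function names are assumptions for illustration; the normalization (case, accents, punctuation, whitespace) is one possible way to keep trivially different spellings from becoming separate entries.

```python
import unicodedata

def normalize(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.casefold())
    return " ".join(cleaned.split())

# Example rows (variation, ISO 3166-1 alpha-2 code); a real table is much longer.
ROWS = [
    ("France", "FR"),
    ("Frankreich", "FR"),
    ("United States of America", "US"),
    ("America, United States of", "US"),
    ("Côte d'Ivoire", "CI"),
]

def build_lookup(rows):
    """Build {normalized variation: code}; reject a variation with two codes."""
    lookup = {}
    for name, code in rows:
        key = normalize(name)
        if lookup.get(key, code) != code:
            raise ValueError(f"{name!r} is mapped to both {lookup[key]} and {code}")
        lookup[key] = code
    return lookup

def to_code(name, lookup):
    """Return the code for a country name, or None if the variation is unknown."""
    return lookup.get(normalize(name))

LOOKUP = build_lookup(ROWS)
```

Returning None for unknown names, rather than guessing, makes new variations visible so they can be added to the table deliberately.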

Now, building this list is easier said than done. To be honest, I haven’t really found a great way to build such a table and must admit that I built mine manually over the course of two days.

Then I established a process for when new data sources - and possibly new creative ways to spell a tiny country - are introduced into the system: new variations are deduplicated and matched by Levenshtein distance against all the recorded variations, eliminating the need for research.
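The matching step can be sketched as follows. This is a plain textbook Levenshtein implementation in Python, not the code from my process; the function names, the example table and the threshold of 2 are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def suggest(name, known, max_distance=2):
    """Return (variation, code, distance) for close known variations, nearest first."""
    needle = name.casefold().strip()
    hits = []
    for variation, code in known.items():
        distance = levenshtein(needle, variation.casefold())
        if distance <= max_distance:
            hits.append((distance, variation, code))
    return [(variation, code, distance) for distance, variation, code in sorted(hits)]

# Hypothetical excerpt of already recorded variations.
KNOWN = {"Switzerland": "CH", "Swaziland": "SZ", "Sweden": "SE"}
candidates = suggest("Switzerlnd", KNOWN)
```

Here `candidates` contains only the Switzerland entry. Short names can be very close to each other (Iran and Iraq differ by one edit, Niger and Nigeria by two), so it seems safer to treat the result as a suggestion to confirm than as an automatic match.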

Obviously, there are other countries where the same or similar issues need to be taken into account. Without being political, and indeed somewhat dismissing post-colonial strife, it might make sense to define your markets according to similar cultural and linguistic environments.