Russian geocoding support in Nominatim
|Reported by:||Zibrnbernstein||Owned by:||geocoding@…|
As of now, Russian geocoding support in Nominatim is totally broken. I'm filing this meta-ticket to track progress on individual tickets and to gather relevant information.
The background is that I've tried to conduct a sociological study that involved computing coordinates for hundreds of thousands of addresses. For that, I planned to deploy a local Nominatim instance, but it turned out that it simply doesn't work for most addresses. For now, I resort to the geocoding API of Yandex (Russia's #1 search engine), which works like a charm but is not suitable for bulk queries. Another point is that desktop applications are being developed (GNOME Maps, for example) that use the geocode-glib library, which in turn uses the Nominatim API internally.
The problem is that Russian address nomenclature is very diverse and informal. Here is a brief summary; if needed, I can create a wiki article on that.
1) The "street" term includes not only "улица" (a street proper), but also "переулок" (side-street), "проезд" (passage), "проспект" (avenue), "шоссе" (highway), "тупик" (cul-de-sac), "мост" (bridge), "площадь" (square) and some others. These are used in full or abbreviated form ("улица" -> "ул.", "проспект" -> "пр-т"), and can be either appended or prepended to the name. Sometimes "Большой" (major) or "Малый" (minor) is part of the name, and the word order is arbitrary. Thus, "Большой Ордынский пер." and "Ордынский Большой переулок" refer to the same street. #4703
Examples: "ул. Арбат", "Красная Площадь", "Филиппов пер.", "Энтузиастов шоссе"
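A minimal sketch of how street names like these could be reduced to an order-independent key. The `STREET_TYPES` table, the `SIZE_WORDS` set and the `canonicalize_street` function are hypothetical names made up for illustration; they are not part of Nominatim, and the tables cover only the words mentioned above.

```python
# Hypothetical lookup table: full or abbreviated street-type word -> full form.
STREET_TYPES = {
    "улица": "улица", "ул": "улица",
    "переулок": "переулок", "пер": "переулок",
    "проезд": "проезд",
    "проспект": "проспект", "пр-т": "проспект",
    "шоссе": "шоссе",
    "тупик": "тупик",
    "мост": "мост",
    "площадь": "площадь",
}

# Assumption: only the masculine forms quoted in this ticket are listed;
# real data would also need the feminine and neuter forms.
SIZE_WORDS = {"большой", "малый"}


def canonicalize_street(raw):
    """Return (street_type, size_words, name_words) regardless of word order."""
    tokens = [t.rstrip(".").lower() for t in raw.replace(",", " ").split()]
    street_type = None
    size = []
    name = []
    for token in tokens:
        if not token:
            continue
        if token in STREET_TYPES and street_type is None:
            # The first street-type word found is taken as the type.
            street_type = STREET_TYPES[token]
        elif token in SIZE_WORDS:
            size.append(token)
        else:
            name.append(token)
    return (street_type, tuple(sorted(size)), tuple(name))


same = (canonicalize_street("Большой Ордынский пер.")
        == canonicalize_street("Ордынский Большой переулок"))
```

Here `same` is true because both spellings collapse to the same tuple. The sketch keeps the order of the remaining name words, so multi-word names with shuffled words would still need extra handling.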
2) The building number nomenclature is also very diverse. Usually, there is a top-level prefix: "дом" (house) or "владение" (property), followed by the main number. These prefixes can be abbreviated as "д." or "вл." or even omitted. #4647
Besides the main number, there can also be letter indexes, various sub-numbers and combinations of those:
- a letter index is a letter (usually "а", "б", "в") appended to the building number without a space;
- a sub-building is either a "строение" or a "корпус". These are similar, but not interchangeable. They can be spelled in full ("дом 1 строение 2") or abbreviated in different ways: "д. 1 стр. 2", "д. 1с2", "3 корп. 1", "3к1". As you can see, there is a short form ("стр. 3") and a one-letter form ("с3"); both the period and the space can be omitted when appending it to the main number. Moreover, a sub-building number can have a letter index itself;
- finally, the slash syntax is used when the building has a dual address. For example, a building on the corner of two streets can be addressed as both "ул. Малая Ордынка 30" and "Большой Ордынский пер., 6с1", while the full address is "Малая Ордынка 30/6с1".
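The building-number variants listed above could be parsed roughly as follows. This is only a sketch under assumptions: `HOUSE_RE`, `PREFIXES`, `SUB_TYPES` and `parse_house_number` are hypothetical names, the pattern covers only the spellings quoted in this ticket, and an omitted prefix is left as `None` rather than guessed.

```python
import re

# Hypothetical tables mapping every spelling quoted above to its full form.
PREFIXES = {"дом": "дом", "д": "дом", "владение": "владение", "вл": "владение"}
SUB_TYPES = {
    "строение": "строение", "стр": "строение", "с": "строение",
    "корпус": "корпус", "корп": "корпус", "к": "корпус",
}

HOUSE_RE = re.compile(
    r"""^\s*
    (?:(?P<prefix>дом|д|владение|вл)\.?\s*)?
    (?P<number>\d+)
    (?P<letter>[а-я])?
    (?:/(?P<second>\d+)(?P<second_letter>[а-я])?)?
    (?:\s*(?P<sub_type>строение|стр|с|корпус|корп|к)\.?\s*
       (?P<sub_number>\d+)(?P<sub_letter>[а-я])?)?
    \s*$""",
    re.VERBOSE,
)


def parse_house_number(raw):
    """Split a building number into its parts, or return None if unrecognized."""
    match = HOUSE_RE.match(raw.lower())
    if match is None:
        return None
    g = match.groupdict()
    return {
        "prefix": PREFIXES.get(g["prefix"]),          # None when omitted
        "number": int(g["number"]),
        "letter": g["letter"],                        # e.g. "а" in "10а"
        "second_number": int(g["second"]) if g["second"] else None,
        "second_letter": g["second_letter"],
        "sub_type": SUB_TYPES.get(g["sub_type"]),     # "строение" or "корпус"
        "sub_number": int(g["sub_number"]) if g["sub_number"] else None,
        "sub_letter": g["sub_letter"],
    }


full = parse_house_number("дом 1 строение 2")
short = parse_house_number("д. 1с2")
```

With this pattern, `full` and `short` come out as equal dictionaries. A trailing "с" or "к" followed by digits is read as a sub-building marker, while a bare trailing letter ("10а") is read as a letter index; whether that tie-break matches real data is an assumption.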
3) Rarely, ranges are used as building numbers. For example, there is one single building with the address "Лесная ул., 10-16". This means that this building should be a hit for requests like "Лесная, 12" or "Лесная, 14" (but not "Лесная, 11", since a street usually has an even side and an odd side).
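The range rule can be sketched as a simple parity check. `range_covers` is a hypothetical helper, and it assumes that both ends of the range lie on the same side of the street (same parity), as in the "10-16" example.

```python
def range_covers(building_number, requested):
    """Check whether a requested house number hits a building number or range.

    Assumption: a range like "10-16" covers only numbers with the same
    parity as its lower bound, because even and odd numbers usually lie
    on opposite sides of the street.
    """
    if "-" not in building_number:
        return int(building_number) == requested
    low, high = (int(part) for part in building_number.split("-", 1))
    return low <= requested <= high and requested % 2 == low % 2


hits = [n for n in range(9, 18) if range_covers("10-16", n)]
```

For the example above, `hits` contains 10, 12, 14 and 16, so "Лесная, 12" and "Лесная, 14" match while "Лесная, 11" does not.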
As a solution, I can imagine some code that canonicalizes the requested address. For this to work, all the Russian addresses in OSM will need to be canonicalized, too (probably, with the help of the same code).