Opened 3 years ago

Last modified 3 years ago

#5015 new defect

Russian geocoding support in Nominatim

Reported by: Zibrnbernstein
Owned by: geocoding@…
Priority: major
Milestone:
Component: nominatim
Version:
Keywords:
Cc:

Description

As of now, Russian geocoding support in Nominatim is totally broken. I'm filing this meta-ticket to track progress on individual tickets and to gather relevant information.

The background is that I tried to conduct a sociological study that involved computing coordinates for hundreds of thousands of addresses. For that, I planned to deploy a local Nominatim instance, but it turned out that it simply doesn't work for most addresses. For now, I resort to the Yandex (Russia's #1 search engine) geocoding API, which works like a charm but is not suitable for bulk queries. Another point is that desktop applications are being developed on top of the geocode-glib library (GNOME Maps, for example), which in turn uses the Nominatim API.

The problem is that Russian address nomenclature is very diverse and informal. Here is a brief summary; if needed, I can create a wiki article on that.

1) The "street" term includes not only "улица" (a street proper), but also "переулок" (side street), "проезд" (passage), "проспект" (avenue), "шоссе" (highway), "тупик" (cul-de-sac), "мост" (bridge), "площадь" (square) and some others. These are used in full or abbreviated form ("улица" -> "ул.", "проспект" -> "пр-т"), and can be either appended or prepended to the name. Sometimes "Большой" (major) or "Малый" (minor) is part of the name, and the word order is arbitrary. Thus, "Большой Ордынский пер." and "Ордынский Большой переулок" refer to the same street. #4703

Examples: "ул. Арбат", "Красная Площадь", "Филиппов пер.", "Энтузиастов шоссе"

2) The building number nomenclature is also very diverse. Usually, there is a top-level prefix: "дом" (house) or "владение" (property), followed by the main number. These prefixes can be abbreviated as "д." or "вл.", or even omitted. #4647

Besides the main number, there can also be letter indexes, different sub-numbers and combinations of those:

  • letter index is a letter (usually "а", "б", "в") appended to the building number without a space;
  • sub-building is either a "строение" or a "корпус". These are similar, but not interchangeable. They can be spelled in full ("дом 1 строение 2") or abbreviated in different ways: "д. 1 стр. 2", "д. 1с2", "3 корп. 1", "3к1". As you can see, there are a short form ("стр. 3") and a one-letter form ("с3"); both the period and the space can be omitted when appending it to the main number. Moreover, a sub-building number can have a letter index itself;
  • finally, the slash syntax is used when the building has dual address. For example, a building on the corner of two streets can be addressed as both "ул. Малая Ордынка 30" and "Большой Ордынский пер., 6с1", while full address is "Малая Ордынка 30/6с1".
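The forms listed above can be sketched as a single regular expression. This is only a minimal illustration, not Nominatim code and not the ANTLR grammar mentioned later in this ticket: the names `HouseNumber` and `parse_house_number` are hypothetical, the pattern assumes "корпус" comes before "строение" when both are present, and it does not decide which side of a slash address a sub-building belongs to.

```python
import re
from typing import NamedTuple, Optional


class HouseNumber(NamedTuple):
    """Hypothetical container for the parts of a house number."""
    number: str               # main number, with letter index if any ("12а")
    second: Optional[str]     # number after the slash in a dual address
    korpus: Optional[str]     # "корпус" sub-building
    stroenie: Optional[str]   # "строение" sub-building


_PATTERN = re.compile(
    r"""
    ^\s*
    (?:(?:дом|д|владение|вл)\.?\s*)?                                  # optional prefix
    (?P<number>\d+[а-я]?)                                             # main number
    (?:/(?P<second>\d+[а-я]?))?                                       # dual address
    (?:\s*(?:корпус|корп|к)\.?\s*(?P<korpus>\d+[а-я]?))?              # корпус
    (?:\s*(?:строение|стр|с)\.?\s*(?P<stroenie>\d+[а-я]?))?           # строение
    \s*$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_house_number(text: str) -> Optional[HouseNumber]:
    """Return the parsed parts, or None if the text has an unknown shape."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    parts = {key: (value.lower() if value else None)
             for key, value in match.groupdict().items()}
    return HouseNumber(**parts)
```

With this sketch, "дом 1 строение 2", "д. 1 стр. 2" and "д. 1с2" all parse to the same value, as do "3 корп. 1" and "3к1". A trailing letter is ambiguous ("1к" could be a letter index or a truncated "корпус"); the pattern treats it as a letter index.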

3) Occasionally, ranges are used as building numbers. For example, there is one single building with the address "Лесная ул., 10-16". This means that this building should be a hit for requests like "Лесная, 12" or "Лесная, 14" (but not "Лесная, 11", since a street usually has an even side and an odd side).
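A minimal sketch of that matching rule, assuming the range is written as two plain integers joined by a hyphen and that both ends lie on the same side of the street (the function name `in_range` is hypothetical):

```python
def in_range(building: str, query: int) -> bool:
    """True if `query` is covered by a building number such as "10-16".

    Assumption: the even/odd rule holds, so only numbers with the same
    parity as the lower bound are on this side of the street.
    """
    low_text, sep, high_text = building.partition("-")
    low = int(low_text)
    high = int(high_text) if sep else low   # a plain "10" is the range 10..10
    return low <= query <= high and query % 2 == low % 2
```

For "10-16" this accepts 12 and 14 and rejects 11 and 18.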

4) The "е" (ie) and "ё" (yo) letters should be treated as identical; the queries should be case insensitive. #2467 #4819 #2758
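As an illustration, both requirements can be met by folding the text into a comparison key before matching. This is a sketch with a hypothetical function name `fold`, not the way Nominatim normalizes names internally:

```python
import unicodedata


def fold(text: str) -> str:
    """Comparison key: case-insensitive, with "ё" treated as "е"."""
    # NFC first, so that "е" + combining diaeresis becomes a single "ё".
    composed = unicodedata.normalize("NFC", text)
    return composed.casefold().replace("ё", "е")
```

Two spellings then match whenever their keys are equal, for example `fold("Берёзовая") == fold("БЕРЕЗОВАЯ")`.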

As a solution, I can imagine some code that canonicalizes the requested address. For this to work, all the Russian addresses in OSM will need to be canonicalized, too (probably, with the help of the same code).

Change History (3)

comment:1 follow-up: Changed 3 years ago by lonvia

Regarding 1): you can help with that by listing the frequent Russian abbreviations that are still missing in Nominatim. You'll find a list of the abbreviations currently used here in the code. The list has the abbreviations after transliteration but you don't need to worry about that, just list them in Cyrillic.

comment:2 Changed 3 years ago by Zibrnbernstein

In the code I've found only one abbreviation relevant to Russian ("ulitsa/ulica", meaning "street"). BTW, it would be much simpler to find abbreviations if they were grouped by language/country.

So, here we go, these are common terms and their abbreviations:

бульвар: бульв, бул, б-р
набережная: наб
переулок: пер
площадь: пл
проезд: пр, пр-д
проспект: пр-т
тупик: туп
улица: ул
шоссе: ш

There are also six adjectives (meaning "major/minor/old/new/upper/lower") that are commonly abbreviated in street addresses (gender forms included):

большой, большая, большое, большие: б, бол
малый, малая, малое, малые: м, мал
старый, старая, старое, старые: ст
новый, новая, новое, новые: нов
верхний, верхняя, верхнее, верхние: верх
нижний, нижняя, нижнее, нижние: ниж, нижн
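To show how these two lists could feed the canonicalization idea from the description, here is a rough sketch that expands the abbreviations and ignores word order. The names `STREET_TERMS`, `ADJECTIVES` and `street_key` are hypothetical. Mapping every gender form to the masculine one is a simplification made only for comparison, and very short abbreviations such as "м" or "ст" may clash with other words in real data.

```python
import re

# Abbreviation -> full street term, as listed above.
STREET_TERMS = {
    "бульв": "бульвар", "бул": "бульвар", "б-р": "бульвар",
    "наб": "набережная", "пер": "переулок", "пл": "площадь",
    "пр": "проезд", "пр-д": "проезд", "пр-т": "проспект",
    "туп": "тупик", "ул": "улица", "ш": "шоссе",
}

# Canonical adjective -> gender forms and abbreviations, as listed above.
ADJECTIVES = {
    "большой": ("большая", "большое", "большие", "б", "бол"),
    "малый": ("малая", "малое", "малые", "м", "мал"),
    "старый": ("старая", "старое", "старые", "ст"),
    "новый": ("новая", "новое", "новые", "нов"),
    "верхний": ("верхняя", "верхнее", "верхние", "верх"),
    "нижний": ("нижняя", "нижнее", "нижние", "ниж", "нижн"),
}

# One lookup table: any known variant -> canonical token.
_CANONICAL = dict(STREET_TERMS)
for canonical, variants in ADJECTIVES.items():
    for variant in variants:
        _CANONICAL[variant] = canonical


def street_key(name: str) -> tuple:
    """Order-insensitive key for a street name with abbreviations expanded."""
    # Words may contain hyphens ("пр-т"); periods and commas are dropped.
    tokens = re.findall(r"[\w-]+", name.casefold())
    return tuple(sorted(_CANONICAL.get(token, token) for token in tokens))
```

Under these assumptions, "Большой Ордынский пер." and "Ордынский Большой переулок" produce the same key, as do "ул. Арбат" and "Арбат улица".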

Regarding house/building numbers, the following terms are in use:

дом: д
владение: вл
строение: стр, с
корпус: корп, к

Last edited 3 years ago by Zibrnbernstein

comment:3 in reply to: ↑ 1 Changed 3 years ago by Zibrnbernstein

Replying to lonvia:

As for 2) and 3), I've just created a simple but working ANTLR grammar to parse house numbers given in the described form. I can share it if you consider it useful. Currently, ANTLR generates Java and C# parsers; I can create a C/C++ parser if needed.
