This repository has been archived by the owner on Jul 24, 2021. It is now read-only.

Russian geocoding support in Nominatim #5015

Open
openstreetmap-trac opened this issue Jul 23, 2021 · 3 comments

Comments

@openstreetmap-trac

Reporter: Zibrnbernstein
[Submitted to the original trac issue database at 3.45pm, Friday, 18th October 2013]

As of now, Russian geocoding support in Nominatim is totally broken. I'm filing this meta-ticket to track progress on individual tickets and to gather relevant information.

The background is that I tried to conduct a sociological study that involved computing coordinates for hundreds of thousands of addresses. For that, I planned to deploy a local Nominatim instance, but it turned out that for most addresses it simply doesn't work. For now, I resort to the geocoding API of Yandex (Russia's #1 search engine), which works like a charm but is not suitable for bulk queries. Another point is that desktop applications are being developed (GNOME Maps, for example) that use the geocode-glib library, which in turn uses the Nominatim API.

The problem is that Russian address nomenclature is very diverse and informal. Here is a brief summary; if needed, I can create a wiki article on it.

  1. The "street" term includes not only "" (a street proper), but also "" (side-street), "" (passage), "" (avenue), "" (highway), "" (cul-de-sac), "" (bridge), "" (square) and some others. These are used in full or abbreviated form ("" -> ".", "" -> "-"), and can be either appended or prepended to the name. Sometimes "" (major) or "" (minor) is part of the name, and the word order is arbitrary. Thus, " ." and " " refer to the same street. Related ticket: #4703 (when searching for Russian street names, transposition of the word "street" and the distinctive name should be allowed).

Examples: ". ", " ", " .", " "

  2. The building number nomenclature is also very diverse. Usually, there is a top-level prefix: "" (house) or "" (property), followed by the main number. These prefixes can be abbreviated as "." or "." or even omitted. Related ticket: #4647 (search format for houses and buildings (and korpus) in Russian).

Besides the main number, there can also be letter indexes, various sub-numbers and combinations of those:

  • letter index is a letter (usually "", "", "") appended to the building number without a space;
  • sub-building is either a "" or "". These are similar, but not interchangeable. They can be spelled in full (" 1 2") or abbreviated in different ways: ". 1 . 2", ". 12", "3 . 1", "31". As you can see, there are a short form (". 3") and a one-letter form ("3"); both the period and the space can be omitted when appending it to the main number. Moreover, a sub-building number can have a letter index itself;
  • finally, the slash syntax is used when the building has a dual address. For example, a building on the corner of two streets can be addressed as both ". 30" and " ., 61", while the full address is " 30/61".
  3. Occasionally, ranges are used as building numbers. For example, there is a single building with the address " ., 10-16". This means that this building should be a hit for requests like ", 12" or ", 14" (but not ", 11", since a street usually has an even side and an odd side).

  4. The "е" (ie) and "ё" (yo) letters should be treated as identical; the queries should be case insensitive. Related tickets: #2467 (search must match both Russian ie and yo insensitive to them), #4819 (Russian "е" and "ё" in names), #2758 (Search street is case sensitive in Russian).
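The range rule described above (a building numbered 10-16 matching requests for 12 and 14, but not 11) could be checked roughly as follows. This is a minimal sketch, not Nominatim code: the function name `number_in_range` is hypothetical, and it assumes that both ends of a range lie on the same side of the street, so only numbers with the same parity as the lower bound count as hits.

```python
def number_in_range(building: str, requested: int) -> bool:
    """Return True if `requested` is covered by a building number such as "10-16".

    Assumption: a range only covers numbers with the same parity as its
    lower bound, because even and odd numbers sit on opposite sides of
    the street. A plain number like "10" is treated as the range 10-10.
    """
    low_text, sep, high_text = building.partition("-")
    low = int(low_text)
    high = int(high_text) if sep else low
    same_side = requested % 2 == low % 2
    return low <= requested <= high and same_side


if __name__ == "__main__":
    for candidate in (11, 12, 14, 18):
        print(candidate, number_in_range("10-16", candidate))
```

Streets without the even/odd convention would need the parity check switched off, which is why it is kept as a separate condition.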

As a solution, I can imagine some code that canonicalizes the requested address. For this to work, all the Russian addresses in OSM will need to be canonicalized, too (probably, with the help of the same code).
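To make the idea concrete, here is a minimal sketch of such a canonicalizer for street names. It is not how Nominatim normalizes names; the function name `canonicalize_street` and the small `STREET_TYPES` table are hypothetical, and the table lists only a few common abbreviations (ул., пр-т, пер.) as an illustration. The sketch lowercases the input, folds "ё" into "е", expands the street-type word, and moves it to the end so that word order no longer matters.

```python
import unicodedata

# Illustrative subset only: abbreviation or full form -> canonical street type.
STREET_TYPES = {
    "улица": "улица", "ул": "улица",
    "проспект": "проспект", "пр-т": "проспект",
    "переулок": "переулок", "пер": "переулок",
}


def canonicalize_street(name: str) -> str:
    """Build a comparison key for a Russian street name.

    The same function would have to be applied both to the query and to
    the names stored in the database for the keys to match.
    """
    # NFC makes sure "ё" is a single code point before it is replaced.
    text = unicodedata.normalize("NFC", name).casefold().replace("ё", "е")
    tokens = [token.strip(".,") for token in text.split()]
    tokens = [token for token in tokens if token]
    kinds = [STREET_TYPES[token] for token in tokens if token in STREET_TYPES]
    rest = [token for token in tokens if token not in STREET_TYPES]
    # Sorting the remaining words makes the key independent of word order.
    return " ".join(sorted(rest) + kinds[:1])


if __name__ == "__main__":
    print(canonicalize_street("ул. Зелёная"))
    print(canonicalize_street("Зеленая улица"))
```

Both calls produce the same key, so a query in either form would match a stored name in either form. Abbreviated adjectives such as "major" and "minor" would need a second lookup table of the same kind.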

@openstreetmap-trac

Author: lonvia
[Added to the original trac issue at 6.04am, Sunday, 20th October 2013]

Regarding 1): you can help with that by listing the frequent Russian abbreviations that are still missing in Nominatim. You'll find a list of the abbreviations currently used in the code: https://github.com/twain47/Nominatim/blob/master/module/tokenstringreplacements.inc. The list has the abbreviations after transliteration, but you don't need to worry about that; just list them in Cyrillic.

@openstreetmap-trac

Author: Zibrnbernstein
[Added to the original trac issue at 1.52am, Monday, 21st October 2013]

In the code I've found only one abbreviation relevant to Russian ("ulitsa/ulica", meaning "street"). BTW, it would be much simpler to find abbreviations if they were grouped by language/country.

So, here we go, these are common terms and their abbreviations:

: , , -
: 
: 
: 
: , -
: -
: 
: 
: 

There are also six adjectives (meaning "major/minor/old/new/upper/lower") that are commonly abbreviated when writing a street address (gender forms included):

, , , : , 
, , , : , 
, , , : 
, , , : 
, , , : 
, , , : , 

Regarding house/building numbers, the following terms are in use:

: 
: 
: , 
: , 

@openstreetmap-trac

Author: Zibrnbernstein
[Added to the original trac issue at 1.56am, Monday, 21st October 2013]

Replying to lonvia (comment 1):

As for 2) and 3), I've just created a simple but working ANTLR grammar to parse house numbers given in the described form. I can share it if you consider it useful. Currently, ANTLR generates Java and C# parsers; I can create a C/C++ parser if needed.
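For readers who want a feel for what such a parser has to accept, here is a rough regular-expression sketch in Python. It is not the ANTLR grammar mentioned above, which is not included in this ticket. The names `HOUSE_RE` and `parse_house_number` are hypothetical, and the prefixes are an assumption based on common Russian usage: дом/д. (house), владение/вл. (property), корпус/корп./к and строение/стр./с for the two kinds of sub-building.

```python
import re
from typing import Optional

# Assumed vocabulary; a real grammar would likely cover more spellings.
HOUSE_RE = re.compile(
    r"""
    (?:дом|д\.?|владение|вл\.?)?\s*       # optional top-level prefix
    (?P<number>\d+)(?P<letter>[а-я])?      # main number, optional letter index
    (?:/(?P<second>\d+[а-я]?))?            # second number of a dual address
    (?:[\s,]*(?:корпус|корп\.?|к\.?)\s*(?P<korpus>\d+[а-я]?))?
    (?:[\s,]*(?:строение|стр\.?|с\.?)\s*(?P<stroenie>\d+[а-я]?))?
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_house_number(text: str) -> Optional[dict]:
    """Split a house number into its parts, or return None if it does not parse."""
    match = HOUSE_RE.fullmatch(text.strip())
    if match is None:
        return None
    # Keep only the parts that were present, lowercased for comparison.
    return {key: value.lower() for key, value in match.groupdict().items() if value}


if __name__ == "__main__":
    for sample in ("д. 12", "дом 5а корпус 2", "3к1", "30/61", "д. 1 стр. 2"):
        print(sample, "->", parse_house_number(sample))
```

Because the whole string must match, "3к1" is read as house 3, korpus 1 rather than as house 3 with letter index "к". A dedicated grammar would make ambiguities like this explicit instead of relying on regex backtracking.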
