Opened 7 years ago

Last modified 7 years ago

#4127 new defect

Non-ASCII UTF-8 characters in GPX trace names are converted to '_' on upload

Reported by: one_half_3544 Owned by: rails-dev@…
Priority: minor Milestone:
Component: website Version:
Keywords: trace, gpx, utf-8, localization Cc:

Description

I'm uploading a couple of GPX traces with Russian names. All the non-ASCII characters are converted to underscores on upload, so I have to repeat the name in the description (where UTF-8 characters seem to be OK).
It should be no problem to support UTF-8 in filenames, I guess.

Change History (6)

comment:1 Changed 7 years ago by TomH

I think you might guess wrong...

I doubt we are doing anything to the name, so most likely this is an issue with how the browser sends the name and how rails interprets what it receives. Character encoding of form data is quite a sticky issue as there is no clear way for the browser to indicate what encoding it has used for the data it sends.

There is an extra complication when sending a file: the browser may not even know what the encoding of the filename is. On Windows it's not an issue, but on any Unix system there is generally no way to know what encoding a filename uses.

comment:2 Changed 7 years ago by one_half_3544

Hm. Don't you think assuming utf-8 by default would be a good idea?

Running tcpdump while uploading (this trace http://www.openstreetmap.org/user/one_half_3544/traces/1149633 ) shows that the browser (Firefox) at least lists utf-8 in Accept-Charset:

19:42:31.827995 IP (tos 0x0, ttl 64, id 46125, offset 0, flags [DF], proto TCP (6), length 1500)
    192.168.1.35.50499 > soup.osm.ichosted.org.uk.www: Flags [.], cksum 0x6b4f (correct), seq 1:1449, ack 1, win 115, options [nop,nop,TS val 6163510 ecr 795011323], length 1448
        0x0000:  4500 05dc b42d 4000 4006 b280 c0a8 0123  E....-@.@......#
        0x0010:  c13f 4b63 c543 0050 c958 b802 fc6e 645c  .?Kc.C.P.X...nd\
        0x0020:  8010 0073 6b4f 0000 0101 080a 005e 0c36  ...skO.......^.6
        0x0030:  2f62 e8fb 504f 5354 202f 7472 6163 652f  /b..POST./trace/
        0x0040:  6372 6561 7465 2048 5454 502f 312e 310d  create.HTTP/1.1.
        0x0050:  0a48 6f73 743a 2077 7777 2e6f 7065 6e73  .Host:.www.opens
        0x0060:  7472 6565 746d 6170 2e6f 7267 0d0a 5573  treetmap.org..Us
        0x0070:  6572 2d41 6765 6e74 3a20 4d6f 7a69 6c6c  er-Agent:.Mozill
        0x0080:  612f 352e 3020 2858 3131 3b20 4c69 6e75  a/5.0.(X11;.Linu
        0x0090:  7820 7838 365f 3634 3b20 7276 3a37 2e30  x.x86_64;.rv:7.0
        0x00a0:  2e31 2920 4765 636b 6f2f 3230 3130 3031  .1).Gecko/201001
        0x00b0:  3031 2046 6972 6566 6f78 2f37 2e30 2e31  01.Firefox/7.0.1
        0x00c0:  2049 6365 7765 6173 656c 2f37 2e30 2e31  .Iceweasel/7.0.1
        0x00d0:  0d0a 4163 6365 7074 3a20 7465 7874 2f68  ..Accept:.text/h
        0x00e0:  746d 6c2c 6170 706c 6963 6174 696f 6e2f  tml,application/
        0x00f0:  7868 746d 6c2b 786d 6c2c 6170 706c 6963  xhtml+xml,applic
        0x0100:  6174 696f 6e2f 786d 6c3b 713d 302e 392c  ation/xml;q=0.9,
        0x0110:  2a2f 2a3b 713d 302e 380d 0a41 6363 6570  */*;q=0.8..Accep
        0x0120:  742d 4c61 6e67 7561 6765 3a20 656e 2d75  t-Language:.en-u
        0x0130:  732c 656e 3b71 3d30 2e35 0d0a 4163 6365  s,en;q=0.5..Acce
        0x0140:  7074 2d45 6e63 6f64 696e 673a 2067 7a69  pt-Encoding:.gzi
        0x0150:  702c 2064 6566 6c61 7465 0d0a 4163 6365  p,.deflate..Acce
        0x0160:  7074 2d43 6861 7273 6574 3a20 4953 4f2d  pt-Charset:.ISO-
        0x0170:  3838 3539 2d31 2c75 7466 2d38 3b71 3d30  8859-1,utf-8;q=0
        0x0180:  2e37 2c2a 3b71 3d30 2e37 0d0a 436f 6e6e  .7,*;q=0.7..Conn
        0x0190:  6563 7469 6f6e 3a20 6b65 6570 2d61 6c69  ection:.keep-ali
        0x01a0:  7665 0d0a 5265 6665 7265 723a 2068 7474  ve..Referer:.htt
        0x01b0:  703a 2f2f 7777 772e 6f70 656e 7374 7265  p://www.openstre
        0x01c0:  6574 6d61 702e 6f72 672f 7472 6163 652f  etmap.org/trace/
        0x01d0:  6372 6561 7465 0d0a 436f 6f6b 6965 3a20  create..Cookie:.

And it transmits the UTF-8 filename as is:

        0x0540:  223b 2066 696c 656e 616d 653d 2230 382d  ";.filename="08-
        0x0550:  3031 2d30 3320 d09b d0b5 d0b1 d18f d0b6  01-03...........
        0x0560:  d18c d0b5 202d 20d0 a1d0 bed1 81d0 bdd0  .....-..........
        0x0570:  bed0 b2d1 8bd0 b920 d0b1 d0be d180 5f72  .............._r
        0x0580:  6f61 6473 2e67 7078 220d 0a43 6f6e 7465  oads.gpx"..Conte
        0x0590:  6e74 2d54 7970 653a 2061 7070 6c69 6361  nt-Type:.applica
        0x05a0:  7469 6f6e 2f6f 6374 6574 2d73 7472 6561  tion/octet-strea
        0x05b0:  6d0d 0a0d 0a3c 3f78 6d6c 2076 6572 7369  m....<?xml.versi
        0x05c0:  6f6e 3d27 312e 3027 2065 6e63 6f64 696e  on='1.0'.encodin
        0x05d0:  673d 2755 5446 2d38 273f 3e0a            g='UTF-8'?>.
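
To double-check that the filename bytes in this packet really are well-formed UTF-8, they can be decoded by hand. This is only a minimal sketch: the byte list is copied from the dump above (everything between `filename="` and the closing quote), and the variable names are made up for illustration.

```ruby
# encoding: utf-8

# Bytes copied from the hex dump above, starting right after `filename="`
# and stopping before the closing double quote (0x22).
hex_bytes = %w[
  30 38 2d 30 31 2d 30 33 20
  d0 9b d0 b5 d0 b1 d1 8f d0 b6 d1 8c d0 b5
  20 2d 20
  d0 a1 d0 be d1 81 d0 bd d0 be d0 b2 d1 8b d0 b9
  20
  d0 b1 d0 be d1 80
  5f 72 6f 61 64 73 2e 67 70 78
]

# Turn the hex pairs into a raw binary string, then tell Ruby to
# interpret those bytes as UTF-8 (no transcoding takes place).
filename = hex_bytes.map(&:hex).pack("C*").force_encoding(Encoding::UTF_8)

# valid_encoding? is false if the bytes are not well-formed UTF-8.
puts filename.valid_encoding?
puts filename
```

If the bytes were in some other encoding (KOI8-R, say), `valid_encoding?` would most likely report false, which is one cheap way for a server to tell that its UTF-8 assumption is wrong.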

A few TCP packets later, in the same POST request, comes the trace description field (which duplicates the filename):

        0x04c0:  0d0a 436f 6e74 656e 742d 4469 7370 6f73  ..Content-Dispos
        0x04d0:  6974 696f 6e3a 2066 6f72 6d2d 6461 7461  ition:.form-data
        0x04e0:  3b20 6e61 6d65 3d22 7472 6163 655b 6465  ;.name="trace[de
        0x04f0:  7363 7269 7074 696f 6e5d 220d 0a0d 0a30  scription]"....0
        0x0500:  382d 3031 2d30 3320 d09b d0b5 d0b1 d18f  8-01-03.........
        0x0510:  d0b6 d18c d0b5 202d 20d0 a1d0 bed1 81d0  .......-........
        0x0520:  bdd0 bed0 b2d1 8bd0 b920 d0b1 d0be d180  ................
        0x0530:  3b20 d0a2 d180 d0b5 d0ba 20d0 bfd0 bed0  ;...............
        0x0540:  bbd1 8cd0 b7d0 bed0 b2d0 b0d1 82d0 b5d0  ................
        0x0550:  bbd1 8f20 564f 524f 4e20 d181 2076 656c  ....VORON....vel
        0x0560:  6f70 6974 6572 2e73 7062 2e72 7520 0d0a  opiter.spb.ru...
        0x0570:  2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d  ----------------
        0x0580:  2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d32 3130  -------------210
        0x0590:  3839 3933 3230 3531 3332 3438 3038 3130  8993205132480810

It also arrives in UTF-8, but it is not converted to '_'.
So this looks like a server-side bug to me after all.

Do you know where in the source trace upload is handled? (Or at least, where is the source hosted? =)) I have more traces to upload, so I want to resolve this problem.

comment:3 Changed 7 years ago by TomH

It looks like it is something we are doing deliberately - the code is here:

http://git.openstreetmap.org/rails.git/blob/HEAD:/app/controllers/trace_controller.rb#l368

I would be reluctant to take any patch which changes that though without a thorough understanding of why it was put in and what the implications are (especially from a security point of view) of relaxing the sanitisation.

comment:4 Changed 7 years ago by one_half_3544

Well, I hoped to invite the author of that line of code to the conversation, but since you are already here:

3db4ac9a (Tom Hughes              2007-09-25 23:18:32 +0000 369)     name = file.original_filename.gsub(/[^a-zA-Z0-9.]/, '_')

=)

In general, modern filesystems (like ext3) should deal fine with UTF-8 filenames. Wikipedia, for example, accepts UTF-8 filenames as is, and, as far as I know, the MediaWiki engine stores them directly on the filesystem. Of course they could have changed that, or could have introduced their own sanity checks. I'll study that, but at first glance it seems that the regex could be changed to a UTF-8 equivalent of [:print:] without weakening security.
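
To make the difference concrete, here is a small sketch comparing the existing expression with one possible Unicode-aware whitelist. The second regex is only an assumption for illustration (letters and digits from any script plus the dot, rather than everything printable); it is not a reviewed patch, and the variable names are made up.

```ruby
# encoding: utf-8

# Filename from the upload captured earlier in this ticket.
original = "08-01-03 Лебяжье - Сосновый бор_roads.gpx"

# Current behaviour: the expression quoted from trace_controller.rb.
# Anything that is not an ASCII letter, ASCII digit or dot becomes '_'.
ascii_only = original.gsub(/[^a-zA-Z0-9.]/, '_')

# Hypothetical alternative: keep letters (\p{L}) and digits (\p{N}) from
# any script, plus the dot. Spaces, slashes, control characters and other
# punctuation are still replaced, as before.
unicode_aware = original.gsub(/[^\p{L}\p{N}.]/, '_')

puts ascii_only
puts unicode_aware
```

A letters-and-digits whitelist is narrower than [:print:], which would also let through characters such as '/' that matter if the name is ever used as a path. Either way, the dot is still allowed, so whatever checks exist around names like ".." would be unaffected by this change.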

comment:5 Changed 7 years ago by TomH

If you look more closely you'll see that I didn't actually write that line. All that happened in that commit is that I moved it as part of a refactoring of the code to merge the common parts of the two different paths for adding traces.

comment:6 Changed 7 years ago by one_half_3544

Oh, indeed

fb1467a9 (Steve Coast 2006-11-27 19:52:18 +0000 10)     @params['trace']['name'] = @params['trace']['gpx_file'].original_filename.gsub(/[^a-zA-Z0-9.]/, '_') # This makes sure filenames are sane

I'll try to contact him. I hope he remembers the reason for this.
