Ticket #1936 (closed defect: fixed)

Opened 5 years ago

Last modified 5 years ago

Potlatch enters invalid UTF-8 into the database, resulting in the API outputing invalid XML

Reported by: avarab@… Owned by: richard@…
Priority: minor Milestone:
Component: potlatch (flash editor) Version:
Keywords: potlatch, api, rails Cc:

Description

When using Potlatch 1.0 on Ubuntu with Firefox 3.0.10 and Shockwave Flash 10.0 r22 entering (note: not copy/pasting) non-ASCII data into Potlatch will result in double encoded data being saved into the database. E.g. if I enter:

Hörgárbraut

In Potlatch it will be shown as:

Hörgárbraut

This is a known bug in itself. However Potlatch should do validation on this data before it saves it to the database. When I try to create a changeset in JOSM with this comment the API will respond with HTTP/1.1 400 Bad Request. Potlatch however will happily save it, resulting in this:

openstreetmap=# SELECT * FROM "changeset_tags" WHERE (("changeset_tags"."id" = 20));
 id |     k      |              v
----+------------+-----------------------------
  2 | created_by | Potlatch 1.0
  2 | comment    | Moved H\x03örg\x03árbraut

The API will then read that when serving read request via the API and output invalid XML as a result:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="OpenStreetMap server">
  <changeset id="20" user="Ævar Arnfjörð Bjarmason" uid="4" [... snip ...]>
    <tag k="created_by" v="Potlatch 1.0"/>
    <tag k="comment" v="Hörgárbraut"/>
  </changeset>
</osm>

See xmlstarlet validation output:

avar@aoeu:/tmp$ xmlstarlet validate 20 .xml
20.xml:5: parser error : invalid character in attribute value
    <tag k="comment" v="Hörgárbraut"/>
                         ^
20.xml:5: parser error : attributes construct error
    <tag k="comment" v="Hörgárbraut"/>
                         ^
20.xml:5: parser error : Couldn't find end of Start Tag tag line 5
    <tag k="comment" v="Hörgárbraut"/>
                         ^
20.xml:5: parser error : PCDATA invalid Char value 3
    <tag k="comment" v="Hörgárbraut"/>
                         ^
20.xml:5: parser error : PCDATA invalid Char value 3
    <tag k="comment" v="Hörgárbraut"/>
                                ^
20.xml - invalid

So both Potlatch and the API are at fault, the API should handle invalid byte sequences in the database and output valid XML anyway, but if Potlatch did UTF-8 validation in the first place that invalid data wouldn't have been there.

Attachments

create_changeset.xml Download (221 bytes) - added by avarab@… 5 years ago.
An invalid changeset with the API rejects and Potlatch accepts

Change History

comment:1 follow-up: ↓ 6 Changed 5 years ago by Richard

Sure, but you'll have to tell me how to do UTF-8 validation. I know nothing about it (and, in traditional Anglo-centric fashion, care less ;) ).

comment:2 Changed 5 years ago by tom@…

  • Priority changed from critical to minor

I believe one of the bots actually goes round fixing these things up in fact...

comment:3 Changed 5 years ago by avarab@…

As an example of the main API rejecting this:

create_changeset.xml:

<?xml version='1.0' encoding='UTF-8'?>
<osm version='0.6' generator='me'>
  <changeset visible='true'>
    <tag k='created_by' v='Ban Potlatch 1.0' />
    <tag k='comment' v='Hörgárbraut' />
  </changeset>
</osm>

Then POST it with curl:

$ curl -u "avarab@gmail.com":avarfoobar -i -o - -T create_changeset.txt 'http://osm.compton.nu/api/0.6/changeset/create'
HTTP/1.1 100 Continue

HTTP/1.1 100 Continue

HTTP/1.1 400 Bad Request
Date: Mon, 08 Jun 2009 15:28:43 GMT
Server: Apache/2.2.11 (Fedora) DAV/2 Phusion_Passenger/2.2.2 PHP/5.2.9 mod_python/3.3.1 Python/2.5.2
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 2.2.2
X-Runtime: 65
Cache-Control: no-cache
Error: Cannot parse valid Changeset from xml string <?xml version='1.0' encoding='UTF-8'?>, <osm version='0.6' generator='me'>, <changeset visible='true'>, <tag k='created_by' v='Ban Potlatch 1.0' />, <tag k='comment' v='Hörgárbraut' />, </changeset>, </osm>, .
Content-Length: 268
Status: 400
Content-Type: text/html; charset=utf-8
Connection: close

Cannot parse valid Changeset from xml string <?xml version='1.0' encoding='UTF-8'?>
<osm version='0.6' generator='me'>
  <changeset visible='true'>
    <tag k='created_by' v='Ban Potlatch 1.0' />
    <tag k='comment' v='Hörgárbraut' />
  </changeset>
</osm>
. 

Changed 5 years ago by avarab@…

An invalid changeset with the API rejects and Potlatch accepts

comment:4 Changed 5 years ago by tom@…

The reason the XML API rejects it is just that we us libxml2 to parse the incoming XML and that rejects it.

comment:5 Changed 5 years ago by Matt

Richard: in an anglo-saxon fashion - well, more of a teutonic fashion - i have already fixed it for you. see  http://trac.openstreetmap.org/browser/sites/rails_port/lib/validators.rb#L21 and call it with any string you'd like to check.

comment:6 in reply to: ↑ 1 Changed 5 years ago by avarab@…

Replying to Richard:

Sure, but you'll have to tell me how to do UTF-8 validation. I know nothing about it (and, in traditional Anglo-centric fashion, care less ;) ).

This is the string that Potlatch saves into the database, escaped:

H^C\303\203\302\266rg^C\303\203\302\241rbraut

The bit that makes this invalid UTF-8 is the C, the API would e.g. accept this:

H\303\203\302\266rg\303\203\302\241rbraut

Now, C here is \03, or ETX (end of text):

$ hexdump -C ban-potlatch 
00000000  48 03 c3 83 c2 b6 72 67  03 c3 83 c2 a1 72 62 72  |H.....rg.....rbr|
00000010  61 75 74 0a                                       |aut.|
00000014

But as this is a double-encoding issue in the first place perhaps you could do better than merely validating that the data is correct (& rejecting it), but automagically fix it too:

$ perl -MEncode=decode,encode -E '$s = qq[H^C\303\203\302\266rg^C\303\203\302\241rbraut]; say decode "utf-8", $s'
H^Cörg^Cárbraut

I.e. when you encounter \03 slurp up the trailing non-ASCII characters and presume that they're double-encoded data, and decode them.

If you did this then people could happily type "Hörgárbraut" into Potlatch, which flash on linux would convert into " H\03\303\203\302\266rg\03\303\203\302\241rbraut", which the server side of Potlatch could then convert back into "Hörgárbraut".

comment:7 follow-up: ↓ 8 Changed 5 years ago by Richard

  • Status changed from new to closed
  • Resolution set to fixed

comment:8 in reply to: ↑ 7 Changed 5 years ago by avarab@…

Replying to Richard:

.. in [16525]

comment:9 Changed 5 years ago by avarab@…

And see related ticket:2072 which deals with the general issue of Potlatch not doing server-side validation of client-supplied data.

Note: See TracTickets for help on using tickets.