Skip to content
This repository has been archived by the owner on Jul 24, 2021. It is now read-only.

Potlatch enters invalid UTF-8 into the database, resulting in the API outputing invalid XML #1936

Closed
openstreetmap-trac opened this issue Jul 23, 2021 · 8 comments

Comments

@openstreetmap-trac
Copy link

Reporter: avarab[at]gmail.com
[Submitted to the original trac issue database at 3.02pm, Monday, 8th June 2009]

When using Potlatch 1.0 on Ubuntu with Firefox 3.0.10 and Shockwave Flash 10.0 r22 entering (note: not copy/pasting) non-ASCII data into Potlatch will result in double encoded data being saved into the database. E.g. if I enter:

Hrgrbraut

In Potlatch it will be shown as:

Hrgrbraut

This is a known bug in itself. However Potlatch should do validation on this data before it saves it to the database. When I try to create a changeset in JOSM with this comment the API will respond with HTTP/1.1 400 Bad Request. Potlatch however will happily save it, resulting in this:

openstreetmap=# SELECT * FROM "changeset_tags" WHERE (("changeset_tags"."id" = 20));
 id |     k      |              v
----+------------+-----------------------------
  2 | created_by | Potlatch 1.0
  2 | comment    | Moved H\x03rg\x03rbraut

The API will then read that when serving read request via the API and output invalid XML as a result:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="OpenStreetMap server">
  <changeset id="20" user="var Arnfjr Bjarmason" uid="4" [... snip ...]>
    <tag k="created_by" v="Potlatch 1.0"/>
    <tag k="comment" v="H�rg�rbraut"/>
  </changeset>
</osm>

See xmlstarlet validation output:

avar@aoeu:/tmp$ xmlstarlet validate 20 .xml
20.xml:5: parser error : invalid character in attribute value
    <tag k="comment" v="Hrgrbraut"/>
                         ^
20.xml:5: parser error : attributes construct error
    <tag k="comment" v="Hrgrbraut"/>
                         ^
20.xml:5: parser error : Couldn't find end of Start Tag tag line 5
    <tag k="comment" v="Hrgrbraut"/>
                         ^
20.xml:5: parser error : PCDATA invalid Char value 3
    <tag k="comment" v="Hrgrbraut"/>
                         ^
20.xml:5: parser error : PCDATA invalid Char value 3
    <tag k="comment" v="Hrgrbraut"/>
                                ^
20.xml - invalid

So both Potlatch and the API are at fault, the API should handle invalid byte sequences in the database and output valid XML anyway, but if Potlatch did UTF-8 validation in the first place that invalid data wouldn't have been there.

@openstreetmap-trac
Copy link
Author

Author: Richard
[Added to the original trac issue at 3.06pm, Monday, 8th June 2009]

Sure, but you'll have to tell me how to do UTF-8 validation. I know nothing about it (and, in traditional Anglo-centric fashion, care less ;) ).

@openstreetmap-trac
Copy link
Author

Author: tom[at]compton.nu
[Added to the original trac issue at 3.08pm, Monday, 8th June 2009]

I believe one of the bots actually goes round fixing these things up in fact...

@openstreetmap-trac
Copy link
Author

Author: avarab[at]gmail.com
[Added to the original trac issue at 3.32pm, Monday, 8th June 2009]

As an example of the main API rejecting this:

create_changeset.xml:

<?xml version='1.0' encoding='UTF-8'?>
<osm version='0.6' generator='me'>
  <changeset visible='true'>
    <tag k='created_by' v='Ban Potlatch 1.0' />
    <tag k='comment' v='Hrgrbraut' />
  </changeset>
</osm>

Then POST it with curl:

$ curl -u "avarab[at]gmail.com":avarfoobar -i -o - -T create_changeset.txt 'http://osm.compton.nu/api/0.6/changeset/create'
HTTP/1.1 100 Continue

HTTP/1.1 100 Continue

HTTP/1.1 400 Bad Request
Date: Mon, 08 Jun 2009 15:28:43 GMT
Server: Apache/2.2.11 (Fedora) DAV/2 Phusion_Passenger/2.2.2 PHP/5.2.9 mod_python/3.3.1 Python/2.5.2
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 2.2.2
X-Runtime: 65
Cache-Control: no-cache
Error: Cannot parse valid Changeset from xml string <?xml version='1.0' encoding='UTF-8'?>, <osm version='0.6' generator='me'>, <changeset visible='true'>, <tag k='created_by' v='Ban Potlatch 1.0' />, <tag k='comment' v='H�rg�rbraut' />, </changeset>, </osm>, .
Content-Length: 268
Status: 400
Content-Type: text/html; charset=utf-8
Connection: close

Cannot parse valid Changeset from xml string <?xml version='1.0' encoding='UTF-8'?>
<osm version='0.6' generator='me'>
  <changeset visible='true'>
    <tag k='created_by' v='Ban Potlatch 1.0' />
    <tag k='comment' v='H�rg�rbraut' />
  </changeset>
</osm>
. 

@openstreetmap-trac
Copy link
Author

Author: tom[at]compton.nu
[Added to the original trac issue at 3.33pm, Monday, 8th June 2009]

The reason the XML API rejects it is just that we us libxml2 to parse the incoming XML and that rejects it.

@openstreetmap-trac
Copy link
Author

Author: Matt
[Added to the original trac issue at 3.39pm, Monday, 8th June 2009]

Richard: in an anglo-saxon fashion - well, more of a teutonic fashion - i have already fixed it for you. see http://trac.openstreetmap.org/browser/sites/rails_port/lib/validators.rb#L21 and call it with any string you'd like to check.

@openstreetmap-trac
Copy link
Author

Author: avarab[at]gmail.com
[Added to the original trac issue at 3.56pm, Monday, 8th June 2009]

Replying to [comment:1 Richard]:

Sure, but you'll have to tell me how to do UTF-8 validation. I know nothing about it (and, in traditional Anglo-centric fashion, care less ;) ).

This is the string that Potlatch saves into the database, escaped:

H^C\303\203\302\266rg^C\303\203\302\241rbraut

The bit that makes this invalid UTF-8 is the ^C, the API would e.g. accept this:

H\303\203\302\266rg\303\203\302\241rbraut

Now, ^C here is \03, or ETX (end of text):

$ hexdump -C ban-potlatch 
00000000  48 03 c3 83 c2 b6 72 67  03 c3 83 c2 a1 72 62 72  |H.....rg.....rbr|
00000010  61 75 74 0a                                       |aut.|
00000014

But as this is a double-encoding issue in the first place perhaps you could do better than merely validating that the data is correct (& rejecting it), but automagically fix it too:

$ perl -MEncode=decode,encode -E '$s = qq[H^C\303\203\302\266rg^C\303\203\302\241rbraut]; say decode "utf-8", $s'
H^Crg^Crbraut

I.e. when you encounter \03 slurp up the trailing non-ASCII characters and presume that they're double-encoded data, and decode them.

If you did this then people could happily type "Hrgrbraut" into Potlatch, which flash on linux would convert into "
H\03\303\203\302\266rg\03\303\203\302\241rbraut", which the server side of Potlatch could then convert back into "Hrgrbraut".

@openstreetmap-trac
Copy link
Author

Author: avarab[at]gmail.com
[Added to the original trac issue at 5.54pm, Thursday, 16th July 2009]

Replying to [comment:7 Richard]:

.. in [16525]

@openstreetmap-trac
Copy link
Author

Author: avarab[at]gmail.com
[Added to the original trac issue at 6.04pm, Thursday, 16th July 2009]

And see related ticket:2072 which deals with the general issue of Potlatch not doing server-side validation of client-supplied data.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant