API validation does not successfully validate existing data #1927

@dotwaffle

Description

Describe the bug
Validation rules for API fields do not match the data currently in the database, so it is not possible to import the existing data into a spec-compliant API implementation.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'https://www.peeringdb.com/apidocs/#tag/api/operation/create%20net'
  2. See that "aka" has validation of "<= 255 characters".
  3. Run: curl -fsSL "http://peeringdb.com/api/net/6960" | jq '.data.[0].aka | length'
    (output) 254 (<= 255, so seems ok...)
  4. Run: curl -fsSL "http://peeringdb.com/api/net/6960" | jq -j '.data.[0].aka' | wc -c
    (output) 256 (> 255)

So, what's going on here? My assumption is that the validation is counting UTF-8 code points (literally: characters) rather than bytes. Let's verify that:
5. Run: curl -fsSL "http://peeringdb.com/api/net/6960" | jq '.data.[0].aka'
(output) "26617 Navega 23243 Comcel Guatemala 27773 Millicom El Salvador 17079 Telemóvil El Salvador 52262 Telefónica Celular 23383 Metrored 20299 Newcom Limited 262197 Millicom Costa Rica 52362 SICESA 18809 Cable Onda 27742 Amnet 28036 Telefonia Celular Nicaragua"
"Telemóvil" is 9 characters but 10 bytes; the same applies to "Telefónica". Each accented character encodes to two bytes in UTF-8, so the byte length exceeds 255 even though the character count does not.

Now, theoretically, what is happening could be construed as correct, except that you can't use raw UTF-8 in URIs; it has to be percent-encoded: https://swagger.io/specification/#illegal-variable-names-as-parameter-names

While OpenAPI 3.0/3.1 is unclear on what to do here, OpenAPI 3.2 is more explicit:

The maxLength keyword MAY be used to set an expected upper bound on the length of a streaming payload that consists of either string data, including encoded binary data, or unencoded binary data. For unencoded binary data, the length is the number of octets.

ref: https://spec.openapis.org/oas/v3.2.0.html#binary-streams

Who is affected by the problem?
Anyone using the validation parts of the API, and not using peeringdb-py (e.g. me, using Go).

What is the impact?

The current PeeringDB data does not import, causing a validation failure.

Are there security concerns?
No. Well, there is a potential DoS on API consumers who validate, if someone were to start straddling the limits on particular fields, but that is already possible today.

Are there privacy concerns?
Nope.

What are the proposed actions?
Count bytes instead of characters/runes for validation purposes, and truncate (and notify the owner of) any data currently in the database that does not conform.
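The byte-based truncation could be sketched in Go as follows. `truncateBytes` is a hypothetical helper, not existing PeeringDB code; the only subtlety is backing up to a rune boundary so a multi-byte character is never split in half:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBytes trims s to at most max bytes without splitting a
// multi-byte UTF-8 sequence, so the result is always valid UTF-8.
func truncateBytes(s string, max int) string {
	if len(s) <= max {
		return s
	}
	// Back up past any continuation bytes so the cut lands on a rune boundary.
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

func main() {
	// Cutting "Telemóvil" at 6 bytes would land inside "ó", so we back up to 5.
	fmt.Println(truncateBytes("Telemóvil", 6)) // "Telem"
}
```

Validation itself then becomes a simple `len(s) <= 255` check, which matches what a byte-counting `maxLength` consumer will enforce.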

What is the proposed priority?
Not that urgent, though it is blocking me right now.

Provide a rationale for any/all of the above
Moving from ISO 8859-1 / ISO 8859-15 (one byte per character) to UTF-8 (variable-length characters)... I guess you could blame Ken Thompson and Rob Pike back in 1992, but on the whole I think UTF-8 has been a good thing.

Additional context
I have a working (except for this) PeeringDB caching proxy written in Go, able to run on fly.io with geo-replicas (using LiteFS) around the globe, making latency much lower and able to answer thousands of queries per second without breaking a sweat. It would essentially close #1835 and #1865, and would keep operating during any maintenance or outages, albeit with data up to 15 minutes old. Operating 9 POPs (LHR / IAD / GRU / SIN / SYD / NRT / JNB / BOM / SJC) would cost USD 81.22 / month, roughly half of which is bandwidth (100 GB/month at each POP). This is practically the only thing left needing fixing before I can deploy it with live data as a showcase.
