With FastAPI, How to add charset to content-type (media-type) on request header on OpenAPI (Swagger) doc?
With FastAPI, how do I add charset
to the Content-Type request header in the automatically generated OpenAPI (Swagger) docs?
@app.post("/")
def post_hello(username: str = Form(...)):
return {"Hello": username}
With the above route, the OpenAPI docs generated at http://<root_path>/docs
shows application/x-www-form-urlencoded
.
I tried this:
@app.post("/")
def post_hello(username: str = Form(..., media_type="application/x-www-form-urlencoded; charset=cp932")):
return {"Hello": "World!", "userName": username}
But the docs still show only application/x-www-form-urlencoded
.
I want to set application/x-www-form-urlencoded; charset=cp932
as the value
of Content-Type
in responses from this endpoint/path function. And I want the
form data recieved to be decoded using that encoding scheme.
Short answer
It doesn't appear to be a good idea in the general case; I don't think there's
an easy, built-in way to do it; and it may not be necessary.
Not a good idea (a.k.a. not standards-compliant)
This GitHub issue discusses why appending ;charset=UTF-8
to
application/json
isn't a good idea, and the same points
brought up there apply in this case.
The HTTP/1.1 specification says that the Content-Type
header lists the Media Type.
note: HTTP/2 shares these components with HTTP/1.1
The IANA manages the registry of all the commonly used Media Types (MIME)..
The entry for application/x-www-form-urlencoded
says:
Media type name: application
Media subtype name: x-www-form-urlencoded
Required parameters: No parameters
Optional parameters:
No parameters
Encoding considerations: 7bit
Compare this to the entry for text/html
:
MIME media type name : Text
MIME subtype name : Standards Tree - html
Required parameters : No required parameters
Optional parameters :
charset
The charset parameter may be provided to definitively specify the document's character encoding, overriding any character encoding declarations in the document. The parameter's value must be one of the labels of the character encoding used to serialize the file.
Encoding considerations : 8bit
The entry for application/x-www-form-urlencoded
does not allow for charset
to be added. So how should it be decoded from bytes? The URL spec states:
- Let nameString and valueString be the result of running UTF-8 decode without BOM on the percent-decoding of name and value, respectively.
It sounds like, whatever the encoding is, UTF-8 should always be used in decoding.
The current HTML/URL specification also has this note about
application/x-www-form-urlencoded
:
The
application/x-www-form-urlencoded
format is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices. In particular, readers are cautioned to pay close attention to the twisted details involving repeated (and in some cases nested) conversions between character encodings and byte sequences. Unfortunately the format is in widespread use due to the prevalence of HTML forms.
So it sounds like it wouldn't be a good idea to do something differently.
No built-in way
Note: the built-in way to do what these solutions do is to use a custom Request
class.
When building the /openapi.json
object, the current version of FastAPI
checks to see if a dependency is an instance of Form
, and then uses an empty
Form
instance to build the schema, even if the actual dependency is a
subclass of Form
.
The default value of the media_type
parameter of Form.__init__
is
application/x-www-form-urlencoded
, so every
endpoint/path function with a Form()
dependency will show the same Media Type
in the docs, even though the class __init__(
has a media_type
parameter.
There are a few ways of changing what's listed in /openapi.json
, which is
what's used to generate the documentation, and the FastAPI docs list one
official way.
For the example in the question, this would work:
from fastapi import FastAPI, Form
from fastapi.openapi.utils import get_openapi
app = FastAPI()
@app.post("/")
def post_hello(username: str = Form(...)):
return {"Hello": username}
def custom_openapi():
if app.openapi_schema:
return app.openapi_schema
app.openapi_schema = get_openapi(
title=app.title,
version=app.version,
openapi_version=app.openapi_version,
description=app.description,
terms_of_service=app.terms_of_service,
contact=app.contact,
license_info=app.license_info,
routes=app.routes,
tags=app.openapi_tags,
servers=app.servers,
)
requestBody = app.openapi_schema["paths"]["/"]["post"]["requestBody"]
content = requestBody["content"]
new_content = {
"application/x-www-form-urlencoded;charset=cp932": content[
"application/x-www-form-urlencoded"
]
}
requestBody["content"] = new_content
return app.openapi_schema
app.openapi = custom_openapi
It's worth noting that, with this change, the docs user interface changes how it presents the experimentation section:
Compared to how just application/x-www-form-urlencoded
with no charset
specified is shown:
The above changes would only change th Media Type listed in the docs. Any form data sent to an endpoint/path function would still be:
-
parsed by
python-multipart
(roughly following the same steps described in the specification) - decoded by
starlette
withLatin-1
So even if starlette
was changed to use a different encoding scheme for
decoding form data, python-multipart
still follows the steps outlined in the
specification to use hardcoded byte values for &
and ;
, for
example.
Fortunately, most* of the first 128 characters/codepoints are mapped to the same byte sequences between cp932 and UTF-8, so &
,
;
, and =
all come out the same.
*except for 0x5C
, which is sometimes ¥
One way to change starlette
to use cp932 encoding would be using a middleware:
import typing
from unittest.mock import patch
from urllib.parse import unquote_plus
import multipart
from fastapi import FastAPI, Form, Request, Response
from fastapi.openapi.utils import get_openapi
from multipart.multipart import parse_options_header
from starlette.datastructures import FormData, UploadFile
from starlette.formparsers import FormMessage, FormParser
app = FastAPI()
form_path = "/"
@app.post(form_path)
async def post_hello(username: str = Form(...)):
return {"Hello": username}
def custom_openapi():
if app.openapi_schema:
return app.openapi_schema
app.openapi_schema = get_openapi(
title=app.title,
version=app.version,
openapi_version=app.openapi_version,
description=app.description,
terms_of_service=app.terms_of_service,
contact=app.contact,
license_info=app.license_info,
routes=app.routes,
tags=app.openapi_tags,
servers=app.servers,
)
requestBody = app.openapi_schema["paths"]["/"]["post"]["requestBody"]
content = requestBody["content"]
new_content = {
"application/x-www-form-urlencoded;charset=cp932": content[
"application/x-www-form-urlencoded"
]
}
requestBody["content"] = new_content
return app.openapi_schema
app.openapi = custom_openapi
class CP932FormParser(FormParser):
async def parse(self) -> FormData:
"""
copied from:
https://github.com/encode/starlette/blob/0.17.1/starlette/formparsers.py#L72-L110
"""
# Callbacks dictionary.
callbacks = {
"on_field_start": self.on_field_start,
"on_field_name": self.on_field_name,
"on_field_data": self.on_field_data,
"on_field_end": self.on_field_end,
"on_end": self.on_end,
}
# Create the parser.
parser = multipart.QuerystringParser(callbacks)
field_name = b""
field_value = b""
items: typing.List[typing.Tuple[str, typing.Union[str, UploadFile]]] = []
# Feed the parser with data from the request.
async for chunk in self.stream:
if chunk:
parser.write(chunk)
else:
parser.finalize()
messages = list(self.messages)
self.messages.clear()
for message_type, message_bytes in messages:
if message_type == FormMessage.FIELD_START:
field_name = b""
field_value = b""
elif message_type == FormMessage.FIELD_NAME:
field_name += message_bytes
elif message_type == FormMessage.FIELD_DATA:
field_value += message_bytes
elif message_type == FormMessage.FIELD_END:
name = unquote_plus(field_name.decode("cp932")) # changed line
value = unquote_plus(field_value.decode("cp932")) # changed line
items.append((name, value))
return FormData(items)
class CustomRequest(Request):
async def form(self) -> FormData:
"""
copied from
https://github.com/encode/starlette/blob/0.17.1/starlette/requests.py#L238-L253
"""
if not hasattr(self, "_form"):
assert (
parse_options_header is not None
), "The `python-multipart` library must be installed to use form parsing."
content_type_header = self.headers.get("Content-Type")
content_type, options = parse_options_header(content_type_header)
if content_type == b"multipart/form-data":
multipart_parser = MultiPartParser(self.headers, self.stream())
self._form = await multipart_parser.parse()
elif content_type == b"application/x-www-form-urlencoded":
form_parser = CP932FormParser(
self.headers, self.stream()
) # use the custom parser above
self._form = await form_parser.parse()
else:
self._form = FormData()
return self._form
@app.middleware("http")
async def custom_form_parser(request: Request, call_next) -> Response:
if request.scope["path"] == form_path:
# starlette creates a new Request object for each middleware/app
# invocation:
# https://github.com/encode/starlette/blob/0.17.1/starlette/routing.py#L59
# this temporarily patches the Request object starlette
# uses with our modified version
with patch("starlette.routing.Request", new=CustomRequest):
return await call_next(request)
Then, the data has to be encoded manually:
>>> import sys
>>> from urllib.parse import quote_plus
>>> name = quote_plus("username").encode("cp932")
>>> value = quote_plus("cp932文字コード").encode("cp932")
>>> with open("temp.txt", "wb") as file:
... file.write(name + b"=" + value)
...
59
And sent as binary data:
$ curl -X 'POST' \
'http://localhost:8000/' \
-H 'accept: application/json' \
-H 'Content-Type: application/x-www-form-urlencoded;charset=cp932' \
--data-binary "@temp.txt" \
--silent \
| jq -C .
{
"Hello": "cp932文字コード"
}
Might not be necessary
In the manual encoding step, the output will look like:
username=cp932%E6%96%87%E5%AD%97%E3%82%B3%E3%83%BC%E3%83%89
Part of the percent-encoding step replaces any bytes representing
characters higher than 0x7E
(~
in ASCII) with bytes in a reduced ASCII
range. Since cp932 and UTF-8 both map these bytes to the same codepoints
(except for 0x5C
which may be \
or ¥
), the byte sequence will decode to
the same string:
$ curl -X 'POST' \
'http://localhost:8000/' \
-H 'accept: application/json' \
-H 'Content-Type: application/x-www-form-urlencoded;charset=cp932' \
--data-urlencode "username=cp932文字コード" \
--silent \
| jq -C .
{
"Hello": "cp932文字コード"
}
This is only true for percent-encoded data.
Any data sent without percent-encoding will be processed and interpreted
differently from how the sender intended it to be interpreted. For example, in
the OpenAPI (Swagger) docs, the "Try it out" experimentation section gives an
example with curl -d
(same as --data
), which doesn't urlencode the data:
$ curl -X 'POST' \
'http://localhost:8000/' \
-H 'accept: application/json' \
-H 'Content-Type: application/x-www-form-urlencoded' \
--data "username=cp932文字コード" \
--silent \
| jq -C .
{
"Hello": "cp932æ–‡å—コード"
}
It might still be a good idea to only use cp932 to process requests from senders that are configured in a way similar to the server.
One way to do this would be to modify the middleware function to only process data like this if the sender specifies that the data has been encoded with cp932:
import typing
from unittest.mock import patch
from urllib.parse import unquote_plus
import multipart
from fastapi import FastAPI, Form, Request, Response
from fastapi.openapi.utils import get_openapi
from multipart.multipart import parse_options_header
from starlette.datastructures import FormData, UploadFile
from starlette.formparsers import FormMessage, FormParser
app = FastAPI()
form_path = "/"
@app.post(form_path)
async def post_hello(username: str = Form(...)):
return {"Hello": username}
def custom_openapi():
if app.openapi_schema:
return app.openapi_schema
app.openapi_schema = get_openapi(
title=app.title,
version=app.version,
openapi_version=app.openapi_version,
description=app.description,
terms_of_service=app.terms_of_service,
contact=app.contact,
license_info=app.license_info,
routes=app.routes,
tags=app.openapi_tags,
servers=app.servers,
)
requestBody = app.openapi_schema["paths"]["/"]["post"]["requestBody"]
content = requestBody["content"]
new_content = {
"application/x-www-form-urlencoded;charset=cp932": content[
"application/x-www-form-urlencoded"
]
}
requestBody["content"] = new_content
return app.openapi_schema
app.openapi = custom_openapi
class CP932FormParser(FormParser):
async def parse(self) -> FormData:
"""
copied from:
https://github.com/encode/starlette/blob/0.17.1/starlette/formparsers.py#L72-L110
"""
# Callbacks dictionary.
callbacks = {
"on_field_start": self.on_field_start,
"on_field_name": self.on_field_name,
"on_field_data": self.on_field_data,
"on_field_end": self.on_field_end,
"on_end": self.on_end,
}
# Create the parser.
parser = multipart.QuerystringParser(callbacks)
field_name = b""
field_value = b""
items: typing.List[typing.Tuple[str, typing.Union[str, UploadFile]]] = []
# Feed the parser with data from the request.
async for chunk in self.stream:
if chunk:
parser.write(chunk)
else:
parser.finalize()
messages = list(self.messages)
self.messages.clear()
for message_type, message_bytes in messages:
if message_type == FormMessage.FIELD_START:
field_name = b""
field_value = b""
elif message_type == FormMessage.FIELD_NAME:
field_name += message_bytes
elif message_type == FormMessage.FIELD_DATA:
field_value += message_bytes
elif message_type == FormMessage.FIELD_END:
name = unquote_plus(field_name.decode("cp932")) # changed line
value = unquote_plus(field_value.decode("cp932")) # changed line
items.append((name, value))
return FormData(items)
class CustomRequest(Request):
async def form(self) -> FormData:
"""
copied from
https://github.com/encode/starlette/blob/0.17.1/starlette/requests.py#L238-L253
"""
if not hasattr(self, "_form"):
assert (
parse_options_header is not None
), "The `python-multipart` library must be installed to use form parsing."
content_type_header = self.headers.get("Content-Type")
content_type, options = parse_options_header(content_type_header)
if content_type == b"multipart/form-data":
multipart_parser = MultiPartParser(self.headers, self.stream())
self._form = await multipart_parser.parse()
elif content_type == b"application/x-www-form-urlencoded":
form_parser = CP932FormParser(
self.headers, self.stream()
) # use the custom parser above
self._form = await form_parser.parse()
else:
self._form = FormData()
return self._form
@app.middleware("http")
async def custom_form_parser(request: Request, call_next) -> Response:
if request.scope["path"] != form_path:
return await call_next(request)
content_type_header = request.headers.get("content-type", None)
if not content_type_header:
return await call_next(request)
media_type, options = parse_options_header(content_type_header)
if b"charset" not in options or options[b"charset"] != b"cp932":
return await call_next(request)
# starlette creates a new Request object for each middleware/app
# invocation:
# https://github.com/encode/starlette/blob/0.17.1/starlette/routing.py#L59
# this temporarily patches the Request object starlette
# uses with our modified version
with patch("starlette.routing.Request", new=CustomRequest):
return await call_next(request)
Security
Even with this modification, I think the note in the spec about parsing content with percent-decode should also be highlighted:
⚠ Warning! Using anything but UTF-8 decode without BOM when input contains bytes that are not ASCII bytes might be insecure and is not recommended.
So I would be cautious about implementing any of these solutions at all.