Multipart files unicode filename #2117

Closed

Contributor

Starting from 2.0 requests does not send filename attribute of Content-Disposition header for multipart files with unicode names. Instead of this attribute with name filename* is sent.

# requests 1.2.3
>>> requests.post('http://ya.ru', files={'file': (u'файл', '123')}).request.body
'--db7a9522a6344e26a4ca2933aecad887\r\nContent-Disposition: form-data; name="file"; filename="\xd1\x84\xd0\xb0\xd0\xb9\xd0\xbb"\r\nContent-Type: application/octet-stream\r\n\r\n123\r\n--db7a9522a6344e26a4ca2933aecad887--\r\n'

# requests 2.0
>>> requests.post('http://ya.ru', files={'file': (u'файл', '123')}).request.body
'--a9f0de2871da46df86140bc5b72fc722\r\nContent-Disposition: form-data; name="file"; filename*=utf-8\'\'%D1%84%D0%B0%D0%B9%D0%BB\r\n\r\n123\r\n--a9f0de2871da46df86140bc5b72fc722--\r\n'

And this is a big problem, because looks like some systems does not recognize such fields as files. At least we encountered a problem with Django. Django places entire file's content in request.POST instead of request.FILES. It is clear from sources:
https://github.com/django/django/blob/1.7c1/django/http/multipartparser.py#L599-L601

sigmavirus24

Contributor

This is actually a bug in Django then. Using this syntax is what we're supposed to be using. We have to indicate to the server that we're sending a field whose content is not ASCII or Latin-1 (ISO-8859-1). The proper way to do so is the syntax you see there. It is defined in Section 3.2.1 of RFC 5987. This bug should be filed against Django instead for not conforming to the proper handling of that value. Edit Note specifically that parameter is redefined as reg-parameter or ext-parameter where ext-parameter is defined as parmname (e.g., filename) concatenated with a * character followed by = and the ext-value. This confirms that this is the proper handling of those header values. I believe this RFC also is applied to MIME header values which are what you're using in your multipart/form-data upload.

homm

ContributorAuthor

I test php 5.5.9 server, and it also does not understand form-data with filename* attribute as files. I think requests should be close to real world then to rfc, especially when compromise is possible. How about both attributes?

Lukasa

Member

It's unclear what the compromise should be. What text encoding should we use?

homm

ContributorAuthor

Compromise is using both attributes. Encoding is not so important. It can be even filename=blah.bin, this will be enough for server to recognize files.

Lukasa

Member

Filename is hugely important. If we just make random choices, I guarantee that people will simply raise a new bug report claiming that we mangled their filenames.

If you don't like the way we do it, encode the filename yourself. =)

homm

ContributorAuthor

Feel the difference: "I can't even upload file to majority of servers" and "server does not understand file name".

There are two cases:

server understand unicode. It will use filename* property. No problems.
server does not understand unicode. It will not be able to receive files OR It will not be able to recognize name. Wait, it already does not understand unicode, it will not be able to recognize name in any case.

homm

ContributorAuthor

By the way, before 2.0 requests use some encoding, and as I know everything works well.

Lukasa

Member

Resist the temptation to say "Understands unicode". That phrase is meaningless when it comes to networked connections. The server needs to understand the specific text encoding we've chosen, and there is no consistency here. We will get this wrong, and what happens in this situation is totally unclear. Some servers will decode the text and get lucky, and read the filename as something totally obtuse and unclear, and that would be terrible.

I would rather fail clearly and allow the user to make the choice than to guess and get it wrong. This is not PHP.

sigmavirus24

Contributor

Compromise is using both attributes

How would using both attributes work? If you read the grammar I linked to, it updates the meaning to say that you can only send one, i.e., either filename= or filename*=. If you send both, you're sending the wrong thing.

By the way, before 2.0 requests use some encoding, and as I know everything works well.

It worked purely by chance that those characters could be encoded to bytes. We were just sending whatever the bytestring was (if you look at the string it is what you would get if you took your original string and called encode('<encoding>'). You get the literal bytes and that's what we sent. You could potentially still encode it yourself and then pass that as the filename. You really should only be giving us bytes anyway.

Seeing as this is not a bug, and there is no good reason for us to change how we handle what is given us by the user, I'm closing this.

sigmavirus24

closed this as completed

on Jul 2, 2014

homm

ContributorAuthor

How would using both attributes work? If you read the grammar I linked to, it updates the meaning to say that you can only send one. If you send both, you're sending the wrong thing.

This is totally wrong.

This specification suggests that a
parameter using the extended syntax takes precedence.  This would
allow producers to use both formats without breaking recipients that
do not understand the extended syntax yet.

sigmavirus24

Contributor

Interesting. I had forgotten that we could send both. Regardless, we can't provide an ASCII representation for you because we can't guess at how to generate ASCII from what you give us. You would still have to generate the value yourself because there's no sane way to provide that functionality via an API.

homm

ContributorAuthor

When I try to encode filename by myself as requests 1.2.3 does, I've got a error:

>>> requests.post('http://ubuntu:8000/', files={'file': (u'файл.png'.encode('utf-8'), '123')})
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 10: ordinal not in range(128)

I can't use just ascii representation for filename, because there is no way to send both filename with ascii and encoded filename*.

Lukasa

Member

There's so much going on in that line of code I don't know where the breakage is. Can you split that up into multiple lines so I can see which action causes that traceback?

homm

ContributorAuthor

>>> filename = u'файл.png'.encode('utf-8')
>>> requests.post('http://ubuntu:8000/', files={
  'file': (filename, '123')
})

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-5-42737369472a> in <module>()
----> 1 import requests; print requests.post('http://ubuntu:8000/', files={'file': (u'файл.png'.encode('utf-8'), '123')})

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/api.pyc in post(url, data, **kwargs)
     86     """
     87 
---> 88     return request('post', url, data=data, **kwargs)
     89 
     90 

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/api.pyc in request(method, url, **kwargs)
     42 
     43     session = sessions.Session()
---> 44     return session.request(method=method, url=url, **kwargs)
     45 
     46 

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert)
    420             hooks = hooks,
    421         )
--> 422         prep = self.prepare_request(req)
    423 
    424         proxies = proxies or {}

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/sessions.pyc in prepare_request(self, request)
    358             auth=merge_setting(auth, self.auth),
    359             cookies=merged_cookies,
--> 360             hooks=merge_hooks(request.hooks, self.hooks),
    361         )
    362         return p

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/models.pyc in prepare(self, method, url, headers, files, data, params, auth, cookies, hooks)
    295         self.prepare_headers(headers)
    296         self.prepare_cookies(cookies)
--> 297         self.prepare_body(data, files)
    298         self.prepare_auth(auth, url)
    299         # Note that prepare_auth must be last to enable authentication schemes

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/models.pyc in prepare_body(self, data, files)
    430             # Multi-part file uploads.
    431             if files:
--> 432                 (body, content_type) = self._encode_files(files, data)
    433             else:
    434                 if data:

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/models.pyc in _encode_files(files, data)
    147             rf = RequestField(name=k, data=fp.read(),
    148                               filename=fn, headers=fh)
--> 149             rf.make_multipart(content_type=ft)
    150             new_fields.append(rf)
    151 

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.pyc in make_multipart(self, content_disposition, content_type, content_location)
    173         """
    174         self.headers['Content-Disposition'] = content_disposition or 'form-data'
--> 175         self.headers['Content-Disposition'] += '; '.join(['', self._render_parts((('name', self._name), ('filename', self._filename)))])
    176         self.headers['Content-Type'] = content_type
    177         self.headers['Content-Location'] = content_location

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.pyc in _render_parts(self, header_parts)
    136         for name, value in iterable:
    137             if value:
--> 138                 parts.append(self._render_part(name, value))
    139 
    140         return '; '.join(parts)

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.pyc in _render_part(self, name, value)
    116             The value of the parameter, provided as a unicode string.
    117         """
--> 118         return format_header_param(name, value)
    119 
    120     def _render_parts(self, header_parts):

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.pyc in format_header_param(name, value)
     41         result = '%s="%s"' % (name, value)
     42         try:
---> 43             result.encode('ascii')
     44         except UnicodeEncodeError:
     45             pass

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 10: ordinal not in range(128)

22 remaining items

to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

Labels

No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Multipart files unicode filename #2117

22 remaining items

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Participants

Multipart files unicode filename #2117

Description

Activity

sigmavirus24 commented on Jul 2, 2014

homm commented on Jul 2, 2014

Lukasa commented on Jul 2, 2014

homm commented on Jul 2, 2014

Lukasa commented on Jul 2, 2014

homm commented on Jul 2, 2014

homm commented on Jul 2, 2014

Lukasa commented on Jul 2, 2014

sigmavirus24 commented on Jul 2, 2014

homm commented on Jul 2, 2014

sigmavirus24 commented on Jul 2, 2014

homm commented on Jul 3, 2014

Lukasa commented on Jul 3, 2014

homm commented on Jul 3, 2014

22 remaining items

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Participants

Issue actions