How To Upload Files Larger Than 5 GB On Box

It can be a real pain to upload huge files. Many services limit their upload sizes to a few megabytes, and you don't want a single connection open forever either. The super simple way to get around that is to send the file in lots of small parts, aka chunking.

UPDATE: Check out the new article, which includes adding parallel chunking for speed improvements.

The finished code example can be viewed on GitHub.

So there are going to be two parts to making this work, the front-end (website) and the backend (server). Let's start with what the user will see.

Webpage with Dropzone.js

Beautiful, ain't it? The best part is, the code powering it is just as succinct.

<!doctype html>
<html lang="en">
<head>

    <meta charset="UTF-8">

    <link rel="stylesheet"
     href="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/dropzone.min.css"/>

    <link rel="stylesheet"
     href="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/basic.min.css"/>

    <script type="application/javascript"
     src="https://cdnjs.cloudflare.com/ajax/libs/dropzone/5.4.0/min/dropzone.min.js">
    </script>

    <title>File Dropper</title>
</head>
<body>

<form method="POST" action='/upload' class="dropzone dz-clickable"
      id="dropper" enctype="multipart/form-data">
</form>

</body>
</html>

This is using the dropzone.js library, which has no additional dependencies and decent CSS included. All you have to do is add the class "dropzone" to a form and it automatically turns it into one of their special drag and drop fields (you can also click and select).

However, by default, dropzone does not chunk files. Luckily, it is really easy to enable. We are going to add some custom JavaScript and insert it between the form and the end of the body:

</form>

<script type="application/javascript">
    Dropzone.options.dropper = {
        paramName: 'file',
        chunking: true,
        forceChunking: true,
        url: '/upload',
        maxFilesize: 1025, // megabytes
        chunkSize: 1000000 // bytes
    }
</script>

</body>

When chunking is enabled, dropzone will break up any files larger than the chunkSize and send them to the server over multiple requests. It accomplishes this by adding form data that has information about the chunk (uuid, current chunk, total chunks, chunk size, total size). By default, anything under that size will not have that information sent as part of the form data, and the server would need an additional logic path to handle it. Thankfully, there is the forceChunking option, which will always send that information, even if it's a smaller file. Everything else is pretty self-explanatory, but if you want more details about the possible options, just check out their list of configuration options.
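To make the splitting concrete, here is a minimal sketch of the arithmetic behind chunking. It is not Dropzone's actual implementation; the function name chunk_plan is made up for illustration, and it only shows how a total size and a chunk size translate into chunk indexes, byte offsets and lengths.

```python
import math


def chunk_plan(total_size, chunk_size):
    """Return a list of (index, byte_offset, length) tuples, one per chunk.

    Hypothetical helper for illustration only; it mirrors the idea of
    chunking, not Dropzone's internal code.
    """
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    # In this sketch even an empty file counts as a single (empty) chunk
    total_chunks = max(1, math.ceil(total_size / chunk_size))
    plan = []
    for index in range(total_chunks):
        offset = index * chunk_size
        # The last chunk is usually smaller than chunk_size
        length = min(chunk_size, total_size - offset)
        plan.append((index, offset, length))
    return plan


# A 1,191,708 byte file with 1,000,000 byte chunks needs two requests
plan = chunk_plan(1191708, 1000000)
for index, offset, length in plan:
    print(f'chunk {index}: offset={offset} length={length}')
```

Note that only the final chunk is shorter than the configured chunk size, which is why the server cannot assume every request carries exactly chunkSize bytes.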

Python Flask Server

Onto the backend. I am going to be using Flask, which is currently the most popular Python web framework (by GitHub stars); other good options include Bottle and CherryPy. If you hate yourself or your colleagues, you could also use Django or Pyramid. There are a ton of good example Flask projects and boilerplates to start from. I am going to use one that I have created for my own use that fits my needs, but don't feel obligated to use it.

This type of upload will work across any real website back-end. You will simply need two routes, one that displays the frontend, and the other that accepts the file as an upload. At first, let's just view what dropzone is sending us. In this example my project is called 'pydrop', and if you're using my FlaskBootstrap code, this is the views/templated.py file.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import logging
import os

from flask import render_template, Blueprint, request, make_response
from werkzeug.utils import secure_filename

from pydrop.config import config

blueprint = Blueprint('templated', __name__, template_folder='templates')

log = logging.getLogger('pydrop')


@blueprint.route('/')
@blueprint.route('/index')
def index():
    # Route to serve the upload form
    return render_template('index.html',
                           page_name='Main',
                           project_name="pydrop")


@blueprint.route('/upload', methods=['POST'])
def upload():
    # Route to deal with the uploaded chunks
    log.info(request.form)
    log.info(request.files)
    return make_response(('ok', 200))

Run the Flask server and upload a small file (under the size of the chunk limit). It should log a single instance of a POST to /upload:

[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -

[INFO] pydrop: ImmutableMultiDict([
     ('dzuuid', '807f99b7-7f58-4d9b-ac05-2a20f5e53782'),
     ('dzchunkindex', '0'),
     ('dztotalfilesize', '1742'),
     ('dzchunksize', '1000000'),
     ('dztotalchunkcount', '1'),
     ('dzchunkbyteoffset', '0')])

[INFO] pydrop: ImmutableMultiDict([
     ('file', <FileStorage: 'README.md' ('application/octet-stream')>)])

Let's break down what information we are getting:

dzuuid – Unique identifier of the file being uploaded

dzchunkindex – Which chunk number we are currently on (starting at 0)

dztotalfilesize – The entire file's size

dzchunksize – The max chunk size set on the frontend (note this may be larger than the actual chunk's size)

dztotalchunkcount – The number of chunks to expect

dzchunkbyteoffset – The file offset at which we need to keep appending to the file being uploaded
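Since every one of these values arrives as a string, it can be handy to convert them once, up front. The following is only a sketch: the ChunkInfo class and its attribute names are invented for this example, and it accepts any mapping of strings (a plain dict here) rather than requiring Flask.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ChunkInfo:
    """Hypothetical container for the dz* form fields listed above."""
    uuid: str
    index: int
    total_size: int
    chunk_size: int
    total_chunks: int
    byte_offset: int

    @classmethod
    def from_form(cls, form: Mapping[str, str]) -> 'ChunkInfo':
        # All form values are strings, so numeric fields are converted here
        return cls(
            uuid=form['dzuuid'],
            index=int(form['dzchunkindex']),
            total_size=int(form['dztotalfilesize']),
            chunk_size=int(form['dzchunksize']),
            total_chunks=int(form['dztotalchunkcount']),
            byte_offset=int(form['dzchunkbyteoffset']),
        )

    @property
    def is_last(self) -> bool:
        # dzchunkindex starts at 0, so the last chunk is total_chunks - 1
        return self.index + 1 == self.total_chunks


# The same values that were logged for the small README.md upload
info = ChunkInfo.from_form({
    'dzuuid': '807f99b7-7f58-4d9b-ac05-2a20f5e53782',
    'dzchunkindex': '0',
    'dztotalfilesize': '1742',
    'dzchunksize': '1000000',
    'dztotalchunkcount': '1',
    'dzchunkbyteoffset': '0',
})
print(info.is_last)
```

In a Flask route, request.form could be passed in directly, since it behaves like a mapping of strings.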

Next, let's upload something just a bit larger that will require it to be chunked into multiple parts:

[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -

[INFO] pydrop: ImmutableMultiDict([
    ('dzuuid', 'b4b2409a-99f0-4300-8602-8becbef24c91'),
    ('dzchunkindex', '0'),
    ('dztotalfilesize', '1191708'),
    ('dzchunksize', '1000000'),
    ('dztotalchunkcount', '2'),
    ('dzchunkbyteoffset', '0')])

[INFO] pydrop: ImmutableMultiDict([
    ('file', <FileStorage: '04vfpknzx8z01.png' ('application/octet-stream')>)])


[INFO] werkzeug: 127.0.0.1 "POST /upload HTTP/1.1" 200 -

[INFO] pydrop: ImmutableMultiDict([
    ('dzuuid', 'b4b2409a-99f0-4300-8602-8becbef24c91'),
    ('dzchunkindex', '1'),
    ('dztotalfilesize', '1191708'),
    ('dzchunksize', '1000000'),
    ('dztotalchunkcount', '2'),
    ('dzchunkbyteoffset', '1000000')])

[INFO] pydrop: ImmutableMultiDict([
    ('file', <FileStorage: '04vfpknzx8z01.png' ('application/octet-stream')>)])

Notice how /upload has been called twice, and that the dzchunkindex and dzchunkbyteoffset have been updated accordingly. That means our upload function has to be smart enough to handle both new requests and existing multipart uploads. For existing uploads we should open the existing file and only write data after the data already in it, whereas for new uploads we will create a file and start at the beginning. Luckily, both can be accomplished with the same code. First open the file in append mode, then 'seek' to the end of the current data (in this case we are relying on the seek offset being provided by dropzone).

@blueprint.route('/upload', methods=['POST'])
def upload():
    # Remember the paramName was set to 'file', we can use that here to grab it
    file = request.files['file']

    # secure_filename makes sure the filename isn't dangerous to save
    save_path = os.path.join(config.data_dir, secure_filename(file.filename))

    # We need to append to the file, and write as bytes
    with open(save_path, 'ab') as f:
        # Go to the offset, aka after the chunks we already wrote
        f.seek(int(request.form['dzchunkbyteoffset']))
        f.write(file.stream.read())

    # Giving it a 200 means it knows everything is ok
    return make_response(('Uploaded Chunk', 200))

At this point you should have a working upload script, tada!
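A side note on the seek step: Python's documentation for open() points out that on some Unix systems a file opened in append mode writes to the end of the file regardless of the current seek position. For chunks that arrive in order the result is the same either way. If you would rather have the offset itself decide where the bytes land, one possible variation (a sketch, not the code used in this article; write_chunk is a made-up helper name) is to open the file in 'r+b' mode once it exists:

```python
import os
import tempfile


def write_chunk(path, offset, data):
    """Write data at the given byte offset, creating the file if needed.

    Hypothetical helper: uses 'r+b' for existing files because that mode
    honors seek(), unlike append mode on some platforms.
    """
    mode = 'r+b' if os.path.exists(path) else 'wb'
    with open(path, mode) as f:
        f.seek(offset)
        f.write(data)


# Simulate two chunks of one upload landing in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
write_chunk(demo_path, 0, b'hello ')
write_chunk(demo_path, 6, b'world')

with open(demo_path, 'rb') as f:
    demo_contents = f.read()
print(demo_contents)
```

The trade-off is that this version will happily overwrite bytes at any offset, so it relies even more on the client sending sensible offsets.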

But let's beef this up a little bit. The following code improvements make it so we don't overwrite existing files that have already been uploaded, check that the file size matches what we expect when we're done, and give a little more output along the way.

@blueprint.route('/upload', methods=['POST'])
def upload():
    file = request.files['file']

    save_path = os.path.join(config.data_dir, secure_filename(file.filename))
    current_chunk = int(request.form['dzchunkindex'])

    # If the file already exists it's ok if we are appending to it,
    # but not if it's a new file that would overwrite the existing one
    if os.path.exists(save_path) and current_chunk == 0:
        # 400 and 500s will tell dropzone that an error occurred and show an error
        return make_response(('File already exists', 400))

    try:
        with open(save_path, 'ab') as f:
            f.seek(int(request.form['dzchunkbyteoffset']))
            f.write(file.stream.read())
    except OSError:
        # log.exception will include the traceback so we can see what's wrong
        log.exception('Could not write to file')
        return make_response(("Not sure why,"
                              " but we couldn't write the file to disk", 500))

    total_chunks = int(request.form['dztotalchunkcount'])

    if current_chunk + 1 == total_chunks:
        # This was the last chunk, the file should be complete and the size we expect
        if os.path.getsize(save_path) != int(request.form['dztotalfilesize']):
            log.error(f"File {file.filename} was completed, "
                      f"but has a size mismatch. "
                      f"Was {os.path.getsize(save_path)} but we"
                      f" expected {request.form['dztotalfilesize']} ")
            return make_response(('Size mismatch', 500))
        else:
            log.info(f'File {file.filename} has been uploaded successfully')
    else:
        log.debug(f'Chunk {current_chunk + 1} of {total_chunks} '
                  f'for file {file.filename} complete')

    return make_response(("Chunk upload successful", 200))

Now let's give this a try:

[DEBUG] pydrop: Chunk 1 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 2 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 3 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 4 of 6 for file DSC_0051-1.jpg complete
[DEBUG] pydrop: Chunk 5 of 6 for file DSC_0051-1.jpg complete
[INFO] pydrop: File DSC_0051-1.jpg has been uploaded successfully

Sweet! But wait, what if we remove the directories where the files are stored? Or try to upload the same file again?

(Dropzone's text out of the box is a little hard to read, but it says "File already exists" on the left and "Not sure why, but we couldn't write the file to disk" on the right. Exactly what we'd expect.)

2018-05-28 14:29:19,311 [ERROR] pydrop: Could not write to file
Traceback (most recent call last):
    ....
FileNotFoundError: [Errno 2] No such file or directory:

We get error messages on the webpage and in the logs, perfect.

I hope you found this information useful, and if you have any suggestions on how to improve it, please let me know!

Thinking further down the road

In the long term I would have a database or some other permanent storage option to keep track of file uploads. That way you could see if one fails or stops halfway and be able to remove incomplete ones. I would also save files first into a temp directory based off their UUID and then, when complete, move them to a place based off their file hash. It would also be nice to have a page to see everything uploaded and manage directories or other options, or even password protected uploads.
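As a rough idea of what the temp-directory-then-hash approach could look like, here is a sketch using only the standard library. Everything in it is an assumption made for illustration: the helper names, the directory layout, and the choice of SHA-256 are not part of the pydrop code above.

```python
import hashlib
import os
import shutil
import tempfile


def temp_path_for(temp_root, upload_uuid, filename):
    """Return where chunks for one upload are collected (hypothetical layout)."""
    upload_dir = os.path.join(temp_root, upload_uuid)
    os.makedirs(upload_dir, exist_ok=True)
    # In a real route the filename should go through secure_filename first
    return os.path.join(upload_dir, filename)


def file_sha256(path, block_size=1024 * 1024):
    """Hash the file in blocks so large uploads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            digest.update(block)
    return digest.hexdigest()


def finalize_upload(temp_file, final_root):
    """Move a completed upload to a location named after its content hash."""
    file_hash = file_sha256(temp_file)
    os.makedirs(final_root, exist_ok=True)
    final_path = os.path.join(final_root, file_hash)
    shutil.move(temp_file, final_path)
    # Clean up the per-upload temp directory once the file has been moved
    shutil.rmtree(os.path.dirname(temp_file), ignore_errors=True)
    return final_path


# Demo in a throwaway directory
demo_root = tempfile.mkdtemp()
demo_temp = temp_path_for(os.path.join(demo_root, 'tmp'),
                          '807f99b7-7f58-4d9b-ac05-2a20f5e53782',
                          'README.md')
with open(demo_temp, 'wb') as f:
    f.write(b'example contents')
demo_final = finalize_upload(demo_temp, os.path.join(demo_root, 'complete'))
print(demo_final)
```

Keying the temp directory on the UUID keeps two simultaneous uploads of identically named files apart, while the hash-based final name means identical content ends up at the same path.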