Apparate

The apparate CLI is your point of contact for managing continuous delivery of Python packages for use in Databricks.

Configure

To get started, configure your Databricks account information. You’ll need your Databricks account connection info, and you will also be asked to name a production folder. To learn more about how these values will be used and where to find this information, check out the Getting Started page.

When you’re ready to go, run apparate configure.

$ apparate configure --help
Usage: apparate configure [OPTIONS]

  Configure information about Databricks account and default behavior.

  Configuration is stored in a `.apparatecfg` file. A config file must exist
  before this package can be used, and can be supplied either directly as a
  text file or generated using this configuration tool.

Options:
  --help  Show this message and exit.
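If you prefer, you can also write the .apparatecfg file by hand instead of answering the prompts. A minimal sketch is shown below; the host, token, and folder values are placeholders, and the exact key names are assumptions for illustration, so check them against the file that apparate configure generates.

[DEFAULT]
host = https://my_organization.cloud.databricks.com
token = dapi0123456789abcdef
prod_folder = /Libraries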

Now you’re all set to start using apparate! The two main commands available in apparate are upload and upload_and_update.

Upload

upload can be used anytime by anyone and promises not to break anything. It simply uploads an egg file, and will throw an error if a file with the same name already exists.

If you’ve set up your .apparatecfg file using the configure command, you only need to provide a path to the .egg file, but you can also override the default API token and destination folder if desired.

If you try to upload a library that already exists in Databricks with the same version, a warning is printed instructing you to update the version if a change has been made. Without a version change, the new library will not be uploaded.

This command will print out a message letting you know the name of the egg that was uploaded.

$ apparate upload --help
Usage: apparate upload [OPTIONS]

  The egg that the provided path points to will be uploaded to Databricks.

Options:
  -p, --path TEXT    path to egg file with name as output from setuptools
                     (e.g. dist/new_library-1.0.0-py3.6.egg)  [required]
  -t, --token TEXT   Databricks API key - optional, read from `.apparatecfg`
                     if not provided
  -f, --folder TEXT  Databricks folder to upload to (e.g.
                     `/Users/my_email@fake_organization.com`) - optional, read
                     from `.apparatecfg` if not provided
  --help             Show this message and exit.
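For example, with the default token and folder coming from .apparatecfg, a typical call only needs the path (the library name and version below are placeholders):

$ apparate upload -p dist/new_library-1.0.0-py3.6.egg

To push to a folder other than your configured default, pass -f explicitly:

$ apparate upload -p dist/new_library-1.0.0-py3.6.egg -f /Users/my_email@fake_organization.com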

Upload and Update

upload_and_update requires a token with admin-level permissions. It does have the capacity to delete libraries, but when used in a CI/CD system it will not cause any issues. For advice on how to set this up, check out the Getting Started page.

Used with default settings, upload_and_update will start by uploading the .egg file. It will then find all jobs that use the same major version of the library and update them to point to the new version. Finally, it will clean up outdated versions in the production folder. No libraries in any other folders will ever be deleted.

If you’re nervous about deleting files, you can always use the --no-cleanup flag and no files will be deleted or overwritten. If you’re confident in your CI/CD system, however, leaving cleanup enabled (the default) will keep your production folder tidy, with only the most current version of each major release of each library.

This command will print out a message letting you know (1) the name of the egg that was uploaded, (2) the list of jobs currently using the same major version of this library, (3) the list of jobs updated, which should match (2), and (4) any old versions removed, unless you used the --no-cleanup flag.

As with upload, if you try to upload a library that already exists in Databricks with the same version, a warning is printed instructing you to update the version if a change has been made. Without a version change, the new library will not be uploaded.

$ apparate upload_and_update --help
Usage: apparate upload_and_update [OPTIONS]

  The egg that the provided path points to will be uploaded to Databricks.
  All jobs which use the same major version of the library will be updated
  to use the new version, and all versions of this library in the production
  folder with the same major version and a lower minor version will be
  deleted.

  Unlike `upload`, `upload_and_update` does not ask for a folder because it
  relies on the production folder specified in the config. This is to
  protect against accidentally updating jobs to versions of a library still
  in testing/development.

  All egg names already in Databricks must be properly formatted with
  versions of the form <name>-0.0.0.

Options:
  -p, --path TEXT           path to egg file with name as output from
                            setuptools (e.g. dist/new_library-1.0.0-py3.6.egg)
                            [required]
  -t, --token TEXT          Databricks API key with admin permissions on all
                            jobs using library - optional, read from
                            `.apparatecfg` if not provided
  --cleanup / --no-cleanup  if cleanup, remove outdated files from production
                            folder; if no-cleanup, remove nothing  [default:
                            True]
  --help                    Show this message and exit.
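As a sketch of how this might look in a CI/CD step (the library name and version are placeholders), the default call uploads the egg, updates matching jobs, and cleans up old versions in one go:

$ apparate upload_and_update -p dist/new_library-1.0.1-py3.6.egg

If you would rather keep every version around while you build trust in the pipeline, add the --no-cleanup flag:

$ apparate upload_and_update -p dist/new_library-1.0.1-py3.6.egg --no-cleanup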

For more info about usage, check out the Tutorials.