Apparate¶
The apparate CLI is your point of contact for managing continuous delivery of Python packages for use in Databricks.
Configure¶
To get started, configure your Databricks account information. You’ll need your Databricks account connection info, and you will also be asked to name a production folder. To learn more about how these values will be used and where to find this information, check out the Getting Started page.
When you’re ready to go, run apparate configure.
$ apparate configure --help
Usage: apparate configure [OPTIONS]
Configure information about Databricks account and default behavior.
Configuration is stored in a `.apparatecfg` file. A config file must exist
before this package can be used, and can be supplied either directly as a
text file or generated using this configuration tool.
Options:
--help Show this message and exit.
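If you’d rather write the config file by hand than run apparate configure, .apparatecfg is a plain text file. A minimal sketch, assuming an INI-style layout with host, token, and production-folder entries (the exact key names and file location may differ; the configure command will always generate the correct format for you):

[DEFAULT]
# all values below are illustrative placeholders
host = https://my-organization.cloud.databricks.com
token = dapi0123456789abcdef
prod_folder = /Users/my_email@fake_organization.com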
Now you’re all set to start using apparate! The two main commands available in apparate are upload and upload_and_update.
Upload¶
upload can be used anytime by anyone and promises not to break anything. It simply uploads an egg file, and will throw an error if a file with the same name already exists.
If you’ve set up your .apparatecfg file using the configure command, you only need to provide a path to the .egg file, but you can also override the default API token and destination folder if desired.
If you try to upload a library that already exists in Databricks with the same version, a warning will be printed instructing you to update the version if a change has been made. Without a version change, the new library will not be uploaded.
This command will print out a message letting you know the name of the egg that was uploaded.
$ apparate upload --help
Usage: apparate upload [OPTIONS]
The egg that the provided path points to will be uploaded to Databricks.
Options:
-p, --path TEXT path to egg file with name as output from setuptools
(e.g. dist/new_library-1.0.0-py3.6.egg) [required]
-t, --token TEXT Databricks API key - optional, read from `.apparatecfg`
if not provided
-f, --folder TEXT Databricks folder to upload to (e.g.
`/Users/my_email@fake_organization.com`) - optional, read
from `.apparatecfg` if not provided
--help Show this message and exit.
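For example, assuming your package builds an egg at dist/new_library-1.0.0-py3.6.egg (the path and folder below are illustrative), you might run:

$ apparate upload -p dist/new_library-1.0.0-py3.6.egg
$ apparate upload -p dist/new_library-1.0.0-py3.6.egg -f /Users/my_email@fake_organization.com

The first call reads the token and destination folder from .apparatecfg; the second overrides the destination folder.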
Upload and Update¶
upload_and_update requires a token with admin-level permissions. It does have the capacity to delete libraries, but when used in a CI/CD system it will not cause any issues. For advice on how to set this up, check out the Getting Started page.
Used with default settings, upload_and_update will start by uploading the .egg file. It will then find all jobs that use the same major version of the library and update them to point to the new version. Finally, it will clean up outdated versions in the production folder. No libraries in any other folders will ever be deleted.
If you’re nervous about deleting files, you can always use the --no-cleanup flag and no files will be deleted or overwritten. If you’re confident in your CI/CD system, however, leaving cleanup set to True will keep your production folder tidy, with only the most current version of each major release of each library. For example, if the production folder holds new_library versions 1.0.0 and 1.1.0 and you upload 1.2.0, the two older 1.x eggs are removed, while any 2.x versions are left untouched.
This command will print out a message letting you know (1) the name of the egg that was uploaded, (2) the list of jobs currently using the same major version of this library, (3) the list of jobs updated, which should match (2), and (4) any old versions removed, assuming you haven’t used the --no-cleanup flag.
As with upload, if you try to upload a library that already exists in Databricks with the same version, a warning will be printed instructing you to update the version if a change has been made. Without a version change, the new library will not be uploaded.
$ apparate upload_and_update --help
Usage: apparate upload_and_update [OPTIONS]
The egg that the provided path points to will be uploaded to Databricks.
All jobs which use the same major version of the library will be updated
to use the new version, and all version of this library in the production
folder with the same major version and a lower minor version will be
deleted.
Unlike `upload`, `upload_and_update` does not ask for a folder because it
relies on the production folder specified in the config. This is to
protect against accidentally updating jobs to versions of a library still
in testing/development.
All egg names already in Databricks must be properly formatted with
versions of the form <name>-0.0.0.
Options:
-p, --path TEXT path to egg file with name as output from
setuptools (e.g. dist/new_library-1.0.0-py3.6.egg)
[required]
-t, --token TEXT Databricks API key with admin permissions on all
jobs using library - optional, read from
`.apparatecfg` if not provided
--cleanup / --no-cleanup if cleanup, remove outdated files from production
folder; if no-cleanup, remove nothing [default:
True]
--help Show this message and exit.
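As a sketch (the egg path here is illustrative), a CI/CD job might run:

$ apparate upload_and_update -p dist/new_library-1.1.0-py3.6.egg

or, to skip the deletion step while you gain confidence in your setup:

$ apparate upload_and_update -p dist/new_library-1.1.0-py3.6.egg --no-cleanup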
For more info about usage, check out the Tutorials.