File Definition Concepts

When are File Definitions required?

Some users may never need file definitions, because all of their data is transferred between databases and other non-file-based storage.

File definitions are required when you need to work with data in files rather than in database tables or APIs. Some common use cases for file definitions include:

Working with flat files stored on a local server

This is the typical use case for a File System file definition, which lets you work with flat files stored on the same server as the agent.

Take the following example.

[Image: file example]

To use the data stored in these CSV files, create a file definition with the following values:

| Field | Value | Explanation |
|---|---|---|
| File Definition Types | File System | We are working with files stored on the local file system. |
| Path | C:\CSVData | This is the folder that contains all the CSV files we want to query. |
| File Format | Delimited | CSV files are delimited, with a comma (,) as the delimiter. |
| Delimiter | , | This is the value that separates cells in a CSV file. |

[Image: file definitions]

Once this is configured, you can use the file definition as a source in data migrations. Each file is treated essentially like a table in a database schema.

[Image: file definition migration]
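To make the "folder as schema, file as table" idea concrete, here is a minimal Python sketch. It is not how Data Governor is implemented; the function name `load_folder_as_tables` is hypothetical, and treating the first line of each file as column headers is an assumption made for illustration.

```python
import csv
from pathlib import Path


def load_folder_as_tables(folder, delimiter=","):
    """Read every .csv file in `folder` into a dict keyed by file name stem.

    Each value is a list of row dicts. The first line of each file is
    assumed to hold the column names (an assumption for this sketch).
    """
    tables = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # newline="" lets the csv module handle line endings itself.
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle, delimiter=delimiter)
            tables[path.stem] = list(reader)
    return tables
```

With the values from the example above, a call such as `load_folder_as_tables(r"C:\CSVData", delimiter=",")` would return one entry per CSV file in the folder, which mirrors how each file appears as a separate source table in a data migration.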

File System

File System file definitions are used for working with traditional file storage, such as a server's hard disk.

When you need Data Governor to process files stored on the same host as the agent, use the File System type.

Azure Blob Folder

Azure Blob Folders are file definitions that can be used to pull files out of a folder in an Azure Blob container.

These are used essentially as the “schema” to an Azure Blob Connection.
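As background, in standard Azure Blob Storage a "folder" is usually a shared prefix in the blob name rather than a separate object. The sketch below illustrates that idea in plain Python, without calling any Azure SDK. The function `blobs_in_folder` is hypothetical, and excluding nested sub-folders is an assumption for this example rather than documented Data Governor behaviour.

```python
def blobs_in_folder(blob_names, folder):
    """Return the blob names that sit directly inside `folder`.

    Blob names are plain strings; a "folder" is treated as a name prefix
    ending in "/". Blobs in nested sub-folders are skipped (an assumption).
    """
    cleaned = folder.strip("/")
    prefix = cleaned + "/" if cleaned else ""
    selected = []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        # Keep only names with no further "/" after the folder prefix.
        if remainder and "/" not in remainder:
            selected.append(name)
    return selected
```

For example, given blobs named `sales/2023.csv`, `sales/archive/2019.csv`, and `hr/staff.csv`, calling `blobs_in_folder(names, "sales")` selects only `sales/2023.csv`.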

HDFS/DBFS (Hadoop or Databricks)

When working with big data platforms such as Hadoop and Databricks, there is a need to upload data to the file systems so that the data processing platforms can perform operations and queries on it.

The Hadoop File System and Databricks File System (HDFS/DBFS) file definition type is used to define target connections for data migrations, where the target connection is one of the previously mentioned big data platforms.

Currently, HDFS/DBFS file definitions only support uploading delimited flat files to the targets. Support for binary file formats such as Apache Parquet is planned for future releases.
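For readers unfamiliar with the term, the sketch below shows what a delimited flat file looks like when produced from tabular rows. It only builds the file content and does not perform any upload; the function name `to_delimited_text` and the inclusion of a header line are assumptions for illustration, not a description of what Data Governor writes.

```python
import csv
import io


def to_delimited_text(rows, columns, delimiter=","):
    """Serialize a list of row dicts to delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    # Values containing the delimiter are quoted automatically.
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_delimited_text(rows, ["id", "name"], delimiter="|")` produces pipe-delimited text instead of comma-delimited text, which corresponds to changing the Delimiter value on a file definition.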