Ingest custom data

If the data you want to ingest does not follow the standard Conversational Finance (CF) schema, you can ingest it using a custom format.

As part of ingestion, you can build a pipeline of transformations to modify your data. The pipeline is not a full ETL tool, but it can make minor adjustments so that your data is consistent with your format. This is particularly useful when you have already ingested your initial data and follow-up data arrives in a different format.
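As a rough illustration of the idea, the sketch below chains three small transformations over rows of data: standardizing column names, removing whitespace from column names, and adding a datetime column. It is plain Python, not hila's preprocessing code. The function names, the `record_date` and `record_datetime` column names, and the date format are all assumptions made for this example.

```python
from datetime import datetime

def standardize_column_names(rows):
    """Lowercase every column name."""
    return [{key.lower(): value for key, value in row.items()} for row in rows]

def remove_whitespace_from_column_names(rows):
    """Trim outer whitespace and join inner words with underscores."""
    return [{"_".join(key.split()): value for key, value in row.items()} for row in rows]

def add_datetime_column(rows, source="record_date", target="record_datetime"):
    """Parse a YYYY-MM-DD string (assumed format) into a datetime column."""
    return [{**row, target: datetime.strptime(row[source], "%Y-%m-%d")} for row in rows]

# Hypothetical pipeline: an ordered list of steps applied one after another.
PIPELINE = [
    standardize_column_names,
    remove_whitespace_from_column_names,
    add_datetime_column,
]

def run_pipeline(rows, steps=PIPELINE):
    """Feed the output of each step into the next one."""
    for step in steps:
        rows = step(rows)
    return rows

# Illustrative input only; the values are made up.
raw = [{" Record Date ": "2023-09-30", "Net Cost": "12.5"}]
clean = run_pipeline(raw)
print(sorted(clean[0]))
```

The order of the steps matters here: the datetime step looks up `record_date`, which only exists after the two renaming steps have run.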

hila has a sample notebook that you can run to see how custom ingestion works. It uses the Financial Report of the U.S. Government as its data, which is entirely different from the standard CF schema. You can modify the notebook to ingest your own data.

Steps

Visit Financial Report of the U.S. Government.
1. Select 1 Year under Date Range (Record Date).
2. Click Download CSV File.
Open the notebook directory in edahub.
1. Open edahub as described in Open edahub.
2. Navigate to the directory that contains the notebook: /notebooks/public/hila_dataloading/hila_cf/sample
Upload the data.
1. Create a new directory for the data called usfr_data.
2. Click into the new directory.
3. Click the Upload files icon at the top of the left sidebar.
4. Navigate to the directory where you downloaded the data and select the file. The file is named something like USFR_StmtNetCost_<date_info>.csv.
5. Rename the file to usfr.csv. Note: you can use a different name, but hila expects the filename to contain only lowercase letters and underscores.
6. Navigate up a level to the sample directory.
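If you choose a name other than usfr.csv, a quick check like the following can catch a filename that breaks the lowercase-and-underscore rule before you run the notebook. This is a minimal sketch, not part of hila. The helper name is hypothetical, and it assumes the rule applies to the part of the name before a .csv extension.

```python
import re

# Assumption: the name before ".csv" may contain only lowercase letters and underscores.
FILENAME_PATTERN = re.compile(r"[a-z_]+\.csv")

def is_valid_data_filename(name):
    """Return True if the whole filename matches the assumed naming rule."""
    return FILENAME_PATTERN.fullmatch(name) is not None

for candidate in ["usfr.csv", "usfr_net_cost.csv", "USFR_StmtNetCost.csv", "usfr-2023.csv"]:
    print(candidate, is_valid_data_filename(candidate))
```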
Prepare the pipeline and table_structure files.
1. Open the pipeline.json file.
  - This file is a list of objects that define the transformations to apply to the data.
  - For this sample exercise, the three transformations create standard column names, remove whitespace from column names, and create a standard datetime column.
  - You can find other available transformations in edahub under /source/vianai/preprocessing. You can modify these for your own use or create your own.
2. Open the table_structure.json file.
  - This file defines the schema of the data as a list of objects that define the column name, data type, and other parameters.
  - This file matches the columns and data types of the USFR data. You can modify it to match your own data.
  - The data types can be string, int, float, and date32.
  - The “default” and “initial” keys are relevant only when “required” is true. These values are placeholders in case the column is missing in the data source during ingest.
    - “default” is the value to populate the column if the column is missing.
    - “initial” is the name of the missing column, in case it is different from the one already defined as the key to the object.
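Before running the notebook with a modified table_structure.json, it can help to check the column definitions against the rules above. The sketch below is only an illustration. The `name` and `type` key names and the overall JSON layout are assumptions, so compare them with the sample file before reusing this. Only the allowed data types and the `required`, `default`, and `initial` keys come from the description above.

```python
import json

ALLOWED_TYPES = {"string", "int", "float", "date32"}

def check_column(column):
    """Return a list of warnings for one column definition."""
    warnings = []
    name = column.get("name")  # "name" is an assumed key
    if column.get("type") not in ALLOWED_TYPES:  # "type" is an assumed key
        warnings.append(f"{name}: unsupported type {column.get('type')!r}")
    if not column.get("required", False):
        # "default" and "initial" are relevant only when "required" is true.
        for key in ("default", "initial"):
            if key in column:
                warnings.append(f"{name}: {key!r} is ignored unless 'required' is true")
    return warnings

# Hypothetical file content, written inline so the example is self-contained.
SAMPLE = """
[
  {"name": "record_date", "type": "date32", "required": true, "default": "1970-01-01"},
  {"name": "net_cost", "type": "double", "required": false, "default": 0}
]
"""

columns = json.loads(SAMPLE)
warnings = [warning for column in columns for warning in check_column(column)]
for warning in warnings:
    print(warning)
```

In this made-up sample, the second column triggers both warnings: `double` is not one of the listed data types, and `default` is set although `required` is false.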
Run the notebook.
1. Open the notebook sample_USFR_REST.ipynb.
2. In the cell labeled Login, enter your username and password.
3. Run the notebook.
4. Verify that the notebook ran successfully by checking the output of the last cell.
Ask questions.
1. Open the hila UI application.
2. Select the dataset you created with this procedure.
3. Ask questions in the conversation window.
