Ingest new data

After you have installed and configured hila and ingested your initial batch of data, you need to ingest new data as it becomes available.

Prepare the new data for ingestion

hila does not prevent duplicate records, so you need to check data integrity and remove duplicates before you ingest the data.

The recommendation for data ingestion on existing tables is as follows:

For the journal table: use the append approach, which is the default in the notebook, but make sure the delta data contains no duplicate transactions before you append it.
For each master data table:
- If there is no new master data and no change to the existing master data, no action is required for that table.
- If there is new master data or there are changes (such as new names), replace the master data table with the latest version. Make sure the latest master data has no data integrity issues.
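The duplicate check for the journal delta can be sketched in plain Python. This is a minimal illustration, not part of the hila notebook: the function name, the `transaction_id` key column, and the idea of comparing against a set of already-loaded keys are all assumptions made for the example. Use whichever column or combination of columns uniquely identifies a transaction in your data.

```python
import csv
import io


def drop_duplicate_transactions(delta_rows, existing_keys, key_field="transaction_id"):
    """Return delta rows whose key is neither repeated in the delta nor already loaded.

    key_field is a hypothetical column name; adjust it to your journal schema.
    """
    seen = set(existing_keys)
    unique_rows = []
    for row in delta_rows:
        key = row[key_field]
        if key in seen:
            continue  # duplicate within the delta, or already in the journal table
        seen.add(key)
        unique_rows.append(row)
    return unique_rows


# Small in-memory example standing in for a delta CSV file.
delta_csv = io.StringIO(
    "transaction_id,amount\n"
    "T3,10.00\n"
    "T4,25.50\n"
    "T4,25.50\n"
    "T2,99.00\n"
)
delta_rows = list(csv.DictReader(delta_csv))
already_loaded = {"T1", "T2"}  # keys from the initial load (hypothetical values)

clean_rows = drop_duplicate_transactions(delta_rows, already_loaded)
print([row["transaction_id"] for row in clean_rows])  # ['T3', 'T4']
```

The first occurrence of each key is kept and later occurrences are dropped. If your duplicates can differ in other columns, decide which version to keep before you run a check like this.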

Ingest the new data

Open edahub.
In edahub, navigate to notebooks/public/hila_dataloading/hila_cf/<CURRENT_RELEASE>.
Create a new directory and upload your data into it.
Open the notebook hila_cf_e2e_REST.ipynb.

Note: This is the same notebook you used for the initial data ingestion. The rest of this procedure shows you how to modify the notebook to ingest new data.
In the Login cell, enter your username and password.
In the Set variables cell, set INCREMENTAL_LOAD to True.
Also in the Set variables cell, make sure db_name matches the database name you used for the initial data ingestion.
In the Ingest data cell, change the value of file_src_dir to match the directory you created for the new data.
If you uploaded CSV files rather than parquet files, then in the Convert Parquet to CSV files (Optional) cell, comment out the line convert_files_to_csv(file_src_dir). This cell is needed only if you uploaded parquet files and want to convert them to CSV.
- If you uploaded parquet files and want to ingest them directly as parquet, comment out this cell as well.
In the cell that contains the dataloading_payload object, you can choose to append the new data to the database (default) or replace the data in the database. To replace the data, change the value append to replace.
- Earlier cells in the notebook set the filetype to csv or parquet based on the file extension of the data you uploaded. To set it explicitly instead, change the value of filetype in the dataloading_payload object to csv or parquet, depending on which type of files you uploaded.
Run the notebook.
When the notebook finishes, go to the hila UI and refresh your browser. You can then ask questions against the new data, provided that you select the same data source you used for the initial upload.
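The file type detection and load mode described in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: the helper names `infer_filetype` and `build_payload_settings` are hypothetical, and the real dataloading_payload object may contain other fields or use a different structure. Only the names filetype and overwrite, and the values csv, parquet, append, and replace, come from this procedure.

```python
import tempfile
from pathlib import Path


def infer_filetype(file_src_dir):
    """Return 'csv' or 'parquet' based on the file extensions in the directory."""
    extensions = {
        p.suffix.lower() for p in Path(file_src_dir).iterdir() if p.is_file()
    }
    found = extensions & {".csv", ".parquet"}
    if len(found) != 1:
        # Either no supported files, or a mix of CSV and parquet files.
        raise ValueError(
            f"Expected only CSV or only parquet files, found: {sorted(extensions)}"
        )
    return found.pop().lstrip(".")


def build_payload_settings(file_src_dir, replace=False):
    """Hypothetical subset of the payload: load mode plus detected file type."""
    return {
        "overwrite": "replace" if replace else "append",
        "filetype": infer_filetype(file_src_dir),
    }


# Demonstration with a temporary directory standing in for file_src_dir.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "journal.csv").write_text("id,amount\n1,10\n")
    settings = build_payload_settings(tmp)

print(settings)  # {'overwrite': 'append', 'filetype': 'csv'}
```

Raising an error when a directory mixes CSV and parquet files is a design choice for this sketch. It surfaces an ambiguous upload before the ingestion runs rather than silently picking one type.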

Note: To test the data ingestion, ask the question ‘How many rows in the journal table?’ before and after the ingestion. If you set overwrite to append, the number of rows should increase by the number of new rows you ingested. If you set overwrite to replace, the count should equal the number of rows in this latest ingestion.
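The row count check in the note comes down to simple arithmetic, sketched here with a hypothetical helper. The function name and the example numbers are assumptions for illustration. The before and after counts themselves come from asking the question in the hila UI.

```python
def expected_row_count(rows_before, rows_ingested, overwrite="append"):
    """Return the journal row count you should see after the ingestion."""
    if overwrite == "append":
        return rows_before + rows_ingested
    if overwrite == "replace":
        return rows_ingested
    raise ValueError(f"Unknown overwrite mode: {overwrite!r}")


# Example: 1000 rows before the ingestion, 250 rows in the new batch.
print(expected_row_count(1000, 250, overwrite="append"))   # 1250
print(expected_row_count(1000, 250, overwrite="replace"))  # 250
```

If the count after an append is lower than expected, one possible cause is that the new files contained fewer rows than you assumed, for example because you removed duplicates before uploading them.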
