Ingest new data
After you have installed and configured hila and ingested your initial batch of data, you need to ingest new data as it becomes available.
Prepare the new data for ingestion
hila does not check for or prevent duplicates, so you need to verify data integrity and remove duplicate records before you ingest the data.
The recommendation for data ingestion on existing tables is as follows:
- For the journal table: use the append approach, which is the default in the notebook, but make sure the delta data contains no duplicate transactions before appending.
- For each master data table:
  - If there is no new master data and no change to the existing master data, no action is required for that table.
  - If there is new master data or there are changes (such as new names), replace the master data table with the latest version. Make sure the latest master data has no data integrity issues.
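The duplicate check for the journal delta could be sketched as follows. This is a minimal illustration, not part of the hila notebook: the helper name `dedupe_delta`, the `transaction_id` key column, and the sample rows are all assumptions, so substitute whatever uniquely identifies a transaction in your own journal data.

```python
import csv
import io


def dedupe_delta(delta_rows, loaded_rows, key_columns):
    """Return delta rows whose key is neither already loaded nor repeated in the delta.

    Hypothetical helper for illustration; rows are dicts as produced by csv.DictReader.
    """
    seen = {tuple(row[col] for col in key_columns) for row in loaded_rows}
    cleaned = []
    for row in delta_rows:
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            continue  # already in the loaded data or earlier in this delta
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical sample data; real journal files will have different columns.
loaded_csv = "transaction_id,amount\nT1,100\nT2,250\n"
delta_csv = "transaction_id,amount\nT2,250\nT3,75\nT3,75\n"

loaded = list(csv.DictReader(io.StringIO(loaded_csv)))
delta = list(csv.DictReader(io.StringIO(delta_csv)))

cleaned = dedupe_delta(delta, loaded, ["transaction_id"])
print(len(delta), "delta rows,", len(cleaned), "left after removing duplicates")
```

Running a check like this before uploading the delta files keeps the append step from adding transactions that are already in the journal table.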
Ingest the new data
- In edahub, navigate to `notebooks/public/hila_dataloading/hila_cf/<CURRENT_RELEASE>`.
- Create a new directory and upload your data into it.
- Open the notebook `hila_cf_e2e_REST.ipynb`.
  Note: This is the same notebook you used for the initial data ingestion. The rest of this procedure shows you how to modify the notebook to ingest new data.
- In the Login cell, enter your username and password.
- In the Set variables cell, set `INCREMENTAL_LOAD` to `True`.
- Also in the Set variables cell, make sure `db_name` matches the database name you used for the initial data ingestion.
- In the Ingest data cell, change the value of `file_src_dir` to match the directory you created for the new data.
- If you uploaded CSV files rather than parquet, then in the Convert Parquet to CSV files (Optional) cell, comment out the line `convert_files_to_csv(file_src_dir)`. This cell is only needed if you uploaded parquet files and want to convert them to CSV.
  - If you uploaded parquet files and want to use them directly as parquet files, then comment out this cell.
- In the cell that contains the `dataloading_payload` object, you can choose to append the new data to the database (default) or replace the data in the database. To replace the data, change the value of `overwrite` from `append` to `replace`.
  - Earlier cells in the notebook set the file type to `csv` or `parquet` based on the file extension of the data you loaded. To set it explicitly instead, change the value of `filetype` in the `dataloading_payload` object to either `csv` or `parquet`, depending on which type of files you uploaded.
- Run the notebook.
- When the notebook finishes, go to the hila UI, refresh your browser, and ask questions against the new data. Make sure you select the same data source that you used for the initial upload.
  Note: To test the data ingestion, ask the question ‘How many rows in the journal table?’ before and after the ingestion. If you set `overwrite` to `append`, the row count should increase by the number of new rows you ingested. If you set `overwrite` to `replace`, the row count equals the number of rows you added in this latest ingestion.
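The settings touched in this procedure, and the row-count check from the note, can be summarized in a short sketch. It is illustrative only and does not reproduce the notebook's code: the helper names `infer_filetype` and `expected_row_count`, the directory and database names, and the file names are hypothetical, and it assumes that `overwrite` and `filetype` are keys of the `dataloading_payload` object, which may contain other keys as well.

```python
from pathlib import Path


def infer_filetype(filenames):
    """Return 'csv' or 'parquet' based on the extensions of the given file names."""
    extensions = {Path(name).suffix.lower().lstrip(".") for name in filenames}
    if extensions == {"csv"}:
        return "csv"
    if extensions == {"parquet"}:
        return "parquet"
    raise ValueError(f"Expected only .csv or only .parquet files, got: {sorted(extensions)}")


def expected_row_count(rows_before, rows_ingested, overwrite):
    """Row count the journal table should report after the ingestion."""
    if overwrite == "append":
        return rows_before + rows_ingested
    if overwrite == "replace":
        return rows_ingested
    raise ValueError(f"overwrite must be 'append' or 'replace', got: {overwrite!r}")


# Hypothetical values for illustration only.
INCREMENTAL_LOAD = True
db_name = "my_initial_db"          # must match the initial ingestion
file_src_dir = "new_data_upload"   # directory created for the new data

# Assumed keys: only `overwrite` and `filetype` are named in the procedure.
payload_sketch = {
    "overwrite": "append",
    "filetype": infer_filetype(["journal.csv", "accounts.csv"]),
}

print(payload_sketch["filetype"])                                  # csv
print(expected_row_count(1000, 50, payload_sketch["overwrite"]))   # 1050
```

With `overwrite` set to `replace`, the same call would return 50, which matches the behavior described in the note.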