NFHS-4
Domain Name | Health |
Files Shared | 5 |
Sheets Shared | 10 |
Files Ingested | 5 |
Sheets Ingested | 10 |
Ingestion % | 100% |
Landing Tables | 26 |
Staging Tables | 10 |
Average Rating | ** (Difficult) |
Processing Error Rate | 10% |
Record Error Rate | 20% |
File Format | .DCT, .DAT |
LGD Code Included | YES |
Raw data S3 Path | NFHS4/ |
Pipeline Path | NFHS4/ |
From the given raw data, we ingested 26 tables into landing and consolidated them into 10 staging tables. Challenges faced: some attributes in the CSV files are not present in the map files; the number of columns is high; the data is at household granularity; and some columns with “_” in their names need transformation and will be combined into one when loaded into staging, which will reduce the error rate.
NFHS-5
Domain Name | Health |
Files Shared | 2 |
Sheets Shared | 4 |
Files Ingested | 2 |
Sheets Ingested | 4 |
Ingestion % | 100% |
Landing Tables | 10 |
Staging Tables | 3 |
Average Rating | * (Very Difficult) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | .MAP, .csv |
LGD Code Included | YES |
Raw data S3 Path | NFHS5/ |
Pipeline Path | NFHS5/ |
From the given raw data, we ingested 10 tables into landing and consolidated them into 3 staging tables. Challenges faced: file sizes are large; the number of columns is high; the data is at household granularity; and some columns with “_” in their names need transformation and will be combined into one when loaded into staging, which will reduce the error rate.
PDS – Ahara kfcsc
Domain Name | Health |
Files Shared | 503 |
Sheets Shared | 11,220 |
Files Ingested | 503 |
Sheets Ingested | 11,220 |
Ingestion % | 100% |
Landing Tables | 10396 |
Staging Tables | 0 |
Average Rating | *** (Medium) |
Processing Error Rate | 0% |
Record Error Rate | In Progress |
File Format | Excel |
LGD Code Included | NO |
Raw data S3 Path | Ahara-Kfcsc/ |
Pipeline Path | PL_P0_AHARA.ipynb |
From the given raw data, we ingested 10396 tables into landing. Challenges faced: the Ahara files have multiple sheets with unusual text added in the headings; filenames do not follow proper naming conventions; and the Bangalore North file needed manual intervention to clean.
PDS – Malnutrition data
Domain Name | Health |
Files Shared | 3 |
Sheets Shared | 3 |
Files Ingested | 3 |
Sheets Ingested | 3 |
Ingestion % | 100% |
Landing Tables | 4 |
Staging Tables | 3 |
Average Rating | ***** (Very Easy) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | rawdatamalnutrition/ |
Pipeline Path | PL_P0_MALNUTRITION.ipynb |
From the given raw data, we ingested 4 tables into landing and consolidated them into 3 staging tables. Challenges faced: empty columns, differing columns, and taluk spellings that differ from those in the LGD codes file.
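One way to reconcile taluk names whose spelling differs from the LGD codes file is approximate string matching. The sketch below uses Python's standard `difflib`; the function names, the `cutoff` value, and the sample taluk names are illustrative assumptions, not the pipeline's actual logic or data.

```python
import difflib


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def match_taluk(name, lgd_names, cutoff=0.8):
    """Return the closest LGD taluk name, or None if nothing is similar enough."""
    lookup = {normalize(n): n for n in lgd_names}
    hits = difflib.get_close_matches(normalize(name), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# Hypothetical sample values, not taken from the actual files.
lgd_taluks = ["Channapatna", "Ramanagara", "Kanakapura"]
print(match_taluk("Chanapatna", lgd_taluks))  # closest LGD spelling
print(match_taluk("Mysuru", lgd_taluks))      # no sufficiently close match
```

Names that return `None` would still need manual review, and the cutoff would have to be tuned against the real LGD list.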
Karnataka at a Glance 2020-21
Domain Name | Health |
Files Shared | 16 |
Sheets Shared | 16 |
Files Ingested | 16 |
Sheets Ingested | 16 |
Ingestion % | 100% |
Landing Tables | 0 |
Staging Tables | 11 |
Average Rating | *** (Medium) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel (.xls) |
LGD Code Included | NO |
Raw data S3 Path | KAG_2020_21/ |
Pipeline Path | KAG_2020_21/ |
From the given raw data, we consolidated 11 tables into staging. Challenges faced: multi-level header titles and separating Kannada words from English.
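Separating Kannada words from English in a bilingual label can be sketched with a regular expression over the Kannada Unicode block (U+0C80 to U+0CFF). The function name and the sample label are assumptions for illustration; the actual notebooks may handle this differently.

```python
import re

# Runs of characters from the Kannada Unicode block.
KANNADA = re.compile(r"[\u0C80-\u0CFF]+")


def split_bilingual(text):
    """Split a bilingual label into (english, kannada) parts."""
    kannada = " ".join(KANNADA.findall(text))
    # Replace Kannada runs with spaces, then collapse the leftover whitespace.
    english = " ".join(KANNADA.sub(" ", text).split())
    return english, kannada


# Hypothetical header cell mixing both scripts.
english, kannada = split_bilingual("ಜಿಲ್ಲೆ District")
print(english)
print(kannada)
```

Separators such as "/" or brackets between the two scripts stay in the English part and would need additional stripping.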
Karnataka at a Glance 2020-21
Domain Name | Education |
Files Shared | 23 |
Sheets Shared | 23 |
Files Ingested | 23 |
Sheets Ingested | 23 |
Ingestion % | 100% |
Landing Tables | 0 |
Staging Tables | 15 |
Average Rating | *** (Medium) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel (.xls) |
LGD Code Included | NO |
Raw data S3 Path | KAG_2020_21/ |
Pipeline Path | KAG_2020_21/ |
From the given raw data, we consolidated 15 tables into staging. Challenges faced: multi-level header titles and separating Kannada words from English.
SECC
Domain Name | Education |
Files Shared | Excel – 7, CSV – 31 |
Sheets Shared | Excel – 7, CSV – 31 |
Files Ingested | Excel – 7, CSV – 31 |
Sheets Ingested | Excel – 7, CSV – 31 |
Ingestion % | 100% |
Landing Tables | 40 |
Staging Tables | 7 |
Average Rating | Excel – **** (Easy), CSV – **** (Easy) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel, .csv |
LGD Code Included | YES |
Raw data S3 Path | SECC/ |
Pipeline Path | SECC/ |
From the given raw data, we ingested 40 tables into landing and consolidated them into 7 staging tables. Challenges faced: large file sizes; multi-level titles; the Haveri file had records spanning multiple lines; and some columns are completely null and need a default value.
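Filling completely null columns with a default value can be sketched as follows. The function name, the `"NA"` default, and the sample records are hypothetical; the default actually used in staging is not stated in this report.

```python
def fill_null_columns(records, default="NA"):
    """Replace values in columns that are null or blank in every record."""
    if not records:
        return []
    # A column counts as "completely null" only if no record has a value for it.
    null_cols = {
        key for key in records[0]
        if all(rec.get(key) in (None, "") for rec in records)
    }
    return [
        {key: (default if key in null_cols else value) for key, value in rec.items()}
        for rec in records
    ]


# Hypothetical rows; "remarks" is empty everywhere, "district" is not.
rows = [
    {"district": "Haveri", "remarks": None},
    {"district": "Mysuru", "remarks": ""},
]
print(fill_null_columns(rows))
```

Columns that are only partly null are left untouched, so genuine gaps in the data stay visible.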
Karnataka at a Glance 2020-21
Domain Name | Agriculture |
Files Shared | 16 |
Sheets Shared | 16 |
Files Ingested | 16 |
Sheets Ingested | 16 |
Ingestion % | 100% |
Landing Tables | 0 |
Staging Tables | 19 |
Average Rating | *** (Medium) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel (.xls) |
LGD Code Included | NO |
Raw data S3 Path | KAG_2020_21/ |
Pipeline Path | KAG_2020_21/ |
From the given raw data, we consolidated 19 tables into staging. Challenges faced: multi-level header titles and separating Kannada words from English.
CCE data of Directorate of Economics and Statistics (DES)
Domain Name | Agriculture |
Files Shared | 17 |
Sheets Shared | 32 |
Files Ingested | 17 |
Sheets Ingested | 32 |
Ingestion % | 100% |
Landing Tables | 33 |
Staging Tables | 1 |
Average Rating | **** (Very Easy) |
Processing Error Rate | 0% |
Record Error Rate | In Progress |
File Format | Excel |
LGD Code Included | NO |
Raw data S3 Path | CCE/ |
Pipeline Path | PL_P0.0_CCE.ipynb |
From the given raw data, we ingested 33 tables into landing and consolidated them into 1 staging table. Challenges faced: long column names, empty columns, and multiple spaces.
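Long column names, empty columns, and repeated spaces can be handled in one cleaning pass. This is a minimal standard-library sketch; the function names, the `max_len` limit of 64, and the sample header are assumptions rather than the notebook's actual rules.

```python
import re


def clean_name(name, max_len=64):
    """Collapse whitespace, convert to snake_case, and cap the length."""
    collapsed = " ".join(str(name).split())
    slug = re.sub(r"[^0-9a-z]+", "_", collapsed.lower()).strip("_")
    return slug[:max_len].rstrip("_")


def is_blank(value):
    """True for None or strings containing only whitespace."""
    return value is None or str(value).strip() == ""


def drop_empty_columns(header, rows):
    """Drop columns that are blank in every row and clean the remaining names."""
    keep = [i for i in range(len(header)) if not all(is_blank(r[i]) for r in rows)]
    return [clean_name(header[i]) for i in keep], [[r[i] for i in keep] for r in rows]


# Hypothetical sheet content.
header = ["  District   Name ", "Unused", "Yield (kg/ha)"]
rows = [["Mysuru", "", 12.5], ["Haveri", None, 9.0]]
print(drop_empty_columns(header, rows))
```

Truncating long names can produce duplicates, so a real pipeline would also need a uniqueness check on the cleaned names.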
Fertilizers data 2014-18
Domain Name | Agriculture |
Files Shared | 2 |
Sheets Shared | 13 |
Files Ingested | 2 |
Sheets Ingested | 13 |
Ingestion % | 100% |
Landing Tables | 11 |
Staging Tables | 6 |
Average Rating | *** (Medium) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | Fertilisers data 2014-2018/ |
Pipeline Path | Fertilizer/ |
From the given raw data, we ingested 11 tables into landing and consolidated them into 6 staging tables. Challenges faced: multi-level headers.
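Multi-level headers are commonly flattened by forward-filling the blank cells left by merged header cells and then joining the levels for each column. The sketch below illustrates that idea with plain lists; the function name, separator, and sample header rows are assumptions, not the pipeline's actual code.

```python
def flatten_headers(header_rows, sep="_"):
    """Forward-fill blanks in every level except the last, then join levels per column."""
    filled = []
    last_level = len(header_rows) - 1
    for depth, row in enumerate(header_rows):
        previous = ""
        out = []
        for cell in row:
            text = "" if cell is None else str(cell).strip()
            if text:
                previous = text
            elif depth < last_level:
                text = previous  # merged cell: inherit the label to its left
            out.append(text)
        filled.append(out)
    return [sep.join(part for part in column if part) for column in zip(*filled)]


# Hypothetical two-level header, as read from a sheet with merged cells.
levels = [
    ["District", "Urea", None, "DAP", None],
    [None, "2014", "2015", "2014", "2015"],
]
print(flatten_headers(levels))
# ['District', 'Urea_2014', 'Urea_2015', 'DAP_2014', 'DAP_2015']
```

Forward-filling assumes a blank upper-level cell always belongs to the merged cell on its left, which should be checked per sheet.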
Irrigation district wise 1954-2018
Domain Name | Agriculture |
Files Shared | 1 |
Sheets Shared | 1 |
Files Ingested | 1 |
Sheets Ingested | 1 |
Ingestion % | 100% |
Landing Tables | 1 |
Staging Tables | 2 |
Average Rating | **** (Easy) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | Irrigation – Districtwise/ |
Pipeline Path | PL_0_Irrigation_districtwise_data_2019.ipynb |
From the given raw data, we ingested 1 table into landing and consolidated it into 2 staging tables. Challenges faced: multi-level headers.
Irrigation taluk wise
Domain Name | Agriculture |
Files Shared | 2 |
Sheets Shared | 2 |
Files Ingested | 2 |
Sheets Ingested | 2 |
Ingestion % | 100% |
Landing Tables | 1 |
Staging Tables | 2 |
Average Rating | **** (Easy) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | Irrigation – Talukwise/ |
Pipeline Path | PL_0_Irrigation_Talukwise_data_2018_2019.ipynb |
From the given raw data, we ingested 1 table into landing and consolidated it into 2 staging tables. Challenges faced: multi-level headers.
Geographical land use taluk wise 2017-2018
Domain Name | Agriculture |
Files Shared | 2 |
Sheets Shared | 2 |
Files Ingested | 2 |
Sheets Ingested | 2 |
Ingestion % | 100% |
Landing Tables | 0 |
Staging Tables | 1 |
Average Rating | **** (Easy) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | Geographical land Use Data Taluk Wise 2017 & 2018/ |
Pipeline Path | PL_0_geographical_Taluk_land_data_2017_2018.ipynb |
From the given raw data, we consolidated 1 table into staging. Challenges faced: multi-level headers.
Geographical land use district wise 2007-2018
Domain Name | Agriculture |
Files Shared | 1 |
Sheets Shared | 1 |
Files Ingested | 1 |
Sheets Ingested | 1 |
Ingestion % | 100% |
Landing Tables | 0 |
Staging Tables | 1 |
Average Rating | **** (Easy) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | Geographical Land Use data Distict wise 2007 – 2018/ |
Pipeline Path | PL_0_geographical_land_District_data.ipynb |
From the given raw data, we consolidated 1 table into staging. Challenges faced: multi-level headers.
Principal crops data
Domain Name | Agriculture |
Files Shared | pdf – 14, jpeg – 873 |
Sheets Shared | pdf – 2,400, jpeg – 873 |
Files Ingested | pdf – 0, jpeg – 50 |
Sheets Ingested | pdf – 0, jpeg – 50 |
Ingestion % | In Progress |
Landing Tables | pdf – 0 |
Staging Tables | pdf – 0 |
Average Rating | pdf – * (Very Difficult), jpeg – * (Very Difficult) |
Processing Error Rate | jpeg – 5% |
Record Error Rate | In Progress |
File Format | .pdf, .jpg |
LGD Code Included | NO |
Raw data S3 Path | Pdf- Agri_Principal_Crops PDF to_Excel/ Excel- Agri_Principal_Crops_Image_To_Excel/ |
Pipeline Path | Pdf- Principal_Crops_PDF/ Excel- Principal_Crops_ImagetoExcel/ |
We are extracting the data from the given image and PDF files by converting both to Excel; this work is still in progress. Challenges faced: some images are not recognized well by the AWS Textract service.
Operation holdings area data
Domain Name | Agriculture |
Files Shared | 207 |
Sheets Shared | 207 |
Files Ingested | 207 |
Sheets Ingested | 207 |
Ingestion % | 100% |
Landing Tables | 1031 |
Staging Tables | 2 |
Average Rating | * (Very Difficult) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | Operation Holdings Area Data/ |
Pipeline Path | Agri_Operation/ |
From the given raw data, we ingested 1031 tables into landing and consolidated them into 2 staging tables. Challenges faced: multiple tables per sheet.
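When a sheet holds several tables, one simple way to separate them is to treat fully blank rows as table boundaries. The sketch below assumes the sheet has already been read into a list of row lists and that tables are stacked vertically with at least one blank row between them; neither assumption is confirmed by this report, and the function name is illustrative.

```python
def split_tables(rows):
    """Split a sheet (list of row lists) into blocks separated by fully blank rows."""
    tables, current = [], []
    for row in rows:
        blank = all(cell is None or str(cell).strip() == "" for cell in row)
        if blank:
            if current:  # a blank row closes the table collected so far
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:
        tables.append(current)
    return tables


# Hypothetical sheet with two stacked tables.
sheet = [
    ["Size class", "Area"],
    ["Marginal", 10],
    [None, None],
    ["Size class", "Holdings"],
    ["Marginal", 4],
]
print(len(split_tables(sheet)))  # 2
```

Tables placed side by side on the same rows would need a column-wise split as well, which this sketch does not cover.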
Time Series Area, Production Yield data – District wise
Domain Name | Agriculture |
Files Shared | 3 |
Sheets Shared | 93 |
Files Ingested | 3 |
Sheets Ingested | 93 |
Ingestion % | 100% |
Landing Tables | 90 |
Staging Tables | 4 |
Average Rating | ** (Difficult) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | Time Series Area, Production, Yield Data District-wise/ |
Pipeline Path | TIME_SERIES/ |
From the given raw data, we ingested 90 tables into landing and consolidated them into 4 staging tables. Challenges faced: multi-level headers.
New data received from Agriculture dept
Domain Name | Agriculture |
Files Shared | 12 |
Sheets Shared | 46 |
Files Ingested | 6 |
Sheets Ingested | 42 |
Ingestion % | 82.75% |
Landing Tables | 18 |
Staging Tables | 0 |
Average Rating | *** (Medium) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | NO |
Raw data S3 Path | New Data Received from Agriculture Dep/ |
Pipeline Path | PL_P0.0_NDRA_DEPT.ipynb |
From the given raw data, we ingested 18 tables into landing. Challenges faced: multi-level headers, corrupt text, and non-aligned columns.
Other Agriculture related data By Dr.Manjunath\Agri. Census 2015-16
Domain Name | Agriculture |
Files Shared | 1 |
Sheets Shared | 30 |
Files Ingested | 1 |
Sheets Ingested | 30 |
Ingestion % | 100% |
Landing Tables | 155 |
Staging Tables | 1 |
Average Rating | ** (Difficult) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | Data Received from Manju Nath sir 25-01-2021/ |
Pipeline Path | Agri_Census/ |
From the given raw data, we ingested 155 tables into landing and consolidated them into 1 staging table. Challenges faced: multiple tables per worksheet.
Crop Cutting Experiment Data (DAY TAY SAY)
Domain Name | Agriculture |
Files Shared | 3 |
Sheets Shared | 41 |
Files Ingested | 3 |
Sheets Ingested | 41 |
Ingestion % | 100% |
Landing Tables | 7 |
Staging Tables | 0 |
Average Rating | * (Very Difficult) |
Processing Error Rate | 10% |
Record Error Rate | In Progress |
File Format | .TXT |
LGD Code Included | NO |
Raw data S3 Path | Agri_Crop_Cutting_Text_to_Excel/ |
Pipeline Path | KRS_TEXTfiles/ |
From the given raw data, we ingested 7 tables into landing. Challenges faced: multiple tables per text file.
Crop Cutting Data\Data
Domain Name | Agriculture |
Files Shared | 42 |
Sheets Shared | 42 |
Files Ingested | 33 |
Sheets Ingested | 33 |
Ingestion % | 78.57% |
Landing Tables | 33 |
Staging Tables | 0 |
Average Rating | **** (Easy) |
Processing Error Rate | 10% |
Record Error Rate | In Progress |
File Format | Excel |
LGD Code Included | NO |
Raw data S3 Path | Data/ |
Pipeline Path | AGRICULTURE_DATA/ |
From the given raw data, we ingested 33 tables into landing. Challenges faced: multiple tables per sheet and improperly formatted column names. Some files have unusual formats, which is why they are not yet ingested; the error rate will decrease once they are ingested.
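The Ingestion % for this dataset is consistent with ingested items divided by shared items. The helper below sketches that calculation; the report does not state the formula explicitly, so treat it as an assumption, and the function name is illustrative.

```python
def ingestion_pct(ingested, shared):
    """Percentage of shared items that were ingested, rounded to two decimals."""
    if shared <= 0:
        raise ValueError("shared must be greater than zero")
    return round(100 * ingested / shared, 2)


# 33 of 42 files ingested, as listed for this dataset.
print(ingestion_pct(33, 42))  # 78.57
```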
KRS All Year 2016-2018-19
Domain Name | Agriculture |
Files Shared | 18 |
Sheets Shared | 18 |
Files Ingested | 18 |
Sheets Ingested | 18 |
Ingestion % | 100% |
Landing Tables | 18 |
Staging Tables | 0 |
Average Rating | * (Very Difficult) |
Processing Error Rate | 0% |
Record Error Rate | In Progress |
File Format | .pdf |
LGD Code Included | NO |
Raw data S3 Path | Agri_Crop_Cutting_KRS PDF_to_Excel/ |
Pipeline Path | KRS_PDF/ |
From the given raw data, we ingested 18 tables into landing. Challenges faced: multiple tables per PDF file.
Master data files
Domain Name | Master Files |
Files Shared | 42 |
Sheets Shared | 42 |
Files Ingested | 42 |
Sheets Ingested | 42 |
Ingestion % | 100% |
Landing Tables | 0 |
Staging Tables | 42 |
Average Rating | ***** (Very Easy) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | PL_P0_master_raw/ |
Pipeline Path | master/ |
From the given raw data, we consolidated 42 tables into staging. Challenges faced: long column names, empty columns, and multiple white spaces.
Raw data files (Batch 1)
Domain Name | Master Files |
Files Shared | 13 |
Sheets Shared | 15 |
Files Ingested | 13 |
Sheets Ingested | 15 |
Ingestion % | 100% |
Landing Tables | 0 |
Staging Tables | 14 |
Average Rating | **** (Easy) |
Processing Error Rate | 0% |
Record Error Rate | 0% |
File Format | Excel |
LGD Code Included | YES |
Raw data S3 Path | PL_P0_master_raw/ |
Pipeline Path | raw/ |
From the given raw data, we consolidated 14 tables into staging. Challenges faced: long column names, empty columns, and multiple spaces.