# split-rs-data **Repository Path**: yharby/split-rs-data ## Basic Information - **Project Name**: split-rs-data - **Description**: Divide remote sensing images and their labels into data sets of specified size. - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2022-02-12 - **Last Updated**: 2022-02-12 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Youssef-Harby/split-rs-data/main?labpath=Tutorial.ipynb) [![CodeFactor](https://www.codefactor.io/repository/github/youssef-harby/split-rs-data/badge)](https://www.codefactor.io/repository/github/youssef-harby/split-rs-data) # Creating tools to handle raster and vector data to split it into small pieces equaled in size for machine learning datasets ## How To Use - Install docker https://docs.docker.com/engine/install/ (macos, Windows or Linux) - Clone the Repository : `git clone https://github.com/Youssef-Harby/split-rs-data.git` - Go to project directory : `cd split-rs-data` - Copy and paste your raster(.tif) and vector(.shp) files into a seperated folders : - ``` ./split-rs-data/DataSet/ # Dataset root directory |--raster # Original raster data | |--xxx1.tif (xx1.png) | |--... | └--... | |--vector # All shapefiles in the same place (.shx, .shp..etc) | |--xxx1.shp | |--xxx1.shx / .prj / .cpg / .dbf .... | └--xxx2.shp ``` - Build the docke image : ```docker compose up --build ``` - go to http://127.0.0.1:8888/ - you will find your token in the cli of the image. ![](https://github.com/Youssef-Harby/Remote-sensing-building-extraction-to-3D-model-using-Paddle-and-Grasshopper/blob/main/md_images/jnb-token.png?raw=true) - Open Tutorial.ipynb to learn - Or define your vector and raster folders in multi_raster_vector.py file and run it in docker by open cli and type : `python multi_raster_vector.py` ## TODO - [x] Creating Docker Image for development env. - [x] Splitting raster data into equal pieces with [rasterio](https://github.com/rasterio/rasterio) (512×512) thanks to [@geoyee](https://github.com/geoyee). - [x] Splitting raster data into equal pieces with [GDAL](https://github.com/OSGeo/gdal) , https://gdal.org/. - [x] Rasterize shapefile to raster in the same satellite pixel size and projection. - [x] Convert 24 or 16 bit raster to 8 bit. - [x] Export as jpg (for raster) and png (for rasterized shapefile) with GDAL. - [X] Validation of training and testing datasets for paddlepaddle. - [ ] GUI - [X] QGIS Plugin ➡️ [Deep Learning Datasets Maker](https://github.com/deepbands/deep-learning-datasets-maker/) ![](/docs/images/vQ9YMn.gif) ## Code In Detail ⬇️ ## First - Prepareing Datasets # 1.Convert Vector to Raster (Rasterize) with reference coordinate system from raster tiff all these tools made for prepare data for paddlepaddlea. ```python from osgeo import gdal, ogr ``` - fn\_ras = Input raster data (GTiff) - fn\_vec = input vector data (Shapefile) ```python fn_ras = 'DataSet/raster/01/01.tif' fn_vec = 'DataSet/vector/01/01.shp' output = 'DataSet/results/lab_all_values.tif' ``` import the GDAL driver "ESRI Shapefile" to open the shapefile ```python driver = ogr.GetDriverByName("ESRI Shapefile") ``` open raster and shapefile datasets with (shapefile , 1) - (shapefile , 1) read and write in the shapefile - (shapefile , 0) read onle the shapefile ```python ras_ds = gdal.Open(fn_ras) vec_ds = driver.Open(fn_vec, 1) ``` Get the : - GetLayer (Only shapefiles have one lyrs other fomates maybe have multi-lyrs) \#VECTOR - GetGeoTransform \#FROM RASTER - GetProjection \#FROM RASTER ```python lyr = vec_ds.GetLayer() geot = ras_ds.GetGeoTransform() proj = ras_ds.GetProjection() # Get the projection from original tiff (fn_ras) geot ``` (342940.8074133941, 0.18114600000000536, 0.0, 3325329.401211367, 0.0, -0.1811459999999247) Open the shapefile feature to edit in it ```python layerdefinition = lyr.GetLayerDefn() feature = ogr.Feature(layerdefinition) ``` `feature.GetFieldIndex` make you to know the id of a specific field name you want to read/edit/delete - Also you can list all fields on the shapefile by : schema = [] for n in range(layerdefinition.GetFieldCount()): fdefn = layerdefinition.GetFieldDefn(n) schema.append(fdefn.name) - Then I will delete the field called "MLDS" has been assumed by me ```python yy = feature.GetFieldIndex("MLDS") if yy < 0: print("MLDS field not found, we will create one for you and make all values to 1") else: lyr.DeleteField(yy) ``` add new field to the shapefile with a default value `"1"` and don't forget to close feature after the edits ```python new_field = ogr.FieldDefn("MLDS", ogr.OFTInteger) lyr.CreateField(new_field) for feature in lyr: feature.SetField("MLDS", 1) lyr.SetFeature(feature) feature = None ``` Set the projection from original tiff (fn\_ras) to the rasterized tiff ```python drv_tiff = gdal.GetDriverByName("GTiff") chn_ras_ds = drv_tiff.Create( output, ras_ds.RasterXSize, ras_ds.RasterYSize, 1, gdal.GDT_Byte) chn_ras_ds.SetGeoTransform(geot) chn_ras_ds.SetProjection(proj) chn_ras_ds.FlushCache() ``` ```python gdal.RasterizeLayer(chn_ras_ds, [1], lyr, burn_values=[1], options=["ATTRIBUTE=MLDS"]) chn_ras_ds = None vec_ds = None ``` DONE ## Second - Splitting raster and rasterized files to small tiles 512×512 depends on your memory ```python ds = gdal.Open(fn_ras) gt = ds.GetGeoTransform() ``` get coordinates of upper left corner ```python xmin = gt[0] ymax = gt[3] resx = gt[1] res_y = gt[5] resy = abs(res_y) ``` ```python import math import os.path as osp ``` the tile size i want (may be 256×256 for smaller memory size) ```python needed_out_x = 512 needed_out_y = 512 ``` round up to the nearest int ```python xnotround = ds.RasterXSize / needed_out_x xround = math.ceil(xnotround) ynotround = ds.RasterYSize / needed_out_y yround = math.ceil(ynotround) print(xnotround) print(xround) print(ynotround) print(yround) ``` 9.30078125 10 5.689453125 6 pixel to meter - 512×10×0.18 ```python pixtomX = needed_out_x * xround * resx pixtomy = needed_out_y * yround * resy print (pixtomX) print (pixtomy) ``` 927.4675200000274 556.4805119997686 size of a single tile ```python xsize = pixtomX / xround ysize = pixtomy / yround print (xsize) print (ysize) ``` 92.74675200000274 92.74675199996143 create lists of x and y coordinates ```python xsteps = [xmin + xsize * i for i in range(xround + 1)] ysteps = [ymax - ysize * i for i in range(yround + 1)] xsteps ``` [342940.8074133941, 343033.5541653941, 343126.3009173941, 343219.0476693941, 343311.7944213941, 343404.54117339413, 343497.28792539414, 343590.03467739414, 343682.78142939415, 343775.5281813941, 343868.2749333941] set the output path ```python cdpath = "DataSet/image/" ``` loop over min and max x and y coordinates ```python for i in range(xround): for j in range(yround): xmin = xsteps[i] xmax = xsteps[i + 1] ymax = ysteps[j] ymin = ysteps[j + 1] # gdal translate to subset the input raster gdal.Translate(osp.join(cdpath, \ (str("01") + "-" + str(j) + "-" + str(i) + "." + "jpg")), ds, projWin=(abs(xmin), abs(ymax), abs(xmax), abs(ymin)), xRes=resx, yRes=-resy, outputType=gdal.gdalconst.GDT_Byte, format="JPEG") ds = None ``` ### Third - Spilit Custom Dataset and Generate File List For all data that is not divided into training set, validation set, and test set, PaddleSeg provides a script to generate segmented data and generate a file list. #### Use scripts to randomly split the custom dataset proportionally and generate a file list The data file structure is as follows: ./DataSet/ # Dataset root directory |--image # Original image catalog | |--xxx1.jpg (xx1.png) | |--... | └--... | |--label # Annotated image catalog | |--xxx1.png | |--... | └--... Among them, the corresponding file name can be defined according to needs. The commands used are as follows, which supports enabling specific functions through different Flags. python tools/split_dataset_list.py ${FLAGS} Parameters: - dataset\_root: Dataset root directory - images\_dir\_name: Original image catalog - labels\_dir\_name: Annotated image catalog FLAGS: | FLAG | Meaning | Default | Parameter numbers | | --------------- | ------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------- | ----------------- | | \--split | Dataset segmentation ratio | 0.7 0.3 0 | 3 | | \--separator | File list separator | " | " | | \--format | Data format of pictures and label sets | "jpg" "png" | 2 | | \--label\_class | Label category | '\_\_background\_\_' '\_\_foreground\_\_' | several | | \--postfix | Filter pictures and label sets according to whether the main file name (without extension) contains the specified suffix | "" ""(2 null characters) | 2 | After running, `train.txt`, `val.txt`, `test.txt` and `labels.txt` will be generated in the root directory of the dataset. **Note:** Requirements for generating the file list: either the original image and the number of annotated images are the same, or there is only the original image without annotated images. If the dataset lacks annotated images, a file list without separators and annotated image paths will be generated. #### Example python tools/split_dataset_list.py images annotations --split 0.6 0.2 0.2 --format jpg png ## Dataset file organization - If you need to use a custom dataset for training, it is recommended to organize it into the following structure: custom\_dataset | |--images | |--image1.jpg | |--image2.jpg | |--... | |--labels | |--label1.png | |--label2.png | |--... | |--train.txt | |--val.txt | |--test.txt The contents of train.txt and val.txt are as follows: image/image1.jpg label/label1.png image/image2.jpg label/label2.png ... Full Docs : ```python import sys import subprocess ``` ```python theproc = subprocess.Popen([ "python", r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\split_dataset_list.py", #Split text py script r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet", # Root DataSet ath r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet\image", #images path r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet\label", # "--split", # "0.6", # 60% training # "0.2", # 20% validating # "0.2", # 20% testing "--format", "jpg", "png"]) theproc.communicate() ``` (None, None)