UHIndicPCwS Dataset:

UHIndicPCwS (University of Hyderabad Indic Printed Character with Style) is a printed character dataset that contains characters from six scripts: Tamil, Telugu, Kannada, Malayalam, Gujarati and Odia. It is available as a single pickle file, UHIndicPCwS.pkl.

UHIndicPCwS.pkl:

It is a Python pickle file that contains a dictionary with six keys, one for each script for which data is available: {Malayalam:[X, Y], Telugu:[X, Y], Odia:[X, Y], Gujarati:[X, Y], Tamil:[X, Y], Kannada:[X, Y]}. Each dictionary value is a two-element list holding the training samples and the corresponding class labels.
- X is a list of numpy arrays, each of which represents one character image.
- Y is a list of class labels, where Y[i] is the label of the image X[i].
Each class label has the format scriptname_classnumber (e.g. Malayalam_1).
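
To make the layout concrete, the following sketch builds a small stand-in dictionary with the same {script:[X, Y]} structure and splits a class label into its two parts. The toy values, the use of nested lists in place of numpy arrays, and the helper name split_label are assumptions made for illustration only; the sketch also assumes that the class number is always an integer, as in Malayalam_1.

```python
# Toy stand-in for the unpickled dictionary. In the real file each image is
# a numpy array; small nested lists are used here so the sketch runs without
# the dataset. All values are made up.
toy_data = {
    "Tamil": [
        [[[0, 255], [255, 0]], [[255, 255], [0, 0]]],  # X: two tiny "images"
        ["Tamil_1", "Tamil_2"],                        # Y: one label per image
    ],
}


def split_label(label):
    """Split 'scriptname_classnumber' into (script name, class number).

    Hypothetical helper; assumes the class number is an integer.
    """
    script, class_number = label.rsplit("_", 1)
    return script, int(class_number)


X, Y = toy_data["Tamil"]
# X and Y are parallel lists: Y[i] is the label of X[i]
assert len(X) == len(Y)

parsed = [split_label(label) for label in Y]
print(parsed)
```

Splitting on the last underscore (rsplit) keeps the sketch working even if a script name were to contain an underscore itself.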

The following Python code snippet retrieves the images and class labels of a particular script.


import pickle
import numpy as np  # the pickled images are numpy arrays, so numpy must be installed

# Replace with the key of the script whose data should be retrieved
script_of_interest = "Tamil"
# Path to the pickle file
file_name = './UHIndicPCwS.pkl'

with open(file_name, "rb") as pickle_in:
    lang_data = pickle.load(pickle_in)

# Look the script up directly; a misspelled key raises KeyError
val = lang_data[script_of_interest]
image_data = val[0]       # images as a list of numpy arrays
classlabel_data = val[1]  # class labels of the images
print("No of images:", len(image_data))
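
Once classlabel_data has been retrieved, a common next step is to check how many images each class has and to convert the string labels into integer indices for training. The sketch below shows one possible way to do this. The label list is a hypothetical stand-in for classlabel_data, the names class_counts, label_to_index and y_indices are illustrative, and the class number is assumed to be an integer.

```python
from collections import Counter

# Hypothetical labels standing in for classlabel_data from the snippet above
classlabel_data = ["Tamil_1", "Tamil_2", "Tamil_1", "Tamil_10", "Tamil_2", "Tamil_1"]

# Number of images per class label
class_counts = Counter(classlabel_data)

# Sort by the numeric class number so that "Tamil_10" comes after "Tamil_2"
classes = sorted(class_counts, key=lambda label: int(label.rsplit("_", 1)[1]))

# Map each string label to a zero-based integer index
label_to_index = {label: index for index, label in enumerate(classes)}
y_indices = [label_to_index[label] for label in classlabel_data]

print(class_counts.most_common())
print(y_indices)
```

Sorting on the numeric part rather than on the raw string avoids the lexicographic ordering in which "Tamil_10" would precede "Tamil_2".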