An artificial neural network is a group of interconnected neurons that mimics the human brain. A recurrent neural network (RNN) is a type of neural network that models sequential data. Sequential data is a stream of interdependent values; examples include time series and the text in language translation.
An RNN takes both the current input and the values it has seen previously. For example, in language translation, we have to know the current word and the previous words to predict the next word. RNNs can be used for generative models as well as predictive models (making predictions). They can learn the sequences of a problem and then generate entirely new sequences for the problem domain.
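To make the idea of "current input plus previous values" concrete, here is a minimal sketch of a single-unit recurrent step in plain Python. The function names and the weights (`w_x`, `w_h`, `b`) are illustrative assumptions, not part of the model trained in this post.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    # The new state depends on the current input x AND the previous state h_prev.
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    # Feed the sequence one element at a time, carrying the state forward.
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# The same zero inputs at steps 2 and 3 give different states,
# because the first sequence still "remembers" its first input.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])
```

An LSTM replaces this simple update with gated memory cells, but the principle of carrying state from one time step to the next is the same.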
Let's create a generative model for patient records in a hospital that masks all sensitive information, using an RNN.
The most basic thing that we need to create a predictive or generative model is a dataset. The sequences that the model generates depend on the dataset we provide. The more data that is available, the better the generated output tends to be.
Here we are using a dataset with patient records that includes their MRN, SSN and other sensitive information. So our model should create a similar document as our second stage output.
A small part of the dataset that we used is in the image above.
The IDE that we are going to use through this entire post is PyCharm.
- Develop LSTM Recurrent neural network for training generative model.
– Import the classes and functions that we will use to train our model.
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
– Then, to decrease the vocabulary that the network must learn, load the ASCII text for the dataset into memory and convert all the characters to lowercase.
# load ASCII text for the dataset and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()
– Next, prepare the data for modelling by the neural network. For that we need to convert all the characters to integers.
# map unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
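As a small illustration of the mapping, the sketch below encodes a short string and decodes it again. The sample string and the reverse lookup `int_to_char` are assumptions added for the example; the training script itself only needs `char_to_int`.

```python
# Hypothetical sample text standing in for the real dataset
sample_text = "mrn: 12-34"

# Same idea as above: one integer per unique character, in sorted order
chars = sorted(set(sample_text))
char_to_int = {c: i for i, c in enumerate(chars)}

# Reverse lookup, useful later for turning predictions back into text
int_to_char = {i: c for c, i in char_to_int.items()}

encoded = [char_to_int[c] for c in sample_text]
decoded = "".join(int_to_char[i] for i in encoded)
```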
– The data has been loaded and we can now summarize the dataset.
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)
Based on our requirements, we can now define the training data for the network. We have the flexibility to choose how to break up the text and how to expose it to the network during training.
We will split the text into subsequences with a fixed length, here an arbitrary length of 100 characters. We could just as easily split the data up by sentences, padding the shorter sequences and truncating the longer ones.
Each training pattern of the network consists of 100 time steps of one character (X) followed by one character output (y). While generating these sequences, we slide this window along the whole text one character at a time, so that each character gets a chance to be learned from the 100 characters that precede it (except the first 100 characters).
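The sliding window can be sketched on a tiny example. Here the window is 3 characters instead of 100, and `make_patterns` is a hypothetical helper name used only for this illustration.

```python
def make_patterns(text, seq_length):
    # Each pattern is (seq_length input characters, the single character that follows).
    pairs = []
    for i in range(len(text) - seq_length):
        seq_in = text[i:i + seq_length]
        seq_out = text[i + seq_length]
        pairs.append((seq_in, seq_out))
    return pairs

pairs = make_patterns("patient", 3)
# pairs is [('pat', 'i'), ('ati', 'e'), ('tie', 'n'), ('ien', 't')]
```

A text of n characters therefore yields n minus seq_length training patterns.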
Here we define a single hidden LSTM layer with 256 memory units. The network uses dropout with a probability of 20%. The output layer is a Dense layer using the softmax activation function, which outputs a probability between 0 and 1 for each of the 47 characters.
The problem is a single-character classification problem with 47 classes, so it is defined as optimizing the log loss (cross entropy). Here we use the ADAM optimization algorithm for speed.
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
We are modelling the dataset to learn the probability of each character in a sequence.
We are not trying to build the most accurate model of the training dataset, one that would predict each character perfectly. We are building a generalisation of the dataset that minimizes the loss function.
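The following sketch shows, in plain Python, what the softmax output and the cross-entropy loss compute for a single prediction. It uses three classes instead of 47, and the scores are made-up numbers; Keras performs the equivalent calculation on whole batches.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability, then normalise to sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    # Log loss for one example: penalise a low probability on the true class.
    return -math.log(probs[target_index])

probs = softmax([2.0, 1.0, 0.1])     # hypothetical raw scores for 3 characters
loss_good = cross_entropy(probs, 0)  # true character got the highest probability
loss_bad = cross_entropy(probs, 2)   # true character got the lowest probability
```

The loss is small when the network puts high probability on the character that actually comes next, and large otherwise.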
The network is very slow to train, so we save a checkpoint of the weights whenever the loss improves at the end of an epoch.
# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
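The placeholders in the checkpoint file name are ordinary Python format fields. As far as I know, Keras fills them in with the epoch number and the logged metrics; the sketch below only demonstrates the string formatting itself, with example values for the epoch and loss.

```python
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# {epoch:02d} pads the epoch to two digits, {loss:.4f} keeps four decimals
name = filepath.format(epoch=19, loss=1.9435)
early_name = filepath.format(epoch=3, loss=2.5)
```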
The model can now be fitted to the data. Here we use a batch size of 128 patterns and a modest 20 epochs.
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)
The full code listing is as follows.
# Small LSTM Network to Generate Text
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
# load ascii text for dataset and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()
# map unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)
# Prepare an integer-encoded dataset of input-output pairs
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    # input: seq_length characters; output: the character that follows them
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
# change X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)
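The normalisation and one-hot encoding steps in the listing can be illustrated without NumPy or Keras. This is a sketch with a made-up vocabulary size of 4; `normalize` and `one_hot` are hypothetical helper names that mirror what `X / float(n_vocab)` and `to_categorical` do for a single pattern.

```python
def normalize(encoded, n_vocab):
    # Scale integer character codes into the range [0, 1)
    return [value / float(n_vocab) for value in encoded]

def one_hot(index, n_vocab):
    # A vector of zeros with a single 1 at the position of the target character
    vec = [0] * n_vocab
    vec[index] = 1
    return vec

scaled = normalize([0, 1, 2, 3], 4)
target = one_hot(2, 4)
```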
After running the code, you will get a number of checkpoint files; you can delete all of them except the one with the smallest loss.
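Because the loss is part of each file name, the best checkpoint can be picked programmatically. The sketch below assumes the naming pattern used by the checkpoint callback; `best_checkpoint` is a hypothetical helper, and the first file name in the example list is invented for illustration.

```python
import re

# Matches names such as weights-improvement-19-1.9435.hdf5
CHECKPOINT_PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(filenames):
    # Return the checkpoint name with the smallest loss, or None if none match.
    best_name, best_loss = None, float("inf")
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match is None:
            continue  # skip files that are not checkpoints
        loss = float(match.group(2))
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name

example_files = [
    "weights-improvement-01-2.9512.hdf5",  # hypothetical early checkpoint
    "weights-improvement-13-2.1278.hdf5",
    "weights-improvement-19-1.9435.hdf5",
    "notes.txt",
]
best = best_checkpoint(example_files)
```

In practice the list of names could come from `os.listdir()` on the training directory.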
The weight file that I got after running the training is ‘weights-improvement-19-1.9435.hdf5’.
To generate the zip codes, MRN, SSN and other numerical fields, we need to train a second model in the same way using another dataset.
- Generate text with the LSTM network.
In this post we are using textgenrnn for generating text with the trained checkpoint file.
Textgenrnn is a Python 3 module on top of Keras/TensorFlow for creating char-RNNs.
- Create a new python file and name it document_generator.
- Create instances of textgenrnn namely textgen_general & textgen_values.
- Initialise it using trained models from the first section as below
textgen_general.load(Path_to_patient_records_model)
textgen_values.load(Path_to_sensitive_info_model)
e.g.:
textgen_general.load('/home/emil/Documents/pycharm_projects/LSTM_Text_Generator/com/sample/text_generator/weights-improvement-16-2.1378.hdf5')
textgen_values.load('/home/emil/Documents/pycharm_projects/LSTM_Text_Generator/com/dexlock/text_generator/weights-improvement-13-2.1278.hdf5')
The above code initialises the textgenrnn instance with the pretrained model.
We need to create the address, SSN, MRN and patient name fields in the generated document.
We will highlight the fields containing this sensitive information.
Let’s initialise lists for keeping different section names.
Also add a list for keeping the different date formats which we use to parse date of birth.
import random
from datetime import datetime

import names
from PIL import Image, ImageDraw, ImageFont

sections = ['Patient Name', 'DOB', 'SSN', 'MRN', 'DOS', 'Please', 'Sign', 'Address', 'Medical']
patient_tag_options = ['Patient', 'Patient Name', 'Re ', 're: ']
ssn_tag_options = ['Social Security #', 'SSN', 'Social Security Number ', 'SSN#']
dob_tag_options = ['Date of Birth', 'DOB', 'Birth Date', 'The patient was born on', 'Born',
                   'Date of Birth\n', 'DOB\n', 'Birth Date\n', 'The patient was born on\n', 'Born\n']
mrn_tag_options = ['Medical Record Number', 'MRN', 'MR Number', 'MR #', 'Record Number', 'Record #',
                   'Medical Record Number\n', '', 'MRN\n', 'MR Number\n', 'MR #\n', 'Record Number\n', 'Record #\n']
gender = ['male', 'female']
# Caution: in strftime, %M means minutes (the month number is %m), and %D is not a
# documented Python directive, so those tuples may not produce valid dates.
dob_valid_formats = [('%d', '%m', '%y'), ('%d', '%m', '%Y'), ('%d', '%M', '%Y'), ('%D', '%M', '%Y'),
                     ('%m', '%d', '%y'), ('%m', '%d', '%Y'), ('%M', '%d', '%Y'), ('%M', '%D', '%Y'),
                     ('%y', '%d', '%m'), ('%Y', '%d', '%m'), ('%Y', '%d', '%M'), ('%Y', '%D', '%M'),
                     ('%d', '%b', '%y'), ('%d', '%b', '%Y'), ('%d', '%B', '%Y'), ('%D', '%B', '%Y')]
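To see how one of these format tuples turns into a date string, here is a small sketch using only the standard library. The separator `/` is an assumption, since the post does not show the contents of `dob_separator`. The last example shows why `%M` deserves care: it stands for minutes, not the month.

```python
from datetime import datetime

dob = datetime(1987, 11, 20)

# Join a (day, month, year) tuple of directives with an assumed separator
day_first = "/".join(("%d", "%m", "%Y"))
with_minutes = "/".join(("%d", "%M", "%Y"))  # %M is minutes, not month

formatted_day_first = dob.strftime(day_first)   # '20/11/1987'
formatted_minutes = dob.strftime(with_minutes)  # '20/00/1987', minutes are 00
```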
Let’s provide a set of tag options to generate a shuffled template for the document.
The tags are randomly chosen for each document.
for doc_num in range(num_docs):
    file_content = ''
    file_name = 'outfile-' + str(doc_num) + '.txt'
    img_file_name = 'outfile-' + str(doc_num) + '.png'
    file = open('/home/emil/Documents/gen-files/' + file_name, 'w')
    # opening line of the document, generated at a low temperature
    section_content = textgen_general.generate(n=1, prefix="Please", temperature=random.uniform(0.1, 0.2),
                                               return_as_list=True,
                                               max_gen_length=100)
    file_content += (section_content[0] + '\n')
    file.write(section_content[0] + '\n')
    # shuffle the sections so every document has a different layout
    random.shuffle(sections)
    for section in sections:
        space = ' ' * random.randint(1, 4)
        rand_num = random.randint(1, 100)
        if section == 'Address':
            if rand_num % 5 == 0:
                # occasionally write the address as a single block
                file.write('Address\n')
                section_content = textgen_general.generate(n=1, prefix='Address',
                                                           temperature=random.uniform(0.3, 0.5),
                                                           return_as_list=True,
                                                           max_gen_length=150)
                file_content += ("Address" + space + section_content[0] + '\n')
                file.write("Address" + space + section_content[0] + '\n')
            else:
                # otherwise build it from street, city, state and zip code
                section_content = textgen_general.generate(n=1, prefix='Street',
                                                           temperature=random.uniform(0.75, 0.95),
                                                           return_as_list=True,
                                                           max_gen_length=50)
                file_content += ("Address" + section_content[0] + '\n')
                file.write("Address" + section_content[0] + '\n')
                section_content = textgen_general.generate(n=1, prefix='City',
                                                           temperature=random.uniform(0.75, 0.95),
                                                           return_as_list=True,
                                                           max_gen_length=50)
                file_content += ('City' + section_content[0] + '\n')
                file.write('City' + section_content[0] + '\n')
                section_content = textgen_general.generate(n=1, prefix='State',
                                                           temperature=random.uniform(0.75, 0.95),
                                                           return_as_list=True,
                                                           max_gen_length=50)
                file_content += ('State' + section_content[0] + '\n')
                file.write('State' + section_content[0] + '\n')
                section_content = textgen_general.generate(n=1, prefix='Zipcode',
                                                           temperature=random.uniform(0.75, 0.95),
                                                           return_as_list=True,
                                                           max_gen_length=50)
                separation_char = '\n'
                if random.randint(0, 100) % 5 == 0:
                    separation_char = '\t'
                file_content += (section_content[0] + separation_char)
                file.write(section_content[0] + separation_char)
        else:
            section_content = textgen_values.generate(n=1, prefix=section,
                                                      temperature=random.uniform(0.75, 0.95),
                                                      return_as_list=True,
                                                      max_gen_length=75)
            (replaced_string, subst_value) = replace_value(section_content[0], section)
            file_content += (replaced_string + '\n')
            file.write(replaced_string + (' ' * random.randint(1, 4)) +
                       ('\n' * random.randint(0, 2)))
            # record the sensitive value so it can be used as ground truth later
            if section in ['DOB', 'Patient Name', 'SSN', 'MRN'] and subst_value != '':
                key_file.write(file_name + ',' + section + ',' + subst_value + '\n')
        if rand_num % 2 == 0:
            # add a few filler sentences between sections
            filler_content = textgen_general.generate(n=random.randint(1, 4), prefix='The',
                                                      temperature=random.uniform(0.75, 0.95),
                                                      return_as_list=True, max_gen_length=100)
            for filler_element in filler_content:
                suffix_chars = ('\n' * random.randint(1, 2))
                file_content += (filler_element + suffix_chars)
                file.write(filler_element + suffix_chars)
    key_file.flush()
    file.close()
This loop creates text files containing the text generated by textgenrnn, formatted to include the address and other information.
The textgen_general.generate() function generates the free text for the document, such as the opening line, the address fields and the filler sentences.
Passing return_as_list=True returns the content as a list, and max_gen_length=100 limits the length of each generated text.
The textgen_values.generate() function generates the text for the remaining sections, such as SSN, MRN and DOB, which replace_value() then fills in.
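The temperature argument controls how adventurous the sampling is. The sketch below shows a common way of applying temperature to a probability distribution before sampling; it is an illustration of the general technique with made-up probabilities, not necessarily textgenrnn's exact implementation.

```python
import math
import random

def apply_temperature(probs, temperature):
    # Divide log-probabilities by the temperature, then renormalise.
    # Low temperatures sharpen the distribution; high ones flatten it.
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, temperature, rng=random):
    # Draw one index according to the temperature-adjusted distribution.
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]

base = [0.6, 0.3, 0.1]               # hypothetical next-character probabilities
cold = apply_temperature(base, 0.2)  # close to always picking the top character
hot = apply_temperature(base, 0.9)   # stays close to the original distribution
```

This is why the opening line above uses a temperature between 0.1 and 0.2 (predictable text), while the filler sentences use 0.75 to 0.95 (more varied text).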
The below function masks the sensitive information from the document.
def replace_value(str_value, section):
    # Returns the text with its tag and placeholder replaced, plus the substituted
    # value (or '' if nothing was substituted for this section).
    to_str = ''
    if section == 'Patient Name' or '<Nome-Paziente-Value>' in str_value:
        random.shuffle(gender)
        random.shuffle(patient_tag_options)
        to_str = patient_tag_options[0]
        if random.randint(0, 10) % 2 == 0:
            to_str = to_str + '\n'
        # str.replace returns a new string, so the result must be assigned
        str_value = str_value.replace('Patient Name', to_str)
        to_str = names.get_full_name(gender=gender[0])
        str_value = str_value.replace('<Nome-Paziente-Value>', to_str)
        if section != 'Patient Name' or to_str not in str_value:
            to_str = ''
    elif section == 'DOB' or '<Nascita-Value>' in str_value:
        random.shuffle(dob_tag_options)
        to_str = dob_tag_options[0]
        if random.randint(0, 10) % 2 == 0:
            to_str = to_str + '\n'
        str_value = str_value.replace('DOB', to_str)
        random.shuffle(dob_separator)
        random.shuffle(dob_valid_formats)
        dob_datetime = datetime.strptime('11/20/1987', '%m/%d/%Y')
        date_format = dob_valid_formats[0][0] + dob_separator[0] + dob_valid_formats[0][1] + dob_separator[0] + \
            dob_valid_formats[0][2]
        to_str = dob_datetime.strftime(date_format)
        str_value = str_value.replace('<Nascita-Value>', to_str)
        if section != 'DOB' or to_str not in str_value:
            to_str = ''
    elif section == 'SSN' or '<Sicurezza-Sociale-Value>' in str_value:
        random.shuffle(ssn_tag_options)
        to_str = ssn_tag_options[0]
        if random.randint(0, 10) % 2 == 0:
            to_str = to_str + '\n'
        str_value = str_value.replace('SSN', to_str)
        random.shuffle(ssn_separator)
        separator = ssn_separator[0]
        third = str(random.randint(1000, 9999))
        if random.randint(0, 10) % 2 == 0:
            # fully numeric SSN
            first = str(random.randint(100, 999))
            mid = str(random.randint(10, 99))
        else:
            # partially masked SSN: only the last four digits are shown
            random.shuffle(ssn_mask_char)
            first = ssn_mask_char[0] * 3
            if random.randint(0, 10) % 2 == 0:
                mid = ssn_mask_char[0] * 3
            else:
                mid = ssn_mask_char[0] * 2
        to_str = first + separator + mid + separator + third
        str_value = str_value.replace('<Sicurezza-Sociale-Value>', to_str)
        if section != 'SSN' or to_str not in str_value:
            to_str = ''
    elif section == 'MRN':
        if random.randint(0, 10) % 2 == 0:
            random.shuffle(mrn_tag_options)
        to_str = mrn_tag_options[0]
        if random.randint(0, 10) % 2 == 0:
            to_str = to_str + '\n'
        else:
            to_str = to_str + ' '
        if random.randint(0, 10) % 2 == 0:
            random.shuffle(mrn_separator)
            separator = mrn_separator[0]
            first = str(random.randint(10, 99))
            second = str(random.randint(10, 99))
            third = str(random.randint(100, 999))
            value = (first + separator + second + separator + third)
            str_value = str_value.replace('MRN', to_str + value)
            to_str = value
        else:
            str_value = str_value.replace('MRN', to_str)
            to_str = ''
        if section != 'MRN' or to_str not in str_value:
            to_str = ''
    return (str_value, to_str)
We are shuffling the tags for each section so that the documents generated won’t be identical.
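As a standalone illustration of the SSN branch, the sketch below builds either a fully numeric or a partially masked SSN. The contents of `ssn_separator` and `ssn_mask_char` are assumptions, because the post does not show how these lists are defined, and `make_ssn` is a hypothetical helper that simplifies the logic above.

```python
import random
import re

# Hypothetical values; the original lists are not shown in the post
ssn_separator = ['-', ' ', '.']
ssn_mask_char = ['X', '*', '#']

def make_ssn(rng):
    # Build a fake SSN: either all digits, or masked except for the last four.
    separator = rng.choice(ssn_separator)
    last_four = str(rng.randint(1000, 9999))
    if rng.random() < 0.5:
        first = str(rng.randint(100, 999))
        mid = str(rng.randint(10, 99))
    else:
        mask = rng.choice(ssn_mask_char)
        first = mask * 3
        mid = mask * 2
    return first + separator + mid + separator + last_four

# Seeding the generator makes the output reproducible
ssn = make_ssn(random.Random(42))
```

Passing a `random.Random` instance instead of using the module-level functions keeps the helper deterministic under test.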
After all these steps, we will have the generated text formatted and ready to be converted into a document.
Now to convert the formatted text into a document, we need to use ImageDraw from the PIL package.
We underline lines containing words like SSN, MRN, DOB and Patient in the document.
The draw.line() function draws lines and the draw.text() function draws text in the document.
def text2png(text, fullpath, color="#000", bgcolor="#FFF", fontfullpath=None, fontsize=13, leftpadding=25,
             rightpadding=3, width=800):
    REPLACEMENT_CHARACTER = u'\uFFFD'
    NEWLINE_REPLACEMENT_STRING = ' ' + REPLACEMENT_CHARACTER + ' '
    # prepare linkback
    # note: font.getsize() requires Pillow < 10; newer versions provide getbbox()/getlength()
    linkback = ""
    fontlinkback = ImageFont.truetype('/home/emil/fonts/arial.ttf', 8)
    linkbackx = fontlinkback.getsize(linkback)[0]
    linkback_height = fontlinkback.getsize(linkback)[1]
    # end of linkback
    font = ImageFont.truetype('/home/emil/fonts/arial.ttf', 13)
    text = text.replace('\n', NEWLINE_REPLACEMENT_STRING)
    title_font = ImageFont.truetype('/home/emil/fonts/arial.ttf', 24)
    sub_title = ImageFont.truetype('/home/emil/fonts/arial.ttf', 14)
    lines = []
    line = u""
    for word in text.split():
        if word == REPLACEMENT_CHARACTER:  # give a blank line
            lines.append(line[1:])  # slice the white space in the beginning of the line
            line = u""
            lines.append(u"")  # the blank line
        elif font.getsize(line + ' ' + word)[0] <= (width - rightpadding - leftpadding):
            line += ' ' + word
        else:  # start a new line
            lines.append(line[1:])  # slice the white space in the beginning of the line
            line = u""
            line += ' ' + word  # for now, assume no word alone can exceed the line width
    if len(line) != 0:
        lines.append(line[1:])  # add the last line
    line_height = font.getsize(text)[1]
    img_height = line_height * (len(lines) + 1)
    img = Image.new("RGBA", (width, img_height), bgcolor)
    draw = ImageDraw.Draw(img)
    draw.text((leftpadding + 200, 30), "FAKE DATA ", color, font=title_font)
    draw.line((0, 60, width, 60), fill=(0, 0, 0, 128), width=5)
    y = 80
    has_ssn = False
    has_mrn = False
    for line in lines:
        draw.text((leftpadding, y), line, color, font=font)
        y += line_height
        if 'SSN' in line and regexp.search(line):
            has_ssn = True
        if 'MRN' in line and regexp.search(line):
            has_mrn = True
        if 'SSN' in line or 'MRN' in line or 'DOB' in line or 'Patient' in line:
            # underline some of the lines that carry sensitive fields, in a random colour
            if random.randint(0, 100) % 3 == 0:
                draw.line((leftpadding, y - 1, leftpadding + width, y - 1), fill=(random.randint(0, 255),
                                                                                  random.randint(0, 255),
                                                                                  random.randint(0, 255), 128), width=2)
        elif random.randint(0, 100) % 10 == 0:
            # occasionally underline an unrelated line as well
            draw.line((leftpadding, y - 1, leftpadding + width, y - 1), fill=(random.randint(0, 255),
                                                                              random.randint(0, 255),
                                                                              random.randint(0, 255), 128), width=2)
    if not has_mrn:
        if random.randint(0, 10) % 3 == 0:
            # no MRN in the body: sometimes place one in the page header
            mrn_value = str(random.randint(10, 99)) + mrn_separator[0] + str(random.randint(10, 99)) + mrn_separator[
                0] + str(random.randint(1000, 9999))
            draw.text((leftpadding + width - 250, 20), mrn_tag_options[0] + " " + mrn_value,
                      color, font=sub_title)
            key_file.write(file_name + ',MRN,' + mrn_value + '\n')
        else:
            draw.text((leftpadding + width - 170, 10), 'Dr.' + names.get_full_name(gender=gender[0]), color, font=sub_title)
            draw.text((leftpadding + width - 170, 25), 'Wellington Hospital', color, font=sub_title)
            draw.text((leftpadding + width - 170, 40), 'Sterling, VA-22033', color, font=sub_title)
    draw.text((leftpadding, 40), '20/11/2017', color, font=font)
    # add linkback at the bottom
    # draw.text((width - linkbackx, img_height - linkback_height), linkback, color, font=fontlinkback)
    # draw.line((0, 60, width, 60), fill=(0, 0, 0, 128), width=5)
    draw.line((0, y - 10, width, y - 10), fill=(0, 0, 0, 128), width=10)
    img.save(fullpath)
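The line-wrapping part of text2png can be understood in isolation. The sketch below is a simplified greedy word wrap that measures width in characters via `len` instead of font pixels, so it runs without Pillow; `wrap_words` and its `measure` parameter are hypothetical names for this illustration.

```python
def wrap_words(text, max_width, measure=len):
    # Greedily pack words into lines no wider than max_width.
    # measure() returns the width of a string; text2png uses the font's
    # pixel width, here the default is simply the character count.
    lines = []
    line = ""
    for word in text.split():
        candidate = word if not line else line + " " + word
        if measure(candidate) <= max_width or not line:
            line = candidate  # the word still fits (or is the first on the line)
        else:
            lines.append(line)  # close the current line and start a new one
            line = word
    if line:
        lines.append(line)
    return lines

wrapped = wrap_words("Patient Name John Doe SSN 123-45-6789", 12)
# wrapped is ['Patient Name', 'John Doe SSN', '123-45-6789']
```

Swapping in a pixel-based `measure` function would give the same behaviour as the loop in text2png, apart from its handling of blank lines.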
Running this Python file, document_generator, creates one document on every iteration.
Let’s see one of the output document images created by the program.
This is one of the documents generated by the program. You can see that the document masks sensitive information. In this way, we can train data extractors on documents containing sensitive information without worrying about privacy.