Using the neural network for fake document generation

May 9, 2022, by Akshara B

An artificial neural network is a group of interconnected neurons that mimics the human brain. A recurrent neural network (RNN) is a type of neural network that captures sequential patterns in its input. Sequential data is a stream of interdependent values, such as time series data or the text used in language translation.

In an RNN, each step takes both the current input and the values from previous steps. In language translation, for example, we need to know both the current word and the previous words to predict the next word. RNNs can be used for generative models as well as predictive models. They can learn the sequences of a problem and then generate entirely new sequences for the problem domain.
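To make the idea of "current input plus previous values" concrete, here is a minimal sketch of a single recurrent cell working on plain numbers. The weights `w_x`, `w_h` and `b` are arbitrary illustrative values, not anything learned, and a real LSTM layer is considerably more involved.

```python
import math


def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar recurrent cell: mixes the current input with the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(inputs, h0=0.0):
    """Feed a sequence through the cell and return the hidden state after each step."""
    states = []
    h = h0
    for x_t in inputs:
        h = rnn_step(x_t, h)
        states.append(h)
    return states


# Only the first input is non-zero, yet later states stay non-zero:
# the earlier input is carried forward through the hidden state.
states = run_rnn([1.0, 0.0, 0.0])
```

The later states shrink but do not vanish immediately, which is the "memory" that lets the network relate a character to those that came before it.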

Let's use an RNN to create a generative model for hospital patient records in which all sensitive information is masked.

The most basic thing we need to create a predictive or generative model is a dataset. The sequences that the model generates depend on the dataset we provide. In general, the more data that is available, the better the quality of the generated output.

Here we are using a dataset of patient records that includes MRNs, SSNs and other sensitive information. So our model should create a similar document as its second-stage output.

A small part of the dataset that we used is in the image above.  

The IDE that we are going to use throughout this post is PyCharm.

  1. Develop LSTM Recurrent neural network for training generative model.

Import the classes and functions that we will use to train our model.

import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import to_categorical

Then, to decrease the vocabulary that the network must learn, load the ASCII text for the dataset into memory and convert all the characters to lowercase.

# load ASCII text for the dataset and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()

Next, prepare the data for modelling by the neural network. The network cannot work on characters directly, so we map each unique character to an integer.

# map unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

The data has been loaded, and we can now summarize the dataset.

n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Based on our requirements, we can now define the training data for the network. We have the flexibility to choose how to break up the text and how to expose it to the network during training.

We will split the text into subsequences with a fixed length, here an arbitrary length of 100 characters. We could just as easily split the data up by sentences, padding the shorter sequences and truncating the longer ones.

Each training pattern of the network consists of 100 time steps of one character (X) followed by one character of output (y). When generating these sequences, we slide this window along the whole dataset one character at a time, so that every character (except the first 100) gets a chance to be learned from the 100 characters that preceded it.
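The sliding window is easier to see on a toy string. The sketch below uses a window of 4 characters instead of 100. The helper name `make_patterns` and the sample text are illustrative only, but the loop is the same one used in the full listing later in the post.

```python
def make_patterns(text, seq_length):
    """Slide a window over the text; each window is an input, the next character is the target."""
    chars = sorted(set(text))
    char_to_int = {c: i for i, c in enumerate(chars)}
    data_x, data_y = [], []
    for i in range(len(text) - seq_length):
        seq_in = text[i:i + seq_length]      # seq_length characters of context
        seq_out = text[i + seq_length]       # the character that follows them
        data_x.append([char_to_int[c] for c in seq_in])
        data_y.append(char_to_int[seq_out])
    return data_x, data_y, char_to_int


data_x, data_y, char_to_int = make_patterns("hello world", seq_length=4)
# The first window is "hell" and its target is "o".
```

A text of N characters yields N minus the window length patterns, which is why the first 100 characters of the real dataset never appear as targets.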

Here we define a single hidden LSTM layer with 256 memory units. The network uses dropout with a probability of 20%. The output layer is a Dense layer using the softmax activation function, which outputs a probability between 0 and 1 for each of the 47 characters.

The problem is a single-character classification problem with 47 classes, and as such it is defined as optimizing the log loss (cross entropy). Here we use the Adam optimization algorithm for speed.

# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
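For readers unfamiliar with the softmax output and the log loss mentioned above, here is a small standalone sketch in plain Python. The three raw scores are invented for illustration. Keras computes the same quantities internally over the whole vocabulary.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Log loss for one example: small when the true class gets high probability."""
    return -math.log(probs[target_index])


probs = softmax([2.0, 1.0, 0.1])              # made-up scores for three characters
loss_if_first_is_correct = cross_entropy(probs, 0)
loss_if_last_is_correct = cross_entropy(probs, 2)
```

The loss is lower when the correct character is the one the network rated most likely, which is what training pushes towards.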

What we are really modelling is the probability of each character given the sequence that precedes it.

We are not trying to build the most accurate model of the training data, one that would predict each character in the dataset perfectly. Instead, we want a generalisation of the dataset that minimizes the loss function.

The network is very slow to train, so we save a checkpoint of the weights whenever the loss improves during training.

# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

The model can now be fitted to the data. Here we use a fairly large batch size of 128 patterns and a modest 20 epochs.

model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

The full code listing is as follows.

# Small LSTM Network to Generate Text
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import to_categorical

# load ascii text for dataset and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()

# map unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

# prepare an integer-encoded dataset of input-output pairs
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

# change X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))

# normalize
X = X / float(n_vocab)

# one hot encode the output variable
y = to_categorical(dataY)

# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fit the model
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

After running the code, you will have a number of checkpoint files. You can delete all of them except the one with the smallest loss.
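Because the loss is embedded in each checkpoint file name, picking the file to keep can be automated. The following is a minimal sketch. The helper name `best_checkpoint` and the first file name in the example list are made up, while the naming pattern follows the `filepath` template used above.

```python
import re

# Matches names produced by "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
CHECKPOINT_PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")


def best_checkpoint(filenames):
    """Return the checkpoint file name with the lowest loss, or None if none match."""
    best_name, best_loss = None, float("inf")
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match is None:
            continue                          # ignore unrelated files
        loss = float(match.group(2))
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name


example_files = [
    "weights-improvement-01-2.9812.hdf5",     # hypothetical early checkpoint
    "weights-improvement-19-1.9435.hdf5",
    "notes.txt",
]
best = best_checkpoint(example_files)
```

In practice the list of names could come from `os.listdir()` on the training directory, and every matching file other than `best` could then be removed.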

The weight file that I got after running the training is ‘weights-improvement-19-1.9435.hdf5’.

To generate the zip codes, MRNs, SSNs and other numerical fields, we need to train a second model in the same way using another dataset.

  2. Generating text with the LSTM network

In this post we use textgenrnn to generate text from the trained checkpoint file.

textgenrnn is a Python 3 module built on top of Keras/TensorFlow for creating char-RNNs.

  1. Create a new Python file and name it document_generator.
  2. Create two instances of textgenrnn, namely textgen_general and textgen_values.
  3. Initialise them with the trained models from the first section, as below.

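from textgenrnn import textgenrnn

textgen_general = textgenrnn()
textgen_values = textgenrnn()
textgen_general.load(Path_to_patient_records_model)
textgen_values.load(Path_to_sensitive_info_model)

For example, the weights path can also be passed directly when creating the instance:

textgen_general = textgenrnn('/home/emil/Documents/pycharm_projects/LSTM_Text_Generator/com/sample/text_generator/weights-improvement-16-2.1378.hdf5')
textgen_values = textgenrnn('/home/emil/Documents/pycharm_projects/LSTM_Text_Generator/com/dexlock/text_generator/weights-improvement-13-2.1278.hdf5')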

The above code initialises the textgenrnn instances with the pretrained models.

We need to create the address, SSN, MRN and patient name fields in the generated document.

We will highlight the fields containing this sensitive information.

Let's initialise lists that hold the different section names and the tag variants for each section.

We also add a list of the different date formats that we use to format the date of birth.

import random
from datetime import datetime
import names  # third-party package, used later for names.get_full_name()

sections = ['Patient Name', 'DOB', 'SSN', 'MRN', 'DOS', 'Please', 'Sign', 'Address', 'Medical']
patient_tag_options = ['Patient', 'Patient Name', 'Re ', 're: ']
ssn_tag_options = ['Social Security #', 'SSN', 'Social Security Number ', 'SSN#']
dob_tag_options = ['Date of Birth', 'DOB', 'Birth Date', 'The patient was born on', 'Born',
                   'Date of Birth\n', 'DOB\n', 'Birth Date\n', 'The patient was born on\n', 'Born\n']
mrn_tag_options = ['Medical Record Number', 'MRN', 'MR Number', 'MR #', 'Record Number', 'Record #',
                   'Medical Record Number\n', '', 'MRN\n', 'MR Number\n', 'MR #\n', 'Record Number\n', 'Record #\n']
gender = ['male', 'female']
# strftime directive triples; %M means minutes in strftime, so months use %m, %b or %B
dob_valid_formats = [('%d', '%m', '%y'), ('%d', '%m', '%Y'),
                     ('%m', '%d', '%y'), ('%m', '%d', '%Y'),
                     ('%y', '%d', '%m'), ('%Y', '%d', '%m'),
                     ('%d', '%b', '%y'), ('%d', '%b', '%Y'), ('%d', '%B', '%Y')]
# The lists below are used later but are not shown in the original post; these values are assumptions
dob_separator = ['/', '-', '.']
ssn_separator = ['-', ' ']
mrn_separator = ['-', '/']
ssn_mask_char = ['X', '*', '#']
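Each entry in the date format list is a triple of `strftime` directives that is later joined with a separator. The sketch below shows that composition in isolation. The helper name `format_dob` is illustrative, and the date is the same sample date used later in the post. In `strftime`, uppercase `%M` means minutes, so month fields need `%m`, `%b` or `%B`.

```python
from datetime import datetime


def format_dob(dob, parts, separator):
    """Join three strftime directives with a separator and format the date with the result."""
    return dob.strftime(separator.join(parts))


dob = datetime(1987, 11, 20)
us_style = format_dob(dob, ("%m", "%d", "%Y"), "/")      # month/day/four-digit year
short_style = format_dob(dob, ("%d", "%m", "%y"), "-")   # day-month-two-digit year
```

Shuffling both the directive triples and the separators gives many visually different renderings of the same date of birth.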

Let’s provide a set of tag options to generate a shuffled template for the document.

The tags are randomly chosen for each document.

# num_docs (an integer) and key_file (an open, writable file) are assumed to be defined earlier
for doc_num in range(num_docs):
    file_content = ''
    file_name = 'outfile-' + str(doc_num) + '.txt'
    img_file_name = 'outfile-' + str(doc_num) + '.png'
    file = open('/home/emil/Documents/gen-files/' + file_name, 'w')
    # opening line of the document
    section_content = textgen_general.generate(n=1, prefix="Please", temperature=random.uniform(0.1, 0.2),
                                               return_as_list=True,
                                               max_gen_length=100)
    file_content += (section_content[0] + '\n')
    file.write(section_content[0] + '\n')
    # visit the sections in a different order for every document
    random.shuffle(sections)
    for section in sections:
        space = ' ' * random.randint(1, 4)
        rand_num = random.randint(1, 100)
        if section == 'Address':
            if rand_num % 5 == 0:
                # single free-form address block
                file.write('Address\n')
                section_content = textgen_general.generate(n=1, prefix='Address',
                                                           temperature=random.uniform(0.3, 0.5),
                                                           return_as_list=True,
                                                           max_gen_length=150)
                file_content += ("Address" + space + section_content[0] + '\n')
                file.write("Address" + space + section_content[0] + '\n')
            else:
                # address split into street, city, state and zip code
                section_content = textgen_general.generate(n=1, prefix='Street', temperature=random.uniform(0.75, 0.95),
                                                           return_as_list=True,
                                                           max_gen_length=50)
                file_content += ("Address" + section_content[0] + '\n')
                file.write("Address" + section_content[0] + '\n')
                section_content = textgen_general.generate(n=1, prefix='City', temperature=random.uniform(0.75, 0.95),
                                                           return_as_list=True,
                                                           max_gen_length=50)
                file_content += ('City' + section_content[0] + '\n')
                file.write('City' + section_content[0] + '\n')
                section_content = textgen_general.generate(n=1, prefix='State', temperature=random.uniform(0.75, 0.95),
                                                           return_as_list=True,
                                                           max_gen_length=50)
                file_content += ('State' + section_content[0] + '\n')
                file.write('State' + section_content[0] + '\n')
                section_content = textgen_general.generate(n=1, prefix='Zipcode',
                                                           temperature=random.uniform(0.75, 0.95), return_as_list=True,
                                                           max_gen_length=50)
                separation_char = '\n'
                if random.randint(0, 100) % 5 == 0:
                    separation_char = '\t'
                file_content += (section_content[0] + separation_char)
                file.write(section_content[0] + separation_char)
        else:
            # every other section comes from the values model and is passed through replace_value()
            section_content = textgen_values.generate(n=1, prefix=section, temperature=random.uniform(0.75, 0.95),
                                                      return_as_list=True,
                                                      max_gen_length=75)
            (replaced_string, subst_value) = replace_value(section_content[0], section)
            file_content += (replaced_string + '\n')
            file.write(replaced_string + (' ' * random.randint(1, 4)) +
                       ('\n' * random.randint(0, 2)))
            # record the substituted value so it can serve as ground truth later
            if section in ['DOB', 'Patient Name', 'SSN', 'MRN'] and subst_value != '':
                key_file.write(file_name + ',' + section + ',' + subst_value + '\n')
        if rand_num % 2 == 0:
            # add one to four filler sentences after roughly half of the sections
            filler_content = textgen_general.generate(n=random.randint(1, 4), prefix='The',
                                                      temperature=random.uniform(0.75, 0.95),
                                                      return_as_list=True, max_gen_length=100)
            for filler_element in filler_content:
                suffix_chars = ('\n' * random.randint(1, 2))
                file_content += (filler_element + suffix_chars)
                file.write(filler_element + suffix_chars)
    key_file.flush()
    file.close()

This loop creates text files containing the text generated by textgenrnn, formatted to include the address and other information.
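The loop also writes one line per substituted value to `key_file`, which the snippet never shows being opened. As a sketch of how such a key file could be written and read back, the example below uses the standard `csv` module, which also copes with values that contain commas. The helper names and the sample values are invented for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_key_rows(key_file, file_name, fields):
    """Write one (file name, section, value) row per substituted field."""
    writer = csv.writer(key_file)
    for section, value in fields:
        writer.writerow([file_name, section, value])


def read_key_rows(key_file):
    """Read the rows back as tuples, e.g. to use them as labels for an extractor."""
    return [tuple(row) for row in csv.reader(key_file)]


buffer = io.StringIO()                        # stands in for open('keys.csv', 'w', newline='')
write_key_rows(buffer, "outfile-0.txt", [("SSN", "123-45-6789"), ("Patient Name", "Doe, Jane")])
buffer.seek(0)
rows = read_key_rows(buffer)
```

Each row ties a generated document to the value that was planted in it, so the key file can act as ground truth when evaluating a data extractor.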

The textgen_general.generate() function generates the text for each section of the document, starting from the given prefix.

Passing return_as_list=True and max_gen_length=100 returns the content as a list and limits the length of the generated text.

The textgen_values.generate() function generates the values for fields such as the zipcode.
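The `temperature` argument passed to generate() in the loop above controls how adventurous the sampling is: low values stick to the most likely characters, higher values allow more variety. The sketch below illustrates the general technique of temperature scaling on a made-up three-character distribution. It is not textgenrnn's internal code, whose details may differ.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution: temperatures below 1 sharpen it, above 1 flatten it."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, temperature, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


probs = [0.6, 0.3, 0.1]                       # hypothetical next-character probabilities
cautious = apply_temperature(probs, 0.2)      # like the 0.1-0.2 range used for the opening line
adventurous = apply_temperature(probs, 0.9)   # like the 0.75-0.95 range used for the values
```

This is consistent with the loop using a low temperature for the opening line, where predictable text is wanted, and a higher one for field values and filler, where variety matters.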

The function below replaces the placeholders and tags in the generated text, so that no real sensitive information appears in the document.

def replace_value(str_value, section):
    # returns the text with tags and placeholders filled in, plus the substituted value ('' if none)
    to_str = ''
    if section == 'Patient Name' or '<Nome-Paziente-Value>' in str_value:
        random.shuffle(gender)
        random.shuffle(patient_tag_options)
        to_str = patient_tag_options[0]
        if random.randint(0, 10) % 2 == 0:
            to_str = to_str + '\n'
        # str.replace returns a new string, so the result has to be assigned
        str_value = str_value.replace('Patient Name', to_str)
        to_str = names.get_full_name(gender=gender[0])
        str_value = str_value.replace('<Nome-Paziente-Value>', to_str)
        if section != 'Patient Name' or to_str not in str_value:
            to_str = ''
    elif section == 'DOB' or '<Nascita-Value>' in str_value:
        random.shuffle(dob_tag_options)
        to_str = dob_tag_options[0]
        if random.randint(0, 10) % 2 == 0:
            to_str = to_str + '\n'
        str_value = str_value.replace('DOB', to_str)
        random.shuffle(dob_separator)
        random.shuffle(dob_valid_formats)
        dob_datetime = datetime.strptime('11/20/1987', '%m/%d/%Y')
        # build a format such as '%d/%m/%Y' from the chosen directives and separator
        date_format = dob_valid_formats[0][0] + dob_separator[0] + dob_valid_formats[0][1] + dob_separator[0] + \
                      dob_valid_formats[0][2]
        to_str = dob_datetime.strftime(date_format)
        str_value = str_value.replace('<Nascita-Value>', to_str)
        if section != 'DOB' or to_str not in str_value:
            to_str = ''
    elif section == 'SSN' or '<Sicurezza-Sociale-Value>' in str_value:
        random.shuffle(ssn_tag_options)
        to_str = ssn_tag_options[0]
        if random.randint(0, 10) % 2 == 0:
            to_str = to_str + '\n'
        str_value = str_value.replace('SSN', to_str)
        random.shuffle(ssn_separator)
        separator = ssn_separator[0]
        third = str(random.randint(1000, 9999))
        if random.randint(0, 10) % 2 == 0:
            # fully numeric SSN
            first = str(random.randint(100, 999))
            mid = str(random.randint(10, 99))
        else:
            # SSN with the first two groups masked
            random.shuffle(ssn_mask_char)
            first = ssn_mask_char[0] * 3
            if random.randint(0, 10) % 2 == 0:
                mid = ssn_mask_char[0] * 3
            else:
                mid = ssn_mask_char[0] * 2
        to_str = first + separator + mid + separator + third
        str_value = str_value.replace('<Sicurezza-Sociale-Value>', to_str)
        if section != 'SSN' or to_str not in str_value:
            to_str = ''
    elif section == 'MRN':
        if random.randint(0, 10) % 2 == 0:
            random.shuffle(mrn_tag_options)
            to_str = mrn_tag_options[0]
            if random.randint(0, 10) % 2 == 0:
                to_str = to_str + '\n'
            else:
                to_str = to_str + ' '
            if random.randint(0, 10) % 2 == 0:
                random.shuffle(mrn_separator)
                separator = mrn_separator[0]
                first = str(random.randint(10, 99))
                second = str(random.randint(10, 99))
                third = str(random.randint(100, 999))
                value = (first + separator + second + separator + third)
                str_value = str_value.replace('MRN', to_str + value)
                to_str = value
            else:
                str_value = str_value.replace('MRN', to_str)
                to_str = ''
            if section != 'MRN' or to_str not in str_value:
                to_str = ''
    return (str_value, to_str)

We are shuffling the tags for each section so that the documents generated won’t be identical.
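As a smaller illustration of the same idea, the sketch below picks a random tag and builds a made-up, optionally masked, SSN-shaped value. `random.choice` is used here as an alternative to shuffling the list and taking the first element. The helper name `fake_ssn`, the `-` separator and the `X` mask character are assumptions, since the post does not show those lists.

```python
import random

SSN_TAG_OPTIONS = ["Social Security #", "SSN", "Social Security Number ", "SSN#"]


def fake_ssn(rng, separator="-", mask_char="X", masked=False):
    """Build a made-up SSN-shaped string; optionally mask the first two groups."""
    last = str(rng.randint(1000, 9999))
    if masked:
        first, mid = mask_char * 3, mask_char * 2
    else:
        first, mid = str(rng.randint(100, 999)), str(rng.randint(10, 99))
    return separator.join([first, mid, last])


rng = random.Random(42)                       # seeded only to make the example repeatable
tag = rng.choice(SSN_TAG_OPTIONS)             # picks a tag without reordering the list
line = tag + " " + fake_ssn(rng)
masked_line = tag + " " + fake_ssn(rng, masked=True)
```

Unlike `random.shuffle`, `random.choice` leaves the original list untouched, which can make the generator easier to reason about.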

After all these steps, we will have the generated text formatted and ready to be converted into a document.

Now to convert the formatted text into a document, we need to use ImageDraw from the PIL package.

We underline some of the lines that contain labels such as SSN, MRN, DOB and Patient in the document.

The draw.line() function draws lines and the draw.text() function draws text on the image.

from PIL import Image, ImageDraw, ImageFont

# regexp (a compiled pattern), file_name and key_file are assumed to be defined elsewhere in the script
def text2png(text, fullpath, color="#000", bgcolor="#FFF", fontfullpath=None, fontsize=13, leftpadding=25,
             rightpadding=3, width=800):
    REPLACEMENT_CHARACTER = '\uFFFD'
    NEWLINE_REPLACEMENT_STRING = ' ' + REPLACEMENT_CHARACTER + ' '
    # prepare linkback
    linkback = ""
    fontlinkback = ImageFont.truetype('/home/emil/fonts/arial.ttf', 8)
    # note: getsize() requires Pillow < 10; newer versions provide getbbox() and getlength() instead
    linkbackx = fontlinkback.getsize(linkback)[0]
    linkback_height = fontlinkback.getsize(linkback)[1]
    # end of linkback
    font = ImageFont.truetype('/home/emil/fonts/arial.ttf', 13)
    # mark newlines with a placeholder word so they survive text.split()
    text = text.replace('\n', NEWLINE_REPLACEMENT_STRING)
    title_font = ImageFont.truetype('/home/emil/fonts/arial.ttf', 24)
    sub_title = ImageFont.truetype('/home/emil/fonts/arial.ttf', 14)
    lines = []
    line = ""
    for word in text.split():
        if word == REPLACEMENT_CHARACTER:  # give a blank line
            lines.append(line[1:])  # slice the white space in the beginning of the line
            line = ""
            lines.append("")  # the blank line
        elif font.getsize(line + ' ' + word)[0] <= (width - rightpadding - leftpadding):
            line += ' ' + word
        else:  # start a new line
            lines.append(line[1:])  # slice the white space in the beginning of the line
            line = ""
            line += ' ' + word  # for now, assume no word alone can exceed the line width
    if len(line) != 0:
        lines.append(line[1:])  # add the last line
    line_height = font.getsize(text)[1]
    img_height = line_height * (len(lines) + 1)
    img = Image.new("RGBA", (width, img_height), bgcolor)
    draw = ImageDraw.Draw(img)
    # page header
    draw.text((leftpadding + 200, 30), "FAKE DATA ", color, font=title_font)
    draw.line((0, 60, width, 60), fill=(0, 0, 0, 128), width=5)
    y = 80
    has_ssn = False
    has_mrn = False
    for line in lines:
        draw.text((leftpadding, y), line, color, font=font)
        y += line_height
        if 'SSN' in line and regexp.search(line):
            has_ssn = True
        if 'MRN' in line and regexp.search(line):
            has_mrn = True
        # randomly underline some of the lines that carry a sensitive field
        if 'SSN' in line or 'MRN' in line or 'DOB' in line or 'Patient' in line:
            if random.randint(0, 100) % 3 == 0:
                draw.line((leftpadding, y - 1, leftpadding + width, y - 1), fill=(random.randint(0, 255),
                                                                                  random.randint(0, 255),
                                                                                  random.randint(0, 255), 128), width=2)
            elif random.randint(0, 100) % 10 == 0:
                draw.line((leftpadding, y - 1, leftpadding + width, y - 1), fill=(random.randint(0, 255),
                                                                                  random.randint(0, 255),
                                                                                  random.randint(0, 255), 128), width=2)
    if not has_mrn:
        # occasionally place an MRN in the page header instead
        if random.randint(0, 10) % 3 == 0:
            mrn_value = str(random.randint(10, 99)) + mrn_separator[0] + str(random.randint(10, 99)) + mrn_separator[
                0] + str(random.randint(1000, 9999))
            draw.text((leftpadding + width - 250, 20), mrn_tag_options[0] + " " + mrn_value,
                      color, font=sub_title)
            key_file.write(file_name + ',MRN,' + mrn_value + '\n')
    else:
        draw.text((leftpadding + width - 170, 10), 'Dr.' + names.get_full_name(gender=gender[0]), color, font=sub_title)
        draw.text((leftpadding + width - 170, 25), 'Wellington Hospital', color, font=sub_title)
        draw.text((leftpadding + width - 170, 40), 'Sterling, VA-22033', color, font=sub_title)
    draw.text((leftpadding, 40), '20/11/2017', color, font=font)
    # add linkback at the bottom
    # draw.text((width - linkbackx, img_height - linkback_height), linkback, color, font=fontlinkback)
    # draw.line((0, 60, width, 60), fill=(0, 0, 0, 128), width=5)
    draw.line((0, y - 10, width, y - 10), fill=(0, 0, 0, 128), width=10)
    img.save(fullpath)
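The line-wrapping part of text2png is a greedy word wrap: words are added to the current line until the next one would exceed the available width. The sketch below isolates that logic. It measures width in characters through a `measure` parameter, which is a stand-in assumption for the font-based pixel measurement used in the function above.

```python
def wrap_words(text, max_width, measure=len):
    """Greedy word wrap: keep adding words to a line until the next one would overflow."""
    lines = []
    line = ""
    for word in text.split():
        candidate = word if not line else line + " " + word
        if measure(candidate) <= max_width or not line:
            line = candidate                  # the word fits (or is alone on the line)
        else:
            lines.append(line)                # close the current line
            line = word                       # and start a new one with this word
    if line:
        lines.append(line)
    return lines


wrapped = wrap_words("the patient was admitted on monday morning", max_width=16)
```

Swapping `len` for a function that returns the rendered pixel width of a string would make the same loop work with a real font.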

Running the Python file named 'document_generator' creates a new document on every iteration.

Let's see one of the output document images created by the program.

This is one of the documents generated by the program. You can see that the sensitive information in the document is masked. In this way, we can train data extractors on documents of this kind without worrying about privacy.