How to anonymise a database
I have just sent Tim A a version of our database so that he can play with a system with a large amount of data. Before doing so I scrambled the data to hide the customer and product information using the SQL code given below. Note that the last 3 sql statements are designed to remove the 27000 attachments to hide information and save space. The others do the following:
all phone and fax numbers randomised; all addresses randomised (though the suburb and state have not been altered; all email addresses randomised; all last and company names randomised but keeping the first 3 letters, all first name randomised but keeping the first letter, all Hong Kong ID numbers (held in the travel slot) randomised; all product names and printed names randomised but keeping the first 3 letters; all microchip numbers randomised; users names and passwords randomised (but admin left as admin, and password set to ‘admin’)
I manually scrambled the email and sms gateway user names and passwords and the bank account names and numbers.
If you ever have need to anonymise things, the sql below will save you some time.
Regards, Tim G
PS - having just repeated the exercise, I realised that I omitted one final step. If you do the above, then all the customer data is anonymised, however the derived description field still contains the original information. Hence you need to get this rebuilt. The easiest way is to use Administration|Archetypes to edit the party.customerPerson archetype. Change the derived value for the description node from "concat(party:getBillingAddress(.),' - ',party:getHomeTelephone(.))" to say "concat(party:getBillingAddress(.),' ',party:getHomeTelephone(.))" and then press Apply. When you get the 'Update Changed Nodes' window, press Cancel. Then change the derived value back again, and this time, press the OK button on to get the changed nodes updated. This will rebuild the descriptions from the anonymised data.
UPDATE contact_details
SET value=concat((floor(rand()*9000)+1000),' ',(floor(rand()*9000)+1000))
where name='faxNumber';
UPDATE contact_details
SET value=concat((floor(rand()*9000)+1000),' ',(floor(rand()*9000)+1000))
where name='telephoneNumber';
UPDATE contact_details
SET value=concat((floor(rand()*9000)+1000),' ',CONV(10+FLOOR(RAND()*26),10,36),' Street')
where name='address';
UPDATE contact_details
SET value=concat(lower(CONV(10+FLOOR(RAND()*26),10,36)),(floor(rand()*9000)+1000),'@gmail.com')
where name='emailAddress';
UPDATE entity_details
SET value=concat(left(value,3),(floor(rand()*9000)+1000))
where name='lastName';
UPDATE entity_details
SET value=concat(left(value,1),(floor(rand()*9000)+1000))
where name='firstName';
UPDATE entity_details
SET value=concat(left(value,3),(floor(rand()*9000)+1000))
where name='companyName';
UPDATE entity_details
SET value=concat(upper(CONV(10+FLOOR(RAND()*26),10,36)),(floor(rand()*9000)+1000))
where name='travel';
UPDATE entity_details
SET value=concat(left(value,3),(floor(rand()*9000)+1000))
where name='printedName';
UPDATE entities
SET name=concat(left(name,3),(floor(rand()*9000)+1000))
where arch_short_name like 'product.%';
UPDATE entity_identities
SET identity=concat((floor(rand()*900)+100),'*',(floor(rand()*900)+100),'*',(floor(rand()*900)+100))
,name = identity
where arch_short_name = 'entityIdentity.microchip';
UPDATE entities
SET name=concat(left(name,3),(floor(rand()*9000)+1000))
where arch_short_name='security.user' and length(name)>3 and name <>'admin';
UPDATE entities
SET description=concat('Dr ',left(name,3),(floor(rand()*9000)+1000))
where arch_short_name='security.user' and length(name)<=3;
UPDATE users
SET password=concat('p',(floor(rand()*9000)+1000))
where user_name <> 'admin';
UPDATE documents
SET contents=0, checksum=0, doc_size=0
where (mime_type <> 'application/vnd.oasis.opendocument.text' and mime_type<> 'text/xml');
UPDATE acts a
join document_acts da on (da.document_act_id = a.act_id)
join documents d on (d.document_id = da.document_id)
SET d.contents=0, d.checksum=0, d.doc_size=0
where (a.arch_short_name = 'act.patientDocumentLetter'
or a.arch_short_name = 'act.patientDocumentForm');
OPTIMIZE table documents;
Re: How to anonymise a database
Nice SQL Tim.... although this interests me...I noticed to you are storing microchips in the form
###*###*###
I think this is the same issue I have mentioned before. Hong Kong uses avid chips, which are markedly different from chips anywhere else in the world.
Everywhere else we use a 15 digit usually written as ### ### ### ### ###
I have always had a problem with the way OPENVPMS stores microchips and advocated we include as a standard two variants to the identity.microchip archetype.
I argued that the chip numbers be stored with no spaces and no extranous marks. When displayed the archetype can specify a display form or it could be customized.
This would speed searching and minimize data entry variance...god knows I have tried to educate my staff into entering 15 digit chip numbers with NO spaces, but they argue the readibility is poor.
Re: How to anonymise a database
There is a project that covers this as a possible extension: http://www.openvpms.org/project/microchip-entry-billing
It would be good to get some more feedback on it.
The microchip length and presentation aspect could be split out into a separate project.
-Tim A
Re: How to anonymise a database
Ben - I did play with creating different archetypes for the different chip types. In the end I came back to using one archetype and imposing the following validation:
(\?\?.*)|([0-9]{3} [0-9]{3} [0-9]{3} [0-9]{3} [0-9]{3})|([0-9]{3}\*[0-9]{3}\*[0-9]{3})|([A-Z,0-9]{10})
with the error message:
Format must be 999 999 999 999 999 or 999*999*999 or 10 characters alpha numeric or anything starting ??
As I understand it, although the standard 'in country' chip is Avid, because of ex-pat owners the are a reasonable number of ISO chips (and also some Trovan 5562 and Hex-10 ones) - and the practice will install an ISO chip if you ask for one because the dog is going back overseas. The ?? is there because in a number of cases the RxWorks data was bad (ie did not match any of the allowed patterns) so I simply added a ?? at the beginning and also created a 'Bad Microchip' alert for that patient to alert the staff that if ever this patient re-appeared its microchips needed to be checked.
During the conversion I was also able to 'regularise' all the 15 digit ISO ones to use the 999 999 999 999 999 format.
I did the same thing with the phone numbers - converted all to 9999 9999 format where possible.
So we have taken the opposite approach - enforce the format on data entry and have everyone know that when you are entering search criteria you use the same format as for data entry.
Regards, Tim G
Re: How to anonymise a database
I've created a handbook page here that includes the above. The script is changed slightly to also remove ESCI information.
-Tim A