I've been using ChatGPT to help with various kinds of analysis and coding tasks. Today I had the problem of generating English labels for an ontology that was in Spanish. I thought a description of the steps I took to solve this might be interesting. The ontology is called ITEMAS. It is an ontology about healthcare innovation and the original can be found here: ITEMAS on BioPortal.
The first thing I tried was a special version of ChatGPT called the Document Analyzer. For a small fee, you can get access to around 30 different specialized versions of ChatGPT. I purchased this subscription a few months ago to solve a specific problem I was having with a proposal, intending to cancel it after the first month. However, I soon realized that besides the specific feature I needed for the proposal, there were a number of other special ChatGPTs that were very useful, such as the Document Analyzer. I had the Document Analyzer read the OWL file and then asked it to create additional labels in English as well as Spanish.
This didn't work. At first, the Document Analyzer could barely parse the ontology. I took a quick look at the ontology and realized the problem: there were no labels. I can understand how this happened. One of the options in Protégé is to create IRIs with user-supplied names, an option I usually think is the correct one to use. However, when you use this option Protégé doesn't generate any labels for new entities. That's why I wrote these SPARQL queries to generate labels from IRIs. This also required a bit of extra work because the IRIs weren't consistent. Sometimes they used a convention like "My_Class", other times one like "myProperty". That's an important lesson: the whole point of standards is to be consistent. Often which standard you use doesn't make much difference; what matters is that you pick one and stick with it. I had to modify those SPARQL queries a bit to handle the IRIs that used the underscore standard, but I soon had labels for all the entities.
BTW, I've switched my preference now that the newest version of Protégé uses the underscore by default if you are using user-generated IRIs. E.g., if you are using the user-generated IRI option and type "My New Class", Protégé creates an IRI that has "My_New_Class" as the last part of the IRI. It didn't use to do that because theoretically you can have blanks in an IRI, but I've found that having blanks in the IRI often causes trouble. With the latest version that won't happen, so for new ontologies I develop I'm using the underscore for my IRI names. You won't find this standard in any of my existing ontologies, because I started them using the CamelBack option and, as I mentioned, you defeat the purpose of standards if you don't stick to the same one. But in future ontologies I'm going to use underscores.
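To make the label-from-IRI step concrete, here is a minimal Python sketch of the same idea. It is not the SPARQL I actually used; the function name `iri_to_label` and the example.org IRIs are made up for illustration, and it assumes the name you want is whatever follows the last "#" or "/" in the IRI.

```python
import re


def iri_to_label(iri: str) -> str:
    """Derive a readable label from the last segment of an IRI (hypothetical helper)."""
    # Keep only the text after the last '#' or '/'.
    local_name = re.split(r"[#/]", iri)[-1]
    # Underscore convention: "My_Class" becomes "My Class".
    text = local_name.replace("_", " ")
    # Camel-case convention: add a space at each lower-to-upper boundary,
    # so "myProperty" becomes "my Property".
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Collapse any repeated spaces left over from mixed conventions.
    return " ".join(text.split())


if __name__ == "__main__":
    for example in (
        "http://example.org/itemas#Accion_Comercial",
        "http://example.org/itemas#myProperty",
    ):
        print(example, "->", iri_to_label(example))
```

Handling both conventions in one function is what lets a single pass cope with an ontology whose IRIs were not named consistently.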
Now I had a version of the ITEMAS ontology with Spanish labels. The Document Analyzer could parse it more easily, but it still had problems. If you've ever used ChatGPT this may have happened to you: once in a while you keep telling it "do X" and it says "okay I did X" but it didn't. Then you say "No, you still didn't do X, try again and this time make sure to do X, and by X I mean..." and it says "okay this time I did X" and nope, still didn't. The "do X" in this case was generating rdfs:label values in English. It kept saying it would do that, but it kept generating another set of Spanish labels. After a while I realized that the Document Analyzer probably wasn't the best tool to use. It can understand Turtle, but in my experience where it most excels (no pun intended) is dealing with CSV files. So I chose another special ChatGPT called the Code Tutor. I've had great results using this one to generate SPARQL queries that are a bit complex, such as ones with complex REGEX matching. It always takes me several tries to figure out a non-trivial REGEX, but the Code Tutor (at least so far) gets them right the first time.
So I used a SPARQL query to list all the existing Spanish labels and saved them to a file, one label per line. Then I took approximately the first 25 lines and fed those to the Code Tutor, along with a sample SPARQL query to transform one label from Spanish to English. I also moved the Spanish labels to the skos:altLabel property so I would still have them, and planned to use the SPARQL query to put the English labels on the rdfs:label property. The first SPARQL query that the Code Tutor generated (this is a condensed version; the full query had more entries, but this is enough to get the idea) was:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
INSERT {?s rdfs:label "Organization"@en .
?s rdfs:label "Commercial Action"@en .
?s rdfs:label "Development Activity"@en .
?s rdfs:label "R&D Activity"@en .}
WHERE {?s skos:altLabel "Organizacion"^^xsd:string .
?s skos:altLabel "Accion Comercial"^^xsd:string .
?s skos:altLabel "Actividad Desarrollo"^^xsd:string .
?s skos:altLabel "Actividad I D i"^^xsd:string .}
I tried this and... of course it didn't work. Can you spot the problem? It took me a few minutes but then it was obvious. Every pattern uses the same variable ?s, so this query will only match an entity that has "Organizacion" and "Accion Comercial" and all the other labels as a skos:altLabel. I explained the problem to the Code Tutor (working with these things is kind of an interesting experience; I even find myself making a joke to the LLM once in a while, and unlike most humans it thinks my jokes are funny). So the new query had the form:
INSERT {?s1 rdfs:label "Organization"@en .
?s2 rdfs:label "Commercial Action"@en .
?s3 rdfs:label "Development Activity"@en .
?s4 rdfs:label "R&D Activity"@en .}
WHERE {?s1 skos:altLabel "Organizacion"^^xsd:string .
?s2 skos:altLabel "Accion Comercial"^^xsd:string .
?s3 skos:altLabel "Actividad Desarrollo"^^xsd:string .
?s4 skos:altLabel "Actividad I D i"^^xsd:string .}
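A query of this shape can also be generated from a list of Spanish/English pairs rather than typed by hand. The Python sketch below is one hedged way to do that, not what the Code Tutor produced: the names `sparql_string` and `build_label_update` are hypothetical, labels are assumed to be single-line strings, and it emits one INSERT/WHERE operation per label (separated by ";" as SPARQL Update allows) so that a label that fails to match does not affect the others.

```python
PREFIXES = (
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n"
)


def sparql_string(text: str) -> str:
    """Quote text as a SPARQL string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_label_update(pairs):
    """Build one update request with a separate operation per (Spanish, English) pair."""
    operations = []
    for spanish, english in pairs:
        operations.append(
            f"INSERT {{?s rdfs:label {sparql_string(english)}@en .}}\n"
            f"WHERE {{?s skos:altLabel {sparql_string(spanish)}^^xsd:string .}}"
        )
    # Operations within a single update request are separated by ';'.
    return PREFIXES + " ;\n".join(operations)


if __name__ == "__main__":
    print(build_label_update([
        ("Organizacion", "Organization"),
        ("Accion Comercial", "Commercial Action"),
    ]))
```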
The revised query worked for virtually all the labels. There were a handful (approximately 6) that didn't work. I'm not sure why; I think they had some special character that renders differently from what is required to match the text, but the remaining ones are so few that they can be corrected manually. The revised ontology with Spanish and English labels can be found here: ITEMAS Revised GitHub Repository.
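I haven't confirmed that special characters were the cause, but if you hit the same thing, one way to test the hypothesis is to look at the actual code points in the label that won't match. The Python sketch below uses the standard `unicodedata` module; the helper names are made up, and the accented "Organización" is only an illustration, not one of the labels that failed. It shows how two strings can look identical on screen yet differ, for example when an accent is stored as a separate combining character.

```python
import unicodedata


def describe_characters(text: str):
    """List (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


def same_after_normalizing(a: str, b: str) -> bool:
    """True if the two strings are equal once both are converted to NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    composed = "Organizaci\u00f3n"     # accented o as a single character
    decomposed = "Organizacio\u0301n"  # plain o followed by a combining accent
    print(composed == decomposed)
    print(same_after_normalizing(composed, decomposed))
    print(describe_characters(decomposed))
```

If the two forms only match after normalizing, then the text pasted into the query and the text stored in the ontology are encoded differently, and an exact string match in SPARQL would fail for that reason.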
Addendum (11/4/24): I realized that the way I did that wasn't good design because I was using two different properties for English and Spanish. It is easy to change the Renderer options in Protégé to use different properties, but it wasn't good design to show a preference for English by putting the Spanish labels on skos:altLabel. Instead, I used language tags, with the following query:
DELETE {?s skos:altLabel ?csl}
INSERT {?s rdfs:label ?tsl}
WHERE {?s skos:altLabel ?csl.
BIND(STRLANG(?csl, "es") AS ?tsl)}
This puts both the English and Spanish labels on rdfs:label. In Protégé you can then set the language tag as needed, as shown below ("Set Language" at the bottom of the pop-up form). This shows the skos:altLabel property just for completeness, but I've deleted it from this ontology as it is no longer needed. This way the same ontology can be used by both Spanish and English users, and by changing the language tag in the Rendering option you can see the language you prefer.
Addendum (11/6/24): I just learned of an interesting tool to do this automatically and in a more controlled way. I haven't tried it but it looks very interesting, especially for large ontologies in Healthcare and other industry applications. It's from a group working on a Sickle Cell Anemia ontology: https://scdontology.h3abionet.org/index.php/ontology-translation-tool/