This documents a problem, and hopefully an initial solution, around the combination of technologies that is Moodle, an Oracle database, and content that is copy and pasted from Word.
Origins of the problem
The problem first arose at my current institution and its installation of Moodle, which is using Oracle (11g I believe) as the database. It first became apparent through BIM an application I’ve written but was also occasionally occuring with the Moodle wiki.
BIM allows students to create an individual blog on their choice of blog engine – mostly WordPress.com – and then provides management services to staff. Part of that is keeping a copy of all of the students’ blog posts within the Moodle database.
The problem was that some student posts weren’t being inserted into the database. They were reporting the following error
ORA-01704: string literal too long
These same posts work fine on my installation of Moodle – using MySQL – so it appeared to be an Oracle problem. The Oracle error message is related to the strange requirements Oracle has when you want to insert long strings of text. On “good” databases you just do an insert, like anything statement. However, on Oracle, if you are trying to insert a long string into a field that is a BLOB or CLOB, you have to use a different process involving an insert statement putting in a special empty field and then another step.
Gotta love a wonderfully designed and consistent enterprise database.
The question is, what is causing this problem?
After much diagnosis it appears that the presence of “special characters” created by students copying and pasting content from MS Word into their blog is at the core of the problem. When inserting these long posts into the database, what normally happens is that Moodle checks the length of the post, if it is greater than 4000 Moodle will jump through the special (and silly) hoops that Oracle requires.
However, for the problem posts when Moodle checks the length of the post, it is less than 4000. And the posts do have less than 4000 characters. However, when Moodle tries the normal insert process into Oracle, we get the above error message.
It appears that the problem is being caused by the presence of “special characters” from Word. These appear to be “tricking” Oracle into thinking that these posts are greater than 4000 characters.
The solution appears to be to clean up the posts before inserting them into the database.
The Moodle discussion forum uses exactly the same process for inserting discussion forum messages into the database. It doesn’t appear to have this same problem as the HTML editor in Moodle appears to do a reasonable job as cleaning up the “special characters”. Though, this might not be 100% successful.
In a perfect world, I want to put in some PHP code into BIM (at first, and perhaps into Moodle later) that does this cleaning. The obvious question is does or why doesn’t, Moodle support this already.
Existing Moodle support
Moodle has optional support for HTMLPurifier, however, not sure this is an exact match for this purpose. To be clear, the problem here isn’t cleaning up the HTML generated by Word. It’s the special characters for quotes, dashes etc that Word uses. In some cases this is actual “special characters, but for some reason, these are also appearing in the text as things like ’ for a single quote. I realise by that description I’m revealing that I haven’t bothered to dig to far into this, yet.
In addition, my main interest is solving this problem in BIM for the short term. So something that needs to be changed at the Moodle level is of little interest.
There’s this bit of PHP suggestion, essentially what I’m looking for. It does pick up some of the problems, but not all. In particular, it doesn’t deal with the “&#” issue.
Combining a few different bits and pieces brings me to this code
$badchr = array(
'â€œ', // left side double smart quote
'â€'.chr(157), // right side double smart quote
'â€˜', // left side single smart quote
'â€™', // right side single smart quote
'â€¦', // elipsis
'â€”', // em dash
'â€“', // en dash
'’', // single quote
'–' // dash
$goodchr = array('"', '"', "'", "'", "...", "-", "-",
$post = str_replace($badchr, $goodchr, $post);
It seems to work with a couple of ad hoc posts. Need to test it more completely.
Aim here will be to write a test harness that attempts to insert all of the posts made by students so far into the Oracle database. The idea/hope is that this should capture all of the problem posts and give some security that the above code is getting all of the problems.
The testing code is written and running. At first run it comes up with five blogs that have additional problems.
So, I’ve started doing a character by character examination of the posts to find the “funny” characters. I’m then adding these to the “cleaning” process. (Yes, I know this is kludgy).
By the time I’d “fixed” the second of the 5 posts, the subsequent posts were working. So, let’s run the lot again.
Fixed. Had more “special chars” to tweak.