A Few MySQL Issues
October 1st, 2009
I got really disillusioned with the StackOverflow Offline Viewer (which I’m calling FlowOnTheGo, at least for now) last week when I tried to load just the question/answer bodies into a testing table I made in MySQL.
The first issue I ran into had a lot to do with escaping quotes. Namely that I wasn’t doing it at all. In a moment of clarity, I realized that I could just use parameterized queries and everything would be ok. This was actually largely true, but surprisingly I was still getting a fair number of problem rows. After a bit of investigation, I discovered that the common thread between these rows was that they contained unicode characters. I didn’t understand how this could be, because I looked into it a bit and found articles which told me that “as of MySQL version SomethingLowerThanI’mUsing the default charset is utf8.” UTF8 supports unicode, what with it being right there in the name and all (Unicode Transformation Format), but unicode was choking my poor database.
As it turns out, the default charset for my particular installation was latin1, not utf8, so it found unicode characters to be weird and scary things. I changed the charset, and eventually a few other settings related to data transmission, to utf8 and all was right with the world.
All was right, that is, if you happened to have 10 hours to watch the database (in the words of Yahtzee Croshaw) “grind itself retarded.” As it turns out, a program which grabs only one record from an XML file at a time, nicely packages it up in a parameterized query, and sends that one lonely record off to the database is simply not a fast way to get the job done. I did toy around with sending more than one record per INSERT query, but it turns out that that’s just not much faster. Also, you have to be careful about how large your query becomes, because by default you can only send up to 1MB at a time, and I’d really rather that not happen when I release this to the public.
I asked StackOverflow for quick ways to load the database, and the best suggestion I found was to convert my XML data into a CSV and use MySQL’s LOAD DATA command to pull it into my table. I gave it a shot, but once again forgot about the quotes. When all was said and done it only took 3 minutes to write the CSV file and another 3 minutes to load it into the database. I’d call that a significant improvement.
Next on my list of things to do is finding a nice way to preprocess text before sending it to the database. I need to do this for three reasons:
- MySQL’s FULLTEXT indexes, by default, won’t index any word under 4 characters in length. This means that unless I ask everyone who installs the application to modify this server-wide, words like PHP and SQL won’t show up when you search for them. To get around this, I want to change words like XML into XML_Edit_PaddingText.
- Characters which are not alphanumeric, single apostrophes, or the underscore character break words. This means that, word length limitations notwithstanding, the term C# and C++ would each be truncated to C. I want to edit these to C_Edit_Sharp and C_Edit_PlusPlus, respectively.
- MySQL’s “Stop Words” list (i.e. words which aren’t indexed because they’re too common) is a bit too comprehensive for my needs. I think the words ‘get’ and ‘value’ would be relevant enough to programming-related searches that they should be included. These will become get_Edit_StopWord and value_Edit_Stopword, respectively.
I have a feeling this will be a little trickier than I would like, but that’s part of the fun of this project.
Leave a Reply