As a MySQL or PHP developer, once you step beyond the comfortable confines of English-only character sets, you quickly find yourself entangled in the wonderfully wacky world of UTF-8 encoding.

A Quick UTF-8 Primer

Unicode is a widely used computing industry standard that defines a comprehensive mapping of unique numeric code values to the characters in most of today's written character sets, to aid system interoperability and data interchange.

UTF-8 is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32. UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages.

UTF-8 encodes each character using one to four bytes. The first 128 characters of Unicode correspond one-to-one with ASCII, making valid ASCII text also valid UTF-8-encoded text. It is for this reason that systems limited to the English character set are insulated from the complexities that can otherwise arise with UTF-8.

For example, the Unicode hexadecimal code for the letter A is U+0041, which in UTF-8 is simply encoded with the single byte 41. In comparison, the Unicode hexadecimal code for the character 𣎴 is U+233B4, which in UTF-8 is encoded with the four bytes F0 A3 8E B4.
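The two encodings above are easy to check from PHP. The following is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" escape); the helper name utf8_bytes_hex is made up for this illustration and is not part of PHP.

```php
<?php
// Hypothetical helper: return the bytes of a string as space-separated,
// uppercase hex pairs, so a UTF-8 sequence can be read byte by byte.
function utf8_bytes_hex(string $s): string
{
    return strtoupper(implode(' ', str_split(bin2hex($s), 2)));
}

echo utf8_bytes_hex("A"), "\n";          // 41
echo utf8_bytes_hex("\u{233B4}"), "\n";  // F0 A3 8E B4
```

The first call shows the single ASCII-compatible byte; the second shows the four-byte sequence for a character outside the Basic Multilingual Plane.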

On a previous job, we began running into data encoding issues when displaying bios of artists from all over the world. It soon became apparent that there were problems with the stored data, as sometimes the data was correctly encoded and sometimes it was not.

This led programmers to implement a hodge-podge of patches, sometimes with JavaScript, sometimes with HTML charset meta tags, sometimes with PHP, and so on. Soon, we ended up with a list of 600,000 artist bios with double- or triple-encoded data, with data being stored in different ways depending on who programmed the feature or implemented the patch. A classic technical rat's nest.

Indeed, navigating through UTF-8 data encoding issues can be a frustrating and hair-pulling experience. This post provides a concise cookbook for addressing these UTF-8 problems when working with PHP and MySQL in particular, based on practical experience and lessons learned (and with thanks, in part, to information discovered here and here along the way).

Data encoding with UTF-8 Unicode for PHP and MySQL makes complex languages simple.

Specifically, we'll cover the following in this post:

  • Mods you'll need to make to your php.ini file and PHP code.
  • Mods you'll need to make to your my.ini file and other MySQL-related issues to be aware of (including config mods needed if you're using Sphinx)
  • How to migrate data from a MySQL database previously encoded in latin1 to instead use a UTF-8 encoding

PHP UTF-8 Encoding – modifications to your php.ini file:

The first thing you need to do is to modify your php.ini file to use UTF-8 as the default character set:

    default_charset = "utf-8"

(Note: You can subsequently use phpinfo() to verify that this has been set properly.)
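Besides reading the phpinfo() page, the setting can be checked from code. This is only a sketch: ini_get is a standard PHP function, while the helper name default_charset_is_utf8 is made up for this illustration.

```php
<?php
// Hypothetical helper: true when php.ini (or a runtime override) sets
// default_charset to UTF-8. The comparison ignores case, so "utf-8" passes.
function default_charset_is_utf8(): bool
{
    $charset = (string) ini_get('default_charset');
    return strcasecmp($charset, 'UTF-8') === 0;
}

echo default_charset_is_utf8()
    ? "default_charset is UTF-8\n"
    : "default_charset is '" . ini_get('default_charset') . "', expected UTF-8\n";
```

A check like this could run in a deployment smoke test, so that a server with a different php.ini is noticed early.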

OK cool, so now PHP and UTF-8 should work just fine together. Right?

Well, not exactly. In fact, not even close.

While this change will ensure that PHP always outputs UTF-8 as the character encoding (in browser response Content-Type headers), you still need to make a number of modifications to your PHP code to make sure that it properly processes and generates UTF-8 characters.

PHP UTF-8 Encoding – modifications to your code:

To be sure that your PHP code plays well in the UTF-8 data encoding sandbox, here are the things you need to do:

  • Set UTF-8 as the character set for all headers output by your PHP code

    In every PHP output header, specify UTF-8 as the encoding:

        header('Content-Type: text/html; charset=utf-8');
  • Specify UTF-8 as the encoding type for XML

        <?xml version="1.0" encoding="UTF-8"?>
  • Strip out unsupported characters from XML

    Since not all UTF-8 characters are accepted in an XML document, you'll need to strip any such characters out from any XML that you generate. A useful function for doing this (which I found here) is the following:

        // Replace every run of characters that XML 1.0 does not allow with a space.
        // The last range (U+10000 to U+10FFFF) keeps four-byte characters, which
        // XML 1.0 permits; without it they would be stripped as well.
        function utf8_for_xml($string)
        {
          return preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
                              ' ', $string);
        }

    Here's how you can use this function in your code:

        $safeString = utf8_for_xml($yourUnsafeString);
  • Specify UTF-8 as the character set for all HTML content

    For HTML content, specify UTF-8 as the encoding:

        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    In HTML forms, specify UTF-8 as the encoding:

        <form accept-charset="utf-8">
  • Specify UTF-8 as the encoding in all calls to htmlspecialchars

    e.g.:

        htmlspecialchars($str, ENT_NOQUOTES, "UTF-8")

    *Note: As of PHP 5.6.0, the default_charset value is used as the default. From PHP 5.4.0, UTF-8 was the default, but prior to PHP 5.4.0, ISO-8859-1 was used as the default. It's therefore a good idea to always explicitly specify UTF-8 to be safe, even though this argument is technically optional.

    Also note that, for UTF-8 output, htmlspecialchars and htmlentities can be used interchangeably for escaping purposes (htmlentities simply converts more characters to named entities).

  • Set UTF-8 as the default character set for all MySQL connections

    Specify UTF-8 as the default character set to use when exchanging data with the MySQL database using mysql_set_charset:

        $link = mysql_connect('localhost', 'user', 'password');
        mysql_set_charset('utf8', $link);

    Note that, as of PHP 5.5.0, mysql_set_charset is deprecated (the whole mysql extension was removed in PHP 7.0.0), and mysqli::set_charset should be used instead:

        $mysqli = new mysqli("localhost", "my_user", "my_password", "test");

        /* check connection */
        if (mysqli_connect_errno()) {
            printf("Connect failed: %s\n", mysqli_connect_error());
            exit();
        }

        /* change character set to utf8 */
        if (!$mysqli->set_charset("utf8")) {
            printf("Error loading character set utf8: %s\n", $mysqli->error);
        } else {
            printf("Current character set: %s\n", $mysqli->character_set_name());
        }

        $mysqli->close();
  • Always use UTF-8 compatible versions of string manipulation functions

    There are several PHP functions that will fail, or at least not behave as expected, if the character representation needs more than 1 byte (as UTF-8 does). An example is the strlen function, which returns the number of bytes rather than the number of characters.

    Two options are available for dealing with this:

    • The iconv functions that are available by default with PHP provide multibyte-compatible versions of many of these functions (e.g., iconv_strlen, etc.). Remember, though, that the strings you provide to these functions must themselves be properly encoded.

    • There is also the mbstring extension to PHP (information on enabling and configuring it is available here). This extension provides a comprehensive set of functions that properly account for multibyte encoding.
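To make the byte-versus-character difference concrete, here is a small sketch comparing strlen with its iconv and mbstring counterparts. It assumes both extensions are enabled; the sample word is written with explicit byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// "naïve": the ï is the two-byte UTF-8 sequence C3 AF.
$word = "na\xC3\xAFve";

echo strlen($word), "\n";                 // 6 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n";     // 5 (characters)
echo iconv_strlen($word, 'UTF-8'), "\n";  // 5 (characters)

// Byte-oriented substr() can cut a character in half and leave invalid UTF-8;
// mb_substr() counts in characters and keeps the sequence intact.
var_dump(mb_check_encoding(substr($word, 0, 3), 'UTF-8'));  // bool(false)
echo bin2hex(mb_substr($word, 0, 3, 'UTF-8')), "\n";        // 6e61c3af
```

The same pattern applies to other byte-oriented functions such as strpos and strtoupper, which have mb_ equivalents.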

MySQL UTF-8 Encoding – modifications to your my.ini file:

On the MySQL/UTF-8 side of things, modifications to the my.ini file are required as follows:

  • Set the following config parameters after each corresponding tag (note that MySQL's name for the character set is utf8, not UTF-8):

        [client]
        default-character-set=utf8

        [mysql]
        default-character-set=utf8

        [mysqld]
        character-set-client-handshake = false #force encoding to utf8
        character-set-server=utf8
        collation-server=utf8_general_ci

        [mysqld_safe]
        default-character-set=utf8

  • After making the above changes to your my.ini file, restart your MySQL daemon.

  • To verify that everything has properly been set to use the UTF-8 encoding, execute the following query:

        mysql> show variables like 'char%';

    The output should look something like:

        | character_set_client        | utf8                       |
        | character_set_connection    | utf8                       |
        | character_set_database      | utf8                       |
        | character_set_filesystem    | binary                     |
        | character_set_results       | utf8                       |
        | character_set_server        | utf8                       |
        | character_set_system        | utf8                       |
        | character_sets_dir          | /usr/share/mysql/charsets/ |

    If you instead see latin1 listed for any of these, double-check your configuration and make sure you've properly restarted your MySQL daemon.
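The same check can be automated from PHP. The sketch below works on an array of name => value pairs shaped like the output of the query above; the function name non_utf8_charset_settings and the sample data are made up for this illustration, and fetching the rows through mysqli is only indicated in a comment.

```php
<?php
// Hypothetical helper: given name => value pairs as returned by
// SHOW VARIABLES LIKE 'char%', return the settings that are not UTF-8.
function non_utf8_charset_settings(array $variables): array
{
    $problems = [];
    foreach ($variables as $name => $value) {
        // character_set_filesystem is expected to be "binary" and
        // character_sets_dir is a path, so neither is checked.
        if ($name === 'character_set_filesystem' || $name === 'character_sets_dir') {
            continue;
        }
        // Accept any value starting with "utf8" (utf8, utf8mb4, ...).
        if (strpos($value, 'utf8') !== 0) {
            $problems[$name] = $value;
        }
    }
    return $problems;
}

// Sample input. With mysqli, the rows could come from
// $mysqli->query("SHOW VARIABLES LIKE 'char%'") (not executed here).
$sample = [
    'character_set_client'     => 'utf8',
    'character_set_database'   => 'latin1',
    'character_set_filesystem' => 'binary',
    'character_sets_dir'       => '/usr/share/mysql/charsets/',
];
print_r(non_utf8_charset_settings($sample));
```

With the sample above, only character_set_database is reported.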

MySQL UTF-8 Encoding – other things to consider:

  • MySQL's utf8 character set is actually a partial implementation of the full UTF-8 character set. Specifically, MySQL's utf8 encoding uses a maximum of 3 bytes, whereas 4 bytes are required for encoding the full UTF-8 character set. This is fine for characters in the Basic Multilingual Plane, which covers most modern languages, but if you need to support astral symbols (whose code points range from U+010000 to U+10FFFF), those require a four-byte encoding which is not supported in MySQL's utf8. In MySQL 5.5.3, this was addressed with the addition of support for the utf8mb4 character set, which uses a maximum of four bytes per character and thereby supports the full UTF-8 character set. So if you're using MySQL 5.5.3 or later, use utf8mb4 instead of utf8 as your database/table/row character set. More info is available here.

  • If the connecting client has no way to specify the encoding for its communication with MySQL, after the connection is established you may have to run the following command/query:

        set names utf8;

  • When determining the size of varchar fields when modeling the database, don't forget that UTF-8 characters may require as many as four bytes per character.
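Whether existing data actually needs utf8mb4 can be tested from PHP before choosing a character set. This is a minimal sketch assuming PHP 7 or later and valid UTF-8 input; the helper names needs_utf8mb4 and max_bytes_utf8mb4 are made up for this illustration.

```php
<?php
// Hypothetical helper: true if the string contains an astral character
// (U+10000 to U+10FFFF), i.e. one that takes four bytes in UTF-8.
function needs_utf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

// Hypothetical helper: worst-case byte size of a field holding $chars
// characters when each character may take up to four bytes.
function max_bytes_utf8mb4(int $chars): int
{
    return $chars * 4;
}

var_dump(needs_utf8mb4("\u{0106}iro"));  // bool(false): Ć fits in two bytes
var_dump(needs_utf8mb4("\u{233B4}"));    // bool(true): four-byte character
echo max_bytes_utf8mb4(128), "\n";       // 512
```

For example, a varchar(128) column can occupy up to 512 bytes of character data under a four-byte character set.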

MySQL UTF-8 Encoding – if you use Sphinx:

  • In your Sphinx configuration file (i.e., sphinx.conf):

    • Set your index definition to have:

          charset_type = utf-8

    • Add the following to your source definition:

          sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
          sql_query_pre = SET NAMES utf8

  • Restart the engine and remake all indices.

  • If you want to configure Sphinx so that letters like C c Ć ć Ĉ ĉ Ċ ċ Č č are all treated as equivalent for search purposes, you will need to configure a charset_table (a.k.a. character folding), which is essentially an equivalency mapping between characters. More information is available here.

Migrating database data that is already encoded in latin1 to UTF-8

If you have an existing MySQL database that is already encoded in latin1, here's how to convert the latin1 to UTF-8:

  1. Make sure you've made all the modifications to the configuration settings in your my.ini file, as described above.

  2. Execute the following command:

        ALTER SCHEMA `your-db-name` DEFAULT CHARACTER SET utf8;

  3. Via command line, verify that everything is properly set to UTF-8:

        mysql> show variables like 'char%';

  4. Create a dump file with latin1 encoding for the table you want to convert:

        mysqldump -u USERNAME -pDB_PASSWORD --opt --skip-set-charset --default-character-set=latin1 \
                  --skip-extended-insert DATABASENAME --tables TABLENAME > \
                  DUMP_FILE_TABLE.sql

    e.g.:

        mysqldump -u root --opt --skip-set-charset --default-character-set=latin1 \
                  --skip-extended-insert artists-database --tables tbl_artist > \
                  tbl_artist.sql
  5. Do a global search and replace of the charset in the dump file from latin1 to utf8:

    e.g., using Perl:

        perl -i -pe 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/' DUMP_FILE_TABLE.sql

    Note to Windows users: This charset string replacement (from latin1 to utf8) can also be done using find-and-replace in WordPad (or some other text editor, such as vim). Be sure to save the file just as it is though (don't save it as a Unicode txt file!).

  6. From this point, we will start messing with the database data, so it would probably be prudent to back up the database if you haven't already done so. Then, restore the dump into the database:

        mysql> source "DUMP_FILE_TABLE.sql";
  7. Search for any records that may not have converted properly and correct them. Since non-ASCII characters are multi-byte by design, we can find them by comparing the byte length to the character length (i.e., to identify rows that may hold double-encoded UTF-8 characters that need to be fixed).

    • See if there are any records with multi-byte characters (if this query returns zero, then there don't appear to be any records with multi-byte characters in your table and you can proceed to Step 8).

          mysql> select count(*) from MY_TABLE where LENGTH(MY_FIELD) != CHAR_LENGTH(MY_FIELD);

    • Copy rows with multi-byte characters into a temporary table:

          create table temptable (
              select * from MY_TABLE where
              LENGTH(MY_FIELD) != CHAR_LENGTH(MY_FIELD));
    • Convert double-encoded UTF-8 characters to proper UTF-8 characters

      This is actually a bit tricky. A double-encoded string is one that was properly encoded as UTF-8. However, MySQL then did us the erroneous favor of converting it (from what it thought was latin1) to UTF-8 again, when we set the column to UTF-8 encoding. Resolving this therefore requires a two-step process through which we "trick" MySQL in order to prevent it from doing us this "favor".

      First, we set the encoding type for the column back to latin1, thereby removing the double encoding:

      e.g.:

          alter table temptable modify temptable.ArtistName varchar(128) character set latin1;

      Note: Be sure to use the correct field type for your table. In the example above, for our table, the correct field type for 'ArtistName' was varchar(128), but the field in your table could be text or any other type. Be sure to specify it properly!

      The problem is that now, if we set the column encoding back to UTF-8, MySQL will run the latin1 to UTF-8 data encoding for us again and we'll be back to where we started. To avoid this, we change the column type to blob and THEN we set it to UTF-8. This exploits the fact that MySQL will not attempt to encode a blob. We are thereby able to "fool" the MySQL charset conversion to avoid the double encoding issue.

      e.g.:

          alter table temptable modify temptable.ArtistName blob;
          alter table temptable modify temptable.ArtistName varchar(128) character set utf8;

      (Again, as noted above, be sure to use the proper field type for your table.)

    • Remove rows with only single-byte characters from the temporary table (note that the delete runs against temptable, not the original table):

          delete from temptable where LENGTH(MY_FIELD) = CHAR_LENGTH(MY_FIELD);

    • Re-insert fixed rows back into the original table (before doing this, you may want to run some selects on the temptable to verify that it appears to be properly corrected, just as a sanity check).

          replace into MY_TABLE (select * from temptable);

  8. Verify the remaining data and, if necessary, repeat the process in step 7 (this could be necessary, for example, if the data was triple encoded). Further errors, if any, may be easiest to resolve manually.
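To see what "double-encoded" means at the byte level, the mistake can be reproduced and reversed in PHP. This is a sketch for illustration, not the migration procedure above: it assumes the mbstring extension, uses ISO-8859-1 as a stand-in for MySQL's latin1 (which also maps bytes 80 to 9F differently, so strings containing those bytes are left unchanged here), and the function names are made up.

```php
<?php
// Simulate the mistake: take correct UTF-8 bytes, misread them as
// ISO-8859-1, and encode the result as UTF-8 a second time.
function double_encode(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Undo one layer of double encoding, but only when the result is valid UTF-8
// and re-encoding it reproduces the input exactly; otherwise return the input.
function undo_double_encoding(string $text): string
{
    $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
    $looksDoubleEncoded = $candidate !== $text
        && mb_check_encoding($candidate, 'UTF-8')
        && double_encode($candidate) === $text;

    return $looksDoubleEncoded ? $candidate : $text;
}

$good   = "caf\xC3\xA9";          // "café" as correct UTF-8 (5 bytes)
$broken = double_encode($good);   // displays as "cafÃ©" (7 bytes)

echo bin2hex($good), "\n";                          // 636166c3a9
echo bin2hex($broken), "\n";                        // 636166c383c2a9
var_dump(undo_double_encoding($broken) === $good);  // bool(true)
var_dump(undo_double_encoding($good) === $good);    // bool(true): left alone
```

A triple-encoded value needs the undo step applied twice. Note that a string that legitimately contains text such as "Ã©" cannot be told apart from double-encoded data by this check, which is one reason to verify results by eye before writing them back.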

Source code and resource files

One other thing to remember and verify is that your source code files, resource files, and so on, are all being saved properly with UTF-8 data encoding. Otherwise, any "special" characters in these files may not be handled correctly.

In Netbeans, for example, you can right-click on your project, choose Properties, and then in "Sources" you will find the data encoding option (it usually defaults to UTF-8, but it's worth checking).

Or in Windows Notepad, use the "Save As…" option in the File menu, and select the UTF-8 encoding option at the bottom of the dialog. (Note that the "Unicode" option that Notepad provides is actually UTF-16, so that's not what you want.)
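File encodings can also be checked in bulk rather than editor by editor. The sketch below, which assumes the mbstring extension, reports whether a file's bytes are valid UTF-8 and whether it starts with a UTF-8 byte order mark; the function name inspect_utf8_file is made up for this illustration.

```php
<?php
// Hypothetical helper: report whether a file's contents are valid UTF-8
// and whether the file starts with a UTF-8 byte order mark (EF BB BF).
function inspect_utf8_file(string $path): array
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return [
        'valid_utf8' => mb_check_encoding($bytes, 'UTF-8'),
        'has_bom'    => strncmp($bytes, "\xEF\xBB\xBF", 3) === 0,
    ];
}

// Demo on a temporary file: the same word saved as UTF-8, then as Latin-1.
$tmp = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($tmp, "caf\xC3\xA9");
var_dump(inspect_utf8_file($tmp)['valid_utf8']);  // bool(true)
file_put_contents($tmp, "caf\xE9");
var_dump(inspect_utf8_file($tmp)['valid_utf8']);  // bool(false)
unlink($tmp);
```

A file saved with Notepad's "Unicode" (UTF-16) option would likewise be reported as invalid UTF-8.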

Wrap-up

Although it can be somewhat tedious, taking the time to go through these steps to systematically address your MySQL and PHP UTF-8 data encoding issues can ultimately save you a great deal of time and grief. In the long run, this type of methodical approach is far superior to the all-too-common tendency to just keep patching the system.

This guide hopefully emphasizes the importance of taking the charset definition into consideration when setting up a project environment in the first place, and of working in a software project environment that properly accounts for character encoding in its manipulation of text and strings.

Source: https://www.toptal.com/php/a-utf-8-primer-for-php-and-mysql
