Strange character encoding of stored data: the old script shows it fine, the new one doesn't
I'm trying to rewrite an old website.
It's in Persian, which uses Perso-Arabic characters.
CREATE DATABASE `db` DEFAULT CHARACTER SET utf8 COLLATE utf8_persian_ci;
USE `db`;
Almost all of my tables and columns use the utf8_persian_ci collation.
I'm using CodeIgniter for my new script, and I have
'char_set' => 'utf8',
'dbcollat' => 'utf8_persian_ci',
in the database settings, so there is no problem there.
Here is the strange part: the old script uses some sort of database engine called TUBADBENGINE (or TUBA DB ENGINE), which is nothing special.
When I enter Persian data into the database using the old script and then look at the database, the characters are stored like عمران.
The old script fetches and displays that data fine, but the new script shows it with the same weird characters that appear in the database.
So when I enter اااا, the stored data looks like عمراÙ. When I fetch it in the new script I see عمراÙ, but in the old script I see اااا.
CREATE TABLE IF NOT EXISTS `tnewsgroups` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`fName` varchar(200) COLLATE utf8_persian_ci DEFAULT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_persian_ci AUTO_INCREMENT=11 ;
--
-- Dumping data for table `tnewsgroups`
--
INSERT INTO `tnewsgroups` (`ID`, `fName`) VALUES
(1, 'عمران'),
(2, 'معماری'),
(3, 'برق'),
(4, 'مکانیک'),
(5, 'test'),
(6, 'test2');
On the other hand, when I enter ااااا directly in the database, the same اااا is of course stored in the database.
The new script shows it fine, but in the old script I get ????.
Can anyone make sense of this?
Here is the TUBA engine:
https://github.com/maxxxir/mz-codeigniter-crud/blob/master/tuba.php
Usage example from the old script:
define("database_type" , "MYSQL");
define("database_ip" , "localhost");
define("database_un" , "root");
define("database_pw" , "");
define("database_name" , "nezam2");
define("database_connectionstring" , "");
$db = new TUBADBENGINE(database_type , database_ip , database_un , database_pw , database_name , database_connectionstring);
$db->Select("SELECT * FROM tnews limit 3");
if ($db->Lasterror() != "") { echo "<B><Font color=red>ÎØÇ ! áØÝÇ ãÌÏøÏÇ ÊáÇÔ ˜äíÏ"; exit(); }
for ($i = 0 ; $i < $db->Count() ; $i++) {
$row = $db->Next();
var_dump($row);
}
In short, because this has been discussed a thousand times before:
- PHP holds a string, say "漢字", encoded in UTF-8. The bytes for this are E6 BC A2 E5 AD 97.
- It sends this string over a database connection which is set to latin1.
- The database receives the bytes E6 BC A2 E5 AD 97, thinking those represent latin1 characters.
- The database stores the characters æ¼¢å (the characters that E6 BC A2 E5 AD 97 maps to in latin1).
- The same process reversed makes PHP receive the same bytes, which it then treats as UTF-8. The roundtrip works fine for PHP, even though the database doesn't treat the characters as it should.
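The steps above can be simulated in plain PHP without a database. This is only a sketch under assumptions: it needs the mbstring extension, it uses 'Windows-1252' as a stand-in for what MySQL calls latin1, and the variable names are made up for illustration. It uses one of the Persian values from the question instead of "漢字".

```php
<?php
// Illustrative sketch only: simulates the round trip, no database involved.
// Assumption: 'Windows-1252' approximates MySQL's latin1 for these bytes.

$original = "عمران"; // UTF-8 string as PHP holds it

// What the database stores: every UTF-8 byte is read as one latin1 character.
$stored = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

// Old script: reads over the same mis-set connection, so the stored
// characters are turned back into the original bytes.
$oldScript = mb_convert_encoding($stored, 'Windows-1252', 'UTF-8');

// New script: reads over a utf8 connection and receives the stored
// characters unchanged.
$newScript = $stored;

echo bin2hex($original), "\n";         // d8b9d985d8b1d8a7d986
echo $stored, "\n";                    // the mojibake the new script displays
var_dump($oldScript === $original);    // bool(true)
var_dump($newScript === $original);    // bool(false)
```

The old script only appears to work because it makes the same mistake in both directions; the data in the table is what the new script shows.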
So the problem here was that the database connection was set incorrectly when the data was entered into the database. You'll have to convert the data in the database to the correct characters. Try this:
SELECT CONVERT(BINARY CONVERT(field_name USING latin1) USING utf8) FROM table_name
Maybe utf8 isn't what you need here; experiment. If that works, change this into an UPDATE statement to update the data permanently.
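Before running an UPDATE, it can help to preview the same conversion on individual values in PHP. The helper below is a hypothetical sketch, not part of CodeIgniter or the TUBA engine: the name repairMojibake is invented here, it relies on mbstring, and it again assumes 'Windows-1252' matches the database's latin1. It leaves a value untouched unless the conversion is lossless and yields valid UTF-8, so rows that were stored correctly are not damaged.

```php
<?php
// Hypothetical helper (name and approach are assumptions for illustration).
// Undoes "UTF-8 bytes stored as latin1 characters" for a single string.
function repairMojibake(string $value): string
{
    // Turn the stored characters back into the raw bytes they came from.
    $bytes = mb_convert_encoding($value, 'Windows-1252', 'UTF-8');

    // Lossless only if converting back reproduces the input exactly;
    // correctly stored Persian text fails here because it does not fit latin1.
    $lossless = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252') === $value;

    // Accept the result only if those bytes are themselves valid UTF-8.
    if ($lossless && mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return $value;
}

echo repairMojibake("Ø¹Ù…Ø±Ø§Ù†"), "\n"; // repaired to the Persian word
echo repairMojibake("عمران"), "\n";      // already correct, left as is
echo repairMojibake("test"), "\n";       // plain ASCII, left as is
```

This is meant for checking a handful of rows; the SQL conversion remains the way to fix the table itself.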