Reading an Excel file using OLEDB is returning unexpected results when reading surrogate pairs

  c++, excel, oledb

I’m currently trying to read an excel file that contains unicode surrogate pairs. The Excel spreadsheet contains the following table.

ProductID ProductCode ProductName ProductLine ProductType
62 EC2 Trifle Desserts Frozen
64 EC4 ๐Ÿ˜€๐Ÿ˜๐Ÿ˜‚๐Ÿ˜ƒ๐Ÿ˜„๐Ÿ˜…๐Ÿ˜†๐Ÿ˜‡๐Ÿ˜ˆ๐Ÿ˜‰ Desserts Frozen
65 EC5 ใ“ใ‚“ใซใกใฏ Desserts Frozen
66 EC6 ใ‚ญ๐Ÿ˜„๐Ÿ˜…๐Ÿ˜†๐Ÿ˜‡ Desserts Frozen
66 EC7 ๐Ÿ˜‡๐Ÿ˜ˆ๐Ÿ˜‰ Desserts Frozen
66 EC8 ๐Ÿˆ€ Desserts Frozen
66 EC8 ๐Ÿˆ™ Desserts Frozen

I’m then trying to read the table via OLEDB using the following query

SELECT DISTINCT IIF([ProductName] IS NULL, 'BLANK', [ProductName]) AS [ProductName] FROM [Sheet1$]

This returns the following

ProductName
Trifle
ใ“ใ‚“ใซใกใฏ
ใ‚ญ๐Ÿ˜„๐Ÿ˜…๐Ÿ˜†๐Ÿ˜‡
๐Ÿˆ™

In which is has treated all the surrogate pair based entries as non-distinct, which isn’t what I would expect the result to be. I would expect all the entries in the above table to be distinct.

I previously hit a snag when accessing SQL via OLEDB, but that was related to the collation of the SQL column being incorrect for distinguishing surrogate pairs.

I imagine I am doing something wrong with regards to the Excel columnfile format, but haven’t been able to solve my issue. Is there an equivalent of collation for Excel files that might fix this issue? As I’m guessing when it’s deciding if something is distinct, the comparison is treating all surrogate pair based entries as the same.

Source: Windows Questions C++

LEAVE A COMMENT