General

Lutaml::Model XML adapters use a default encoding of UTF-8 for both input and output.

Data to be parsed (deserialization) and data to be exported (serialization) may use a character encoding that differs from the default encoding of the Lutaml::Model XML adapter. Such a mismatch can cause data to be read incorrectly, or can produce output that is incompatible with its consumer.

The encoding option accepts the following values:

  • A valid encoding value, e.g. UTF-8, Shift_JIS, ASCII;

  • nil to use the default encoding of the adapter. The behavior differs based on the adapter used.

    • Nokogiri: UTF-8. The encoding is set to the default encoding of the Nokogiri library, which is UTF-8.

    • Oga: UTF-8. The encoding is set to the default encoding of the Oga library, which uses UTF-8.

    • Ox: ASCII-8bit. The encoding is set to the default encoding of the Ox library, which uses ASCII-8bit.

When the encoding option is not set, the default encoding of UTF-8 is used.
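
The three cases above (explicit value, nil, option not set) can be sketched in plain Ruby. This is only an illustration of the rules as described here, not Lutaml::Model's actual implementation: `resolve_encoding`, `ADAPTER_DEFAULT_ENCODINGS` and `NOT_SET` are hypothetical names, and the adapter defaults are copied from the list above.

```ruby
# Hypothetical sketch of the resolution rules described above.
# None of these names belong to the Lutaml::Model API.
ADAPTER_DEFAULT_ENCODINGS = {
  nokogiri: "UTF-8",
  oga: "UTF-8",
  ox: "ASCII-8BIT",
}.freeze

# Sentinel that distinguishes "option omitted" from an explicit nil.
NOT_SET = Object.new.freeze

def resolve_encoding(adapter, encoding = NOT_SET)
  name =
    if encoding.equal?(NOT_SET)
      "UTF-8"                                   # option not set
    elsif encoding.nil?
      ADAPTER_DEFAULT_ENCODINGS.fetch(adapter)  # nil: adapter default
    else
      encoding                                  # explicit value
    end

  # Encoding.find raises ArgumentError for names Ruby does not know.
  Encoding.find(name)
end

p resolve_encoding(:ox)               # option not set
p resolve_encoding(:ox, nil)          # adapter default
p resolve_encoding(:oga, "Shift_JIS") # explicit value
```

Passing the name through `Encoding.find` is one way to reject a misspelled encoding early, before any data is read or written.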

Serialization character encoding (exporting)

General

There are two ways to set the character encoding of the XML document during serialization:

Instance setting

Set the instance-level encoding with ModelClassInstance.encoding = '…'. This setting only affects serialization.

Per-export setting

Set the encoding option when serializing with the ModelClassInstance.to_xml(…, encoding: …) method.

Instance setting

The encoding value of an instance sets the character encoding of the XML document during serialization.

Syntax:

ModelClassInstance.encoding = {encoding_value}

Where,

ModelClassInstance

An instance of the class that inherits from Lutaml::Model::Serializable.

{encoding_value}

The encoding of the output data.

Example 1. Character encoding set to instance is reflected in its serialization output
class JapaneseCeramic < Lutaml::Model::Serializable
  attribute :glaze_type, :string
  attribute :description, :string

  xml do
    root 'JapaneseCeramic'
    map_attribute 'glazeType', to: :glaze_type
    map_element 'description', to: :description
  end
end
# Create a new instance with UTF-8 data
> instance = JapaneseCeramic.new(glaze_type: "志野釉", description: "東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)")
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="志野釉", @description="東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)">

# Set character encoding to Shift_JIS
> instance.encoding = "Shift_JIS"
#=> "Shift_JIS"

# Serialize the instance
> serialization_output = instance.to_xml
#=> #<JapaneseCeramic><glazeType>\x{5FD8}\x{91CE}\x{91C9}</glazeType><description>\x{6771}\x{4EAC}\x{56FD}\x{7ACB}\x{535A}\x{7269}\x{9928}\x{30B3}\x{30EC}\x{30AF}\x{30B7}\x{30E7}\x{30F3}\x{306E}\x{7BC0}\x{8336}\x{7897}\x{300C}\x{6A4B}\x{672C}\x{300D}\x{FF08}\x{6853}\x{5C71}\x{6642}\x{4EE3}\x{FF09}</description></JapaneseCeramic>

# Check character encoding of output
> serialization_output.encoding
#=> "Shift_JIS"

Per-export setting

The encoding option is used in the ModelClass#to_xml(…​, encoding: …​) call to set the character encoding of the XML document during serialization.

The per-export encoding setting supersedes the instance-level encoding setting.

Syntax:

ModelClassInstance.to_xml(encoding: {encoding_value})

Where,

ModelClassInstance

An instance of the class that inherits from Lutaml::Model::Serializable.

{encoding_value}

The encoding of the output data.

The following class will parse the XML snippet below:

class Ceramic < Lutaml::Model::Serializable
  attribute :potter, :string
  attribute :description, :string
  attribute :temperature, :integer

  xml do
    root 'ceramic'
    map_element 'potter', to: :potter
    map_content to: :description
  end
end
<ceramic><potter>John &#x0026; Jane</potter> A &#x2211; series of &#x220F; porcelain &#xB5; vases.</ceramic>
# Object with attributes
> ceramic_instance = Ceramic.new(potter: "John & Jane", description: " A ∑ series of ∏ porcelain µ vases.")
#=> #<Ceramic:0x0000000104ac7240 @potter="John & Jane", @description=" A ∑ series of ∏ porcelain µ vases.">

# Parsing the XML snippet with the default encoding of UTF-8
> ceramic_parsed = Ceramic.from_xml(xml)
#=> #<Ceramic:0x0000000104ac7242 @potter="John & Jane", @description=" A ∑ series of ∏ porcelain µ vases.">

# Object with attributes is equal to the parsed object
> ceramic_parsed === ceramic_instance
#=> true

# Using the default encoding of UTF-8
> ceramic_instance.to_xml
#=> #<ceramic><potter>John &amp; Jane</potter> A ∑ series of ∏ porcelain µ vases.</ceramic>

# Using the default encoding of the adapter, which is UTF-8 in this case
> ceramic_instance.to_xml(encoding: nil)
#=> #<ceramic><potter>John &amp; Jane</potter> A &#x2211; series of &#x220F; porcelain &#xB5; vases.</ceramic>

# Using ASCII encoding
> ceramic_instance.to_xml(encoding: "ASCII")
#=> #<ceramic><potter>John &amp; Jane</potter> A &#8721; series of &#8719; porcelain &#181; vases.</ceramic>
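
The outputs above show characters that ASCII cannot represent being written as numeric character references. The same effect can be reproduced with Ruby's own `String#encode`, independent of Lutaml::Model or any XML adapter. The helper names below are hypothetical; how each adapter produces its references internally is not covered here.

```ruby
# Plain Ruby illustration; no XML adapter is involved.

# xml: :text escapes &, < and > and replaces characters that the target
# encoding cannot represent with hexadecimal numeric character references.
def ascii_with_hex_references(text)
  text.encode("US-ASCII", xml: :text)
end

# A fallback proc is called for each unrepresentable character; here it
# builds a decimal reference from the character's code point.
# Note that this variant does not escape &, < or >.
def ascii_with_decimal_references(text)
  text.encode("US-ASCII", fallback: ->(char) { "&#%d;" % char.ord })
end

puts ascii_with_hex_references("John & Jane: a ∑ series")
puts ascii_with_decimal_references("a ∑ series of ∏ vases")
```

Both forms refer to the same code points (U+2211 is 8721 in decimal), so an XML parser reads them back as the same characters.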
Example 2. Character encoding set at to_xml overrides instance encoding
class JapaneseCeramic < Lutaml::Model::Serializable
  attribute :glaze_type, :string
  attribute :description, :string

  xml do
    root 'JapaneseCeramic'
    map_attribute 'glazeType', to: :glaze_type
    map_element 'description', to: :description
  end
end
# Create a new instance with UTF-8 data
> instance = JapaneseCeramic.new(glaze_type: "志野釉", description: "東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)")
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="志野釉", @description="東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)">

# Set character encoding to Shift_JIS
> instance.encoding = "Shift_JIS"
#=> "Shift_JIS"

# Serialize the instance
> serialization_output = instance.to_xml(encoding: "UTF-8")
#=> #<JapaneseCeramic><glazeType>志野釉</glazeType><description>東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)</description></JapaneseCeramic>

# Check character encoding of output
> serialization_output.encoding
#=> "UTF-8"
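
The precedence shown in Example 2 can be mimicked with a small stand-in class. This is a sketch under stated assumptions: `EncodingAwareDocument` and `export` are hypothetical and are not part of Lutaml::Model, and in this sketch a nil per-call value simply means "not given" rather than "adapter default".

```ruby
# Hypothetical stand-in for a model instance; not the Lutaml::Model API.
class EncodingAwareDocument
  attr_accessor :encoding

  def initialize(text)
    @text = text
    @encoding = nil
  end

  # Precedence: per-call encoding, then instance-level encoding,
  # then the UTF-8 default.
  def export(encoding: nil)
    @text.encode(encoding || @encoding || "UTF-8")
  end
end

doc = EncodingAwareDocument.new("東京")
doc.encoding = "Shift_JIS"

p doc.export.encoding                    # instance-level setting applies
p doc.export(encoding: "UTF-8").encoding # per-call setting wins
```

Keeping the per-call value first in the chain means a caller can override an instance's encoding for a single export without mutating the instance.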

Deserialization character encoding (parsing)

The character encoding of the document being parsed is specified with the encoding option when ModelClass.from_{format}(…) is called.

Syntax:

ModelClass.from_{format}(string_in_format, encoding: {encoding_value})

Where,

ModelClass

The class that inherits from Lutaml::Model::Serializable.

{format}

The format of the input data, e.g. xml, json, yaml, toml.

string_in_format

The input data in the specified format.

{encoding_value}

The encoding of the input data.

Example 3. Setting the encoding option during parsing data not encoded in the default encoding (UTF-8)

This example uses the JapaneseCeramic class defined in the "Instance setting" section.

This XML snippet is in Shift_JIS.

<JapaneseCeramic>
  <glazeType>\x{5FD8}\x{91CE}\x{91C9}</glazeType>
  <description>\x{6771}\x{4EAC}\x{56FD}\x{7ACB}\x{535A}\x{7269}\x{9928}\x{30B3}\x{30EC}\x{30AF}\x{30B7}\x{30E7}\x{30F3}\x{306E}\x{7BC0}\x{8336}\x{7897}\x{300C}\x{6A4B}\x{672C}\x{300D}\x{FF08}\x{6853}\x{5C71}\x{6642}\x{4EE3}\x{FF09}</description>
</JapaneseCeramic>
# Parse the XML snippet with the encoding of Shift_JIS
> instance = JapaneseCeramic.from_xml(xml, encoding: "Shift_JIS")
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="志野釉", @description="東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)">

# Check character encoding of the instance
> instance.encoding
#=> "Shift_JIS"

# Serialize the instance using UTF-8
> serialization_output = instance.to_xml(encoding: "UTF-8")
#=> #<JapaneseCeramic><glazeType>志野釉</glazeType><description>東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)</description></JapaneseCeramic>
> serialization_output.encoding
#=> "UTF-8"
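
What Example 3 relies on is that bytes are decoded with the encoding they were written in before being converted to another one. A minimal plain-Ruby sketch of that step follows; `decode_as` is a hypothetical helper and does not show how any adapter handles its input.

```ruby
# Plain Ruby illustration; no XML adapter is involved.

# Interpret raw bytes using the encoding they were written in,
# then transcode the result to UTF-8.
def decode_as(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Simulate input that arrives as Shift_JIS bytes with no encoding label.
raw_input = "東京".encode("Shift_JIS").b

decoded = decode_as(raw_input, "Shift_JIS")
p decoded
p decoded.encoding
```

`force_encoding` only changes the label on the bytes, while `encode` rewrites the bytes themselves, which is why the source encoding has to be known first.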
Example 4. When the encoding option is not set, the default encoding of the adapter is used

This example uses the JapaneseCeramic class defined in the "Instance setting" section.

This XML snippet is in UTF-8.

<JapaneseCeramic>
  <glazeType>志野釉</glazeType>
  <description>東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)</description>
</JapaneseCeramic>

In adapters that use a default encoding of UTF-8, the content is parsed properly.

> instance = JapaneseCeramic.from_xml(xml, encoding: nil)
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="志野釉", @description="東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)">
> instance.encoding
#=> "UTF-8"
> serialization_output = instance.to_xml
#=> #<JapaneseCeramic><glazeType>志野釉</glazeType><description>東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)</description></JapaneseCeramic>
> serialization_output.encoding
#=> "UTF-8"

In adapters that use a default encoding of ASCII-8bit, the content becomes malformed.

> instance = JapaneseCeramic.from_xml(xml, encoding: nil)
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="菑", @description="東京国立博物館コレクションの篠茶碗橋本桃山時代">
> instance.encoding
#=> "ASCII-8bit"
> serialization_output = instance.to_xml
#=> #<JapaneseCeramic><glazeType>菑</glazeType><description>東京国立博物館コレクションの篠茶碗橋本桃山時代</description></JapaneseCeramic>
> serialization_output.encoding
#=> "ASCII-8bit"
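
The reason an ASCII-8bit default loses information can be seen with plain Ruby strings: a binary-tagged string carries the same bytes but no notion of multi-byte characters. The sketch below is independent of Ox and Lutaml::Model, and `string_profile` is a hypothetical helper.

```ruby
# Plain Ruby illustration; no XML adapter is involved.
def string_profile(text)
  { encoding: text.encoding, length: text.length, bytesize: text.bytesize }
end

utf8_text = "東京"
binary_text = utf8_text.b # same bytes, tagged as ASCII-8BIT (binary)

p string_profile(utf8_text)   # 2 characters stored in 6 bytes
p string_profile(binary_text) # every byte counts as its own "character"
```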
Example 5. Using an incorrect encoding to deserialize causes data corruption

This example uses the JapaneseCeramic class defined in the "Instance setting" section.

This XML snippet is in UTF-8.

<JapaneseCeramic>
  <glazeType>志野釉</glazeType>
  <description>東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)</description>
</JapaneseCeramic>
> JapaneseCeramic.from_xml(xml, encoding: "Shift_JIS")
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="菑pP", @description="東京国立博物館コレクションの篠茶碗橋本桃山時代">
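
The corruption in Example 5 comes from reading UTF-8 bytes as if they were Shift_JIS. The effect can be reproduced in plain Ruby. This is a sketch only: `misread` is a hypothetical helper, and the exact garbled characters an adapter produces may differ from what Ruby's transcoder yields.

```ruby
# Plain Ruby illustration; no XML adapter is involved.

# Relabel the bytes with the wrong encoding, then convert to UTF-8.
# Byte sequences with no meaning in the wrong encoding are replaced
# instead of raising an error.
def misread(text, wrong_encoding)
  text.b.force_encoding(wrong_encoding)
      .encode("UTF-8", invalid: :replace, undef: :replace)
end

original = "東京"
garbled = misread(original, "Shift_JIS")

puts garbled
puts garbled == original
```

The original text cannot in general be recovered from the garbled result, because replaced byte sequences are lost, so the encoding has to be right at parse time.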