General
Lutaml::Model XML adapters use a default encoding of UTF-8 for both input and output.
Data being parsed (deserialization) or exported (serialization) may use a character encoding different from the default encoding of the Lutaml::Model XML adapter. Such a mismatch may cause data to be read incorrectly or lead to incompatibilities when exporting data.
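To make the export-side incompatibility concrete, here is a minimal plain-Ruby sketch. It uses only core `String#encode` and calls no Lutaml::Model API, so it illustrates the underlying problem rather than what any adapter actually does. The sample text is adapted from an example later in this document.

```ruby
# Plain Ruby only: no Lutaml::Model call is made here.
text = "A ∑ series of ∏ porcelain µ vases."

# US-ASCII cannot represent ∑, ∏ or µ, so a strict conversion raises.
strict_error =
  begin
    text.encode("US-ASCII")
    nil
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

# With `xml: :text`, Ruby substitutes a hexadecimal character reference
# for each character the target encoding cannot represent.
escaped = "∑".encode("US-ASCII", xml: :text)

puts strict_error.inspect
puts escaped
```

Whether an XML adapter raises, substitutes character references, or does something else in this situation is adapter-specific; check the output of the adapter in use.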
The possible values for the character encoding are:
- A valid encoding value, e.g. UTF-8, Shift_JIS, ASCII;
- nil, to use the default encoding of the adapter. The behavior differs based on the adapter used:
  - Nokogiri: UTF-8. The encoding is set to the default encoding of the Nokogiri library, which is UTF-8.
  - Oga: UTF-8. The encoding is set to the default encoding of the Oga library, which uses UTF-8.
  - Ox: ASCII-8bit. The encoding is set to the default encoding of the Ox library, which uses ASCII-8bit.

When the encoding option is not set, the default encoding of UTF-8 is used.
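The values listed above can be checked against the encodings Ruby itself knows. The sketch below assumes a hypothetical helper, `known_encoding?`, which is not part of Lutaml::Model. It accepts nil (meaning "use the adapter default") and otherwise relies on the core `Encoding.find` lookup.

```ruby
# Hypothetical helper (not a Lutaml::Model API): reports whether a value
# is usable as an encoding name. nil is accepted because it stands for
# "use the adapter default".
def known_encoding?(value)
  return true if value.nil?

  Encoding.find(value) # raises ArgumentError for unknown names
  true
rescue ArgumentError
  false
end

checks = ["UTF-8", "Shift_JIS", "ASCII", nil, "NOT-AN-ENCODING"].map do |value|
  [value, known_encoding?(value)]
end

checks.each { |value, ok| puts "#{value.inspect}: #{ok}" }

# Ruby resolves "ASCII" to US-ASCII and treats ASCII-8BIT as raw binary.
ascii_is_us_ascii = Encoding.find("ASCII") == Encoding::US_ASCII
ascii_8bit_is_binary = Encoding.find("ASCII-8BIT") == Encoding::BINARY
puts ascii_is_us_ascii
puts ascii_8bit_is_binary
```

Because ASCII-8BIT carries no character semantics, text handled under that default is treated as a raw byte sequence, which is relevant to the Ox behavior described above.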
Serialization character encoding (exporting)
General
There are two ways to set the character encoding of the XML document during serialization:
- Instance setting: set the instance-level encoding option with ModelClassInstance.encoding = '…'. This setting only affects serialization.
- Per-export setting: set the encoding option when serializing with the ModelClassInstance.to_xml(…, encoding: …) method.

Instance setting
The encoding value of an instance sets the character encoding of the XML document during serialization.
Syntax:
ModelClassInstance.encoding = {encoding_value}

Where,
- ModelClassInstance: An instance of the class that inherits from Lutaml::Model::Serializable.
- {encoding_value}: The encoding of the output data.

class JapaneseCeramic < Lutaml::Model::Serializable
  attribute :glaze_type, :string
  attribute :description, :string

  xml do
    root 'JapaneseCeramic'
    map_attribute 'glazeType', to: :glaze_type
    map_element 'description', to: :description
  end
end

# Create a new instance with UTF-8 data
> instance = JapaneseCeramic.new(glaze_type: "志野釉", description: "東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)")
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="志野釉", @description="東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)">
# Set character encoding to Shift_JIS
> instance.encoding = "Shift_JIS"
#=> "Shift_JIS"
# Serialize the instance
> serialization_output = instance.to_xml
#=> #<JapaneseCeramic><glazeType>\x{5FD8}\x{91CE}\x{91C9}</glazeType><description>\x{6771}\x{4EAC}\x{56FD}\x{7ACB}\x{535A}\x{7269}\x{9928}\x{30B3}\x{30EC}\x{30AF}\x{30B7}\x{30E7}\x{30F3}\x{306E}\x{7BC0}\x{8336}\x{7897}\x{300C}\x{6A4B}\x{672C}\x{300D}\x{FF08}\x{6853}\x{5C71}\x{6642}\x{4EE3}\x{FF09}</description></JapaneseCeramic>
# Check character encoding of output
> serialization_output.encoding
#=> "Shift_JIS"

Per-export setting
The encoding option is used in the ModelClass#to_xml(…, encoding: …) call to set the character encoding of the XML document during serialization.
The per-export encoding setting supersedes the instance-level encoding setting.
Syntax:
ModelClassInstance.to_xml(encoding: {encoding_value})

Where,
- ModelClassInstance: An instance of the class that inherits from Lutaml::Model::Serializable.
- {encoding_value}: The encoding of the output data.
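The precedence between the two settings can be modelled in a few lines of plain Ruby. The function `effective_encoding` below is hypothetical and is not how Lutaml::Model is implemented. It also treats nil simply as "not given", which glosses over the adapter-specific meaning of nil described earlier.

```ruby
# Hypothetical model of the precedence rule (not Lutaml::Model code):
# a per-export value wins over the instance value, and UTF-8 is the
# fallback when neither is given.
def effective_encoding(instance_encoding, per_export_encoding = nil)
  per_export_encoding || instance_encoding || "UTF-8"
end

xml = "<glazeType>志野釉</glazeType>"

# Only the instance-level setting is present.
instance_only = xml.encode(effective_encoding("Shift_JIS"))

# The per-export setting supersedes the instance-level one.
overridden = xml.encode(effective_encoding("Shift_JIS", "UTF-8"))

# Neither setting is present.
fallback = effective_encoding(nil)

puts instance_only.encoding
puts overridden.encoding
puts fallback
```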
The following class will parse the XML snippet below:
class Ceramic < Lutaml::Model::Serializable
  attribute :potter, :string
  attribute :description, :string
  attribute :temperature, :integer

  xml do
    root 'ceramic'
    map_element 'potter', to: :potter
    map_content to: :description
  end
end

<ceramic><potter>John &amp; Jane</potter> A ∑ series of ∏ porcelain µ vases.</ceramic>

# Object with attributes
> ceramic_instance = Ceramic.new(potter: "John & Jane", description: " A ∑ series of ∏ porcelain µ vases.")
> #<Ceramic:0x0000000104ac7240 @potter="John & Jane", @description=" A ∑ series of ∏ porcelain µ vases.">
# Parsing the XML snippet with the default encoding of UTF-8
> ceramic_parsed = Ceramic.from_xml(xml)
> #<Ceramic:0x0000000104ac7242 @potter="John & Jane", @description=" A ∑ series of ∏ porcelain µ vases.">
# Object with attributes is equal to the parsed object
> ceramic_parsed === ceramic_instance
> # true
# Using the default encoding of UTF-8
> ceramic_instance.to_xml
> #<ceramic><potter>John &amp; Jane</potter> A ∑ series of ∏ porcelain µ vases.</ceramic>
# Using the default encoding of the adapter, which is UTF-8 in this case
> ceramic_instance.to_xml(encoding: nil)
> #<ceramic><potter>John &amp; Jane</potter> A ∑ series of ∏ porcelain µ vases.</ceramic>
# Using ASCII encoding
> ceramic_instance.to_xml(encoding: "ASCII")
> #<ceramic><potter>John &amp; Jane</potter> A ∑ series of ∏ porcelain µ vases.</ceramic>

to_xml overrides instance encoding

class JapaneseCeramic < Lutaml::Model::Serializable
  attribute :glaze_type, :string
  attribute :description, :string

  xml do
    root 'JapaneseCeramic'
    map_attribute 'glazeType', to: :glaze_type
    map_element 'description', to: :description
  end
end

# Create a new instance with UTF-8 data
> instance = JapaneseCeramic.new(glaze_type: "志野釉", description: "東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)")
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="志野釉", @description="東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)">
# Set character encoding to Shift_JIS
> instance.encoding = "Shift_JIS"
#=> "Shift_JIS"
# Serialize the instance
> serialization_output = instance.to_xml(encoding: "UTF-8")
#=> #<JapaneseCeramic><glazeType>志野釉</glazeType><description>東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)</description></JapaneseCeramic>
# Check character encoding of output
> serialization_output.encoding
#=> "UTF-8"

Deserialization character encoding (parsing)
The character encoding of the XML document being parsed is specified with the encoding option when calling ModelClass.from_{format}(…).
Syntax:
ModelClass.from_{format}(string_in_format, encoding: {encoding_value})

Where,
- ModelClass: The class that inherits from Lutaml::Model::Serializable.
- {format}: The format of the input data, e.g. xml, json, yaml, toml.
- string_in_format: The input data in the specified format.
- {encoding_value}: The encoding of the input data.
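Conceptually, declaring the input encoding tells the parser how to interpret the incoming bytes. The plain-Ruby sketch below shows that idea with core String methods only. Whether a given adapter performs these exact steps internally is an assumption, not something this document states.

```ruby
# Simulate input that arrives as raw bytes, for example read from a file
# in binary mode. The bytes are Shift_JIS, but Ruby does not know that yet.
raw = "<glazeType>志野釉</glazeType>".encode("Shift_JIS").b

# Step 1: label the bytes with the encoding they were written in.
labelled = raw.dup.force_encoding("Shift_JIS")

# Step 2: transcode to UTF-8 for further processing.
decoded = labelled.encode("UTF-8")

puts raw.encoding
puts labelled.valid_encoding?
puts decoded
```

The label in step 1 plays the role of the encoding option: without it, the bytes carry no information about which characters they represent.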
Using the encoding option when parsing data not encoded in the default encoding (UTF-8)

Using the definition of JapaneseCeramic at Instance setting.
This XML snippet is in Shift_JIS.

<JapaneseCeramic>
  <glazeType>\x{5FD8}\x{91CE}\x{91C9}</glazeType>
  <description>\x{6771}\x{4EAC}\x{56FD}\x{7ACB}\x{535A}\x{7269}\x{9928}\x{30B3}\x{30EC}\x{30AF}\x{30B7}\x{30E7}\x{30F3}\x{306E}\x{7BC0}\x{8336}\x{7897}\x{300C}\x{6A4B}\x{672C}\x{300D}\x{FF08}\x{6853}\x{5C71}\x{6642}\x{4EE3}\x{FF09}</description>
</JapaneseCeramic>

# Parse the XML snippet with the encoding of Shift_JIS
> instance = JapaneseCeramic.from_xml(xml, encoding: "Shift_JIS")
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="志野釉", @description="東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)">
# Check character encoding of the instance
> instance.encoding
#=> "Shift_JIS"
# Serialize the instance using UTF-8
> serialization_output = instance.to_xml(encoding: "UTF-8")
#=> #<JapaneseCeramic><glazeType>志野釉</glazeType><description>東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)</description></JapaneseCeramic>
> serialization_output.encoding
#=> "UTF-8"

When the encoding option is not set, the default encoding of the adapter is used

Using the definition of JapaneseCeramic at Instance setting.
This XML snippet is in UTF-8.

<JapaneseCeramic>
  <glazeType>志野釉</glazeType>
  <description>東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)</description>
</JapaneseCeramic>

In adapters that use a default encoding of UTF-8, the content is parsed properly.
> instance = JapaneseCeramic.from_xml(xml, encoding: nil)
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="志野釉", @description="東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)">
> instance.encoding
#=> "UTF-8"
> serialization_output = instance.to_xml
#=> #<JapaneseCeramic><glazeType>志野釉</glazeType><description>東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)</description></JapaneseCeramic>
> serialization_output.encoding
#=> "UTF-8"

In adapters that use a default encoding of ASCII-8bit, the content becomes malformed.
> instance = JapaneseCeramic.from_xml(xml, encoding: nil)
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="菑", @description="東京国立博物館コレクションの篠茶碗橋本桃山時代">
> instance.encoding
#=> "ASCII-8bit"
> serialization_output = instance.to_xml
#=> #<JapaneseCeramic><glazeType>菑</glazeType><description>東京国立博物館コレクションの篠茶碗橋本桃山時代</description></JapaneseCeramic>
> serialization_output.encoding
#=> "ASCII-8bit"

Using the definition of JapaneseCeramic at Instance setting.
This XML snippet is in UTF-8.

<JapaneseCeramic>
  <glazeType>志野釉</glazeType>
  <description>東京国立博物館コレクションの篠茶碗「橋本」(桃山時代)</description>
</JapaneseCeramic>

> JapaneseCeramic.from_xml(xml, encoding: "Shift_JIS")
#=> #<JapaneseCeramic:0x0000000104ac7240 @glaze_type="菑pP", @description="東京国立博物館コレクションの篠茶碗橋本桃山時代">
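The garbling above comes from interpreting UTF-8 bytes as if they were Shift_JIS. As a rough plain-Ruby illustration, which uses core String methods only and makes no claim about adapter internals, the same mislabelling can be reproduced and detected before parsing:

```ruby
utf8_xml = "<glazeType>志野釉</glazeType>"

# Wrongly label the UTF-8 bytes as Shift_JIS. No bytes are changed;
# only Ruby's interpretation of them is.
mislabelled = utf8_xml.dup.force_encoding("Shift_JIS")

# A cheap guard: check whether the bytes even form valid Shift_JIS.
looks_valid = mislabelled.valid_encoding?

# Transcode leniently so the damage is visible instead of raising.
garbled = mislabelled.encode("UTF-8", invalid: :replace, undef: :replace)

puts looks_valid
puts garbled == utf8_xml
```

A `valid_encoding?` check like this catches only byte sequences that are structurally invalid in the declared encoding. Mislabelled input that happens to be structurally valid still passes and is decoded into the wrong characters.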