Minimize QR-Code size by encoding data using multiple modes and by using an algorithm that computes the shortest possible bit stream for any given input data
#1444 · Closed · AlexGeller1 opened this issue on Sep 30, 2021 · 5 comments
Currently the data is encoded using a single mode so that for example the presence of a single nonnumeric character in an otherwise numeric input string makes the encoder use BYTE encoding in place of a combination of NUMERIC and ALPHANUMERIC or BYTE encoding.
As a result, the GS1 string 010950110153003171407021012345a is currently encoded using 248 bits in BYTE mode requiring a version 4 code whereas it could be encoded in 158 bits fitting in a smaller version 3 code.
A good implementation of compaction could also spare users from having to specify the encoding via EncodeHintType.CHARACTER_SET. Instead, the algorithm could find the shortest byte representation for multi-language text, possibly using multiple ECI switches. The examples below illustrate this.
Background:
I stumbled on this topic because I was asked to provide a fix that would allow placing more than one FNC1 in a GS1 QR-Code in order to make a particular scanner work.
It turns out that the request is invalid (unlike all other GS1 codes, GS1 QR-Code does not use FNC1 to delimit AIs) but a side effect of the work is that I implemented the multi-mode encoding so that it minimizes the size.
Make a pull request?
My question now is whether a pull request with such code might be considered. In that case I would improve the code so that it does not recurse proportionally to the input length, and use a leaner internal representation of a solution (possibly no explicit representation at all, reconstructing the solution from the memoization array instead).
Algorithm proposed in the specification (ISO/IEC 18004):
Annex H of the QR-Code specification (titled "Optimization of bit stream length") contains a fast algorithm which, in the specification's own words, "may form the basis of one possible algorithm to determine the shortest bit stream for any given input data".
The speed and memory requirements of this algorithm probably can't be improved, and the implementation looks very straightforward. Moreover, the results are likely completely sufficient for all practical cases but they surely aren't minimal in all cases.
As an example, consider the input "123A". The algorithm from Annex H looks ahead only three characters and hence encodes as NUMERIC(123),BYTE(A) (4+10+10+4+8+8 = 44 bits in versions 1-9) while it could encode more compactly as ALPHANUMERIC(12,3A) (4+9+11+11 = 35 bits in versions 1-9).
Similarly, the string ABCD is minimally encoded as ALPHANUMERIC(AB,CD), while ABCDE is minimally encoded as BYTE(A,B,C,D,E) (solution wouldn't be found with a lookahead of 2) and ABCDEFG is minimally encoded as ALPHANUMERIC(AB,CD,EF),BYTE(G).
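For reference, the bit counts quoted above can be reproduced by summing the 4-bit mode indicator, the character count field (10, 9 and 8 bits for NUMERIC, ALPHANUMERIC and BYTE in versions 1-9) and the data bits of each segment. The sketch below is illustrative only; the class and method names are made up for this example:

```java
public class SegmentCost {
    // Bits for one segment in versions 1-9: 4-bit mode indicator,
    // count field (NUMERIC=10, ALPHANUMERIC=9, BYTE=8), plus data bits.
    static int numeric(int digits) {
        // 10 bits per digit triple; remainders of 1 or 2 digits cost 4 or 7 bits.
        int data = (digits / 3) * 10 + (digits % 3 == 2 ? 7 : digits % 3 == 1 ? 4 : 0);
        return 4 + 10 + data;
    }
    static int alphanumeric(int chars) {
        // 11 bits per character pair; a trailing single character costs 6 bits.
        int data = (chars / 2) * 11 + (chars % 2) * 6;
        return 4 + 9 + data;
    }
    static int byteMode(int bytes) {
        return 4 + 8 + 8 * bytes;
    }
    public static void main(String[] args) {
        // Annex H lookahead result for "123A": NUMERIC(123) + BYTE(A)
        System.out.println(numeric(3) + byteMode(1)); // 24 + 20 = 44
        // Minimal result: ALPHANUMERIC(12,3A)
        System.out.println(alphanumeric(4));          // 4 + 9 + 22 = 35
    }
}
```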
The algorithm also doesn't consider switching ECIs to minimize BYTE encoded text.
Example: the string "ŐŜ" is minimally encoded using UTF-8 as ECI(UTF-8),BYTE(Ő,Ŝ), while the string "ŐŐŜŜ", built from the same two characters, is minimally encoded using the two ECI modes ISO-8859-2 and ISO-8859-3 as ECI(ISO-8859-2),BYTE(Ő,Ő),ECI(ISO-8859-3),BYTE(Ŝ,Ŝ) rather than with a single ECI to UTF-8, because these characters have a two-byte representation in UTF-8.
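The byte counts driving this ECI choice can be checked directly with the JDK's charset support (assuming the JRE ships the ISO-8859-2 and ISO-8859-3 charsets, which is the case for common OpenJDK builds):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EciByteCount {
    public static void main(String[] args) {
        String o = "\u0150"; // Ő (O with double acute)
        String s = "\u015C"; // Ŝ (S with circumflex)
        // In UTF-8 each of these characters takes two bytes...
        System.out.println(o.getBytes(StandardCharsets.UTF_8).length);        // 2
        // ...but only one byte in the single-byte charset that covers it.
        System.out.println(o.getBytes(Charset.forName("ISO-8859-2")).length); // 1
        System.out.println(s.getBytes(Charset.forName("ISO-8859-3")).length); // 1
    }
}
```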
Proposed alternative algorithm:
Alternatively, an exact algorithm (Dijkstra's shortest path, dynamic programming, or divide and conquer with memoization) yields mathematically minimal solutions, as the attached program demonstrates (it uses divide and conquer with memoization).
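For illustration, a much-simplified bottom-up sketch of such a dynamic program over split points (versions 1-9 header sizes only, no KANJI or ECI, a reduced alphanumeric set of uppercase letters and digits, and the same segment granularity as the attached tool; all names here are invented for the sketch, and the tool's figures additionally include a 4-bit TERMINATOR):

```java
public class MinimalSplit {
    // Per-segment header bits in versions 1-9: 4-bit mode indicator plus
    // count field (NUMERIC: 10, ALPHANUMERIC: 9, BYTE: 8).
    static int header(char m) { return m == 'N' ? 14 : m == 'A' ? 13 : 12; }
    // Granularity as in the attached tool: digit triples, alphanumeric pairs, single bytes.
    static int granularity(char m) { return m == 'N' ? 3 : m == 'A' ? 2 : 1; }
    static int bitsPerUnit(char m) { return m == 'N' ? 10 : m == 'A' ? 11 : 8; }
    static boolean fits(char m, char c) {
        if (m == 'N') return c >= '0' && c <= '9';
        if (m == 'A') return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z'); // reduced set
        return true; // BYTE, assuming a single-byte charset
    }
    // O(n^2) dynamic program: best[i] = minimal bits for the suffix starting at i.
    static int minimalBits(String s) {
        int n = s.length();
        int[] best = new int[n + 1]; // best[n] = 0: empty suffix costs nothing
        for (int i = n - 1; i >= 0; i--) {
            best[i] = Integer.MAX_VALUE;
            for (char m : new char[]{'N', 'A', 'B'}) {
                int g = granularity(m);
                int j = i;
                while (j < n && fits(m, s.charAt(j))) {
                    j++; // extend the candidate segment [i, j) while the mode can encode it
                    if ((j - i) % g == 0) { // segment length must be a multiple of the granularity
                        best[i] = Math.min(best[i],
                            header(m) + (j - i) / g * bitsPerUnit(m) + best[j]);
                    }
                }
            }
        }
        return best[0];
    }
    public static void main(String[] args) {
        System.out.println(minimalBits("123A"));  // 35: ALPHANUMERIC(12,3A)
        System.out.println(minimalBits("ABCDE")); // 52: BYTE(A,B,C,D,E), 56 with terminator
    }
}
```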
In zxing, minimal encoding should probably be activated on demand by an encoding hint because no matter how efficiently this is implemented, it will not be as efficient as the current code.
Another reason may be that some scanners have problems reading the optimized codes. During my tests I observed scanning problems with a German application called bctester (it also has problems with some non-optimized codes) and an Android app called "Barcode Scanner". Reading the images back with Google zxing works fine, but with the default recognition built into my vanilla Android phone's app, codes using ISO-8859-6 encoding, for example, do not display the characters correctly (I didn't test it, but I expect the same failure if that encoding is set via EncodeHintType.CHARACTER_SET using the current implementation).
The class MinimalEncoder.java:
The code can be found at the bottom.
Producing test output:
The class contains a main method for testing. It reads space separated input strings from the command line, computes minimal representations and prints a string representation of each.
Invocation example:
$java -ea MinimalEncoder A AB ABC ABCD ABCDE ABCDEF ABCDEFG 1 12 123 1234 12345 123456 123A A1 A12 A123 A1234 AB1 AB12 AB123 AB1234 ABC1 ABC12 ABC1234 http://foo.com HTTP://FOO.COM 1001114670010%01201220%107211220%140045003267781 --unicodeEscapes '\u0150' --unicodeEscapes '\u015C' --unicodeEscapes '\u0150\u015C' --unicodeEscapes '\u0150\u0150\u015C\u015C' --unicodeEscapes 'abcdef\u0150ghij' --unicodeEscapes '2938928329832983\u01502938928329832983\u015C2938928329832983' --unicodeEscapes 'that particularly stands out to me is \u0625\u0650\u062C\u064E\u0651\u0627\u0635 (\u02BEijj\u0101\u1E63) \u201Cpear\u201D, suggested to have originated from Hebrew \u05D0\u05B7\u05D2\u05B8\u05BC\u05E1 (ag\u00E1s)'
Minimal encoding of string "A" requires 24 bits in version 1 and is encoded as follows:BYTE(A),TERMINATOR()
Minimal encoding of string "AB" requires 28 bits in version 1 and is encoded as follows:ALPHANUMERIC(AB),TERMINATOR()
Minimal encoding of string "ABC" requires 40 bits in version 1 and is encoded as follows:BYTE(A,B,C),TERMINATOR()
Minimal encoding of string "ABCD" requires 39 bits in version 1 and is encoded as follows:ALPHANUMERIC(AB,CD),TERMINATOR()
Minimal encoding of string "ABCDE" requires 56 bits in version 1 and is encoded as follows:BYTE(A,B,C,D,E),TERMINATOR()
Minimal encoding of string "ABCDEF" requires 50 bits in version 1 and is encoded as follows:ALPHANUMERIC(AB,CD,EF),TERMINATOR()
Minimal encoding of string "ABCDEFG" requires 70 bits in version 1 and is encoded as follows:ALPHANUMERIC(AB,CD,EF),BYTE(G),TERMINATOR()
Minimal encoding of string "1" requires 24 bits in version 1 and is encoded as follows:BYTE(1),TERMINATOR()
Minimal encoding of string "12" requires 28 bits in version 1 and is encoded as follows:ALPHANUMERIC(12),TERMINATOR()
Minimal encoding of string "123" requires 28 bits in version 1 and is encoded as follows:NUMERIC(123),TERMINATOR()
Minimal encoding of string "1234" requires 39 bits in version 1 and is encoded as follows:ALPHANUMERIC(12,34),TERMINATOR()
Minimal encoding of string "12345" requires 52 bits in version 1 and is encoded as follows:ALPHANUMERIC(12),NUMERIC(345),TERMINATOR()
Minimal encoding of string "123456" requires 38 bits in version 1 and is encoded as follows:NUMERIC(123,456),TERMINATOR()
Minimal encoding of string "123A" requires 39 bits in version 1 and is encoded as follows:ALPHANUMERIC(12,3A),TERMINATOR()
Minimal encoding of string "A1" requires 28 bits in version 1 and is encoded as follows:ALPHANUMERIC(A1),TERMINATOR()
Minimal encoding of string "A12" requires 40 bits in version 1 and is encoded as follows:BYTE(A,1,2),TERMINATOR()
Minimal encoding of string "A123" requires 39 bits in version 1 and is encoded as follows:ALPHANUMERIC(A1,23),TERMINATOR()
Minimal encoding of string "A1234" requires 52 bits in version 1 and is encoded as follows:ALPHANUMERIC(A1),NUMERIC(234),TERMINATOR()
Minimal encoding of string "AB1" requires 40 bits in version 1 and is encoded as follows:BYTE(A,B,1),TERMINATOR()
Minimal encoding of string "AB12" requires 39 bits in version 1 and is encoded as follows:ALPHANUMERIC(AB,12),TERMINATOR()
Minimal encoding of string "AB123" requires 52 bits in version 1 and is encoded as follows:ALPHANUMERIC(AB),NUMERIC(123),TERMINATOR()
Minimal encoding of string "AB1234" requires 50 bits in version 1 and is encoded as follows:ALPHANUMERIC(AB,12,34),TERMINATOR()
Minimal encoding of string "ABC1" requires 39 bits in version 1 and is encoded as follows:ALPHANUMERIC(AB,C1),TERMINATOR()
Minimal encoding of string "ABC12" requires 56 bits in version 1 and is encoded as follows:BYTE(A,B,C,1,2),TERMINATOR()
Minimal encoding of string "ABC1234" requires 63 bits in version 1 and is encoded as follows:ALPHANUMERIC(AB,C1),NUMERIC(234),TERMINATOR()
Minimal encoding of string "http://foo.com" requires 128 bits in version 1 and is encoded as follows:BYTE(h,t,t,p,:,/,/,f,o,o,.,c,o,m),TERMINATOR()
Minimal encoding of string "HTTP://FOO.COM" requires 94 bits in version 1 and is encoded as follows:ALPHANUMERIC(HT,TP,:/,/F,OO,.C,OM),TERMINATOR()
Minimal encoding of string "1001114670010%01201220%107211220%140045003267781" requires 257 bits in version 2 and is encoded as follows:NUMERIC(100,111,467),ALPHANUMERIC(00,10,%0,12,01,22,0%,10,72,11,22,0%),NUMERIC(140,045,003,267,781),TERMINATOR()
Minimal encoding of string "Ő" requires 36 bits in version 1 and is encoded as follows:ECI(ISO-8859-2),BYTE(.),TERMINATOR()
Minimal encoding of string "Ŝ" requires 36 bits in version 1 and is encoded as follows:ECI(ISO-8859-3),BYTE(.),TERMINATOR()
Minimal encoding of string "ŐŜ" requires 60 bits in version 1 and is encoded as follows:ECI(UTF-8),BYTE(.,.),TERMINATOR()
Minimal encoding of string "ŐŐŜŜ" requires 84 bits in version 1 and is encoded as follows:ECI(ISO-8859-2),BYTE(.,.),ECI(ISO-8859-3),BYTE(.,.),TERMINATOR()
Minimal encoding of string "abcdefŐghij" requires 116 bits in version 1 and is encoded as follows:ECI(ISO-8859-2),BYTE(a,b,c,d,e,f,.,g,h,i,j),TERMINATOR()
Minimal encoding of string "2938928329832983Ő2938928329832983Ŝ2938928329832983" requires 284 bits in version 3 and is encoded as follows:NUMERIC(293,892,832,983,298),ECI(ISO-8859-2),BYTE(3,.),NUMERIC(293,892,832,983,298),ECI(ISO-8859-3),BYTE(3,.,2),NUMERIC(938,928,329,832,983),TERMINATOR()
Minimal encoding of string "that particularly stands out to me is إِجَّاص (ʾijjāṣ) “pear”, suggested to have originated from Hebrew אַגָּס (agás)" requires 1108 bits in version 7 and is encoded as follows:ECI(ISO-8859-6),BYTE(t,h,a,t, ,p,a,r,t,i,c,u,l,a,r,l,y, ,s,t,a,n,d,s, ,o,u,t, ,t,o, ,m,e, ,i,s, ,.,.,.,.,.,.,., ,(),ECI(UTF-8),BYTE(.,i,j,j,.,.,), ,.,p,e,a,r,.,,, ,s,u,g,g,e,s,t,e,d, ,t,o, ,h,a,v,e, ,o,r,i,g,i,n,a,t,e,d, ,f,r,o,m, ,H,e,b,r,e,w, ,.,.,.,.,.,., ,(,a,g,.,s,)),TERMINATOR()
$
Example of using the class to populate a BitArray with a minimal solution:
In the actual patch, where this class is called from com.google.zxing.qrcode.encoder.Encoder, there is an additional twist: the total size of the minimal encoding may not fit in the computed version.
The function Encoder.willFit() is used to check this, and the following loop makes up to two additional calls to the minimal encoder to find a fitting version:
MinimalEncoder.ResultNode rn = MinimalEncoder.encode(content, version);
while (!willFit(rn.getSize(), rn.getVersion(ecLevel), ecLevel)) {
  if (rn.getVersion(ecLevel).getVersionNumber() <= 26) {
    int nextVersionNumber = rn.getVersion(ecLevel).getVersionNumber() <= 9 ? 10 : 27;
    System.err.println("DEBUG: INFO: " + rn.getSize() + " bits don't fit in version " + rn.getVersion(ecLevel) + ". Trying next size " + nextVersionNumber + "..");
    rn = MinimalEncoder.encode(content, Version.getVersionForNumber(nextVersionNumber));
  } else {
    throw new WriterException("Data too big for any version");
  }
}
Best regards,
Alex
Source of MinimalEncoder.java
//package com.google.zxing.qrcode.encoder;

import com.google.zxing.qrcode.decoder.Mode;
import com.google.zxing.qrcode.decoder.Version;
import com.google.zxing.common.BitArray;
import com.google.zxing.common.CharacterSetECI;
import com.google.zxing.WriterException;
import com.google.zxing.qrcode.decoder.ErrorCorrectionLevel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Vector;
import java.io.UnsupportedEncodingException;
import java.nio.charset.UnsupportedCharsetException;

/* Experimental encoder that encodes minimally using divide and conquer with memoization.
 *
 * Major limitation:
 * The implementation currently recurses for every character in the input, so it will overflow the stack on longer input.
 *
 * Version selection:
 * The version can be preset in the constructor. If it isn't specified then the algorithm computes three solutions for the three version classes 1-9, 10-26 and 27-40.
 *
 * It is not clear to me whether a solution using, for example, Medium (versions 10-26) could ever be smaller than a Small solution (versions 1-9); a proof for or against would be nice to have.
 * With hypothetical values for the number of length bits, the number of bits per mode and the number of bits per encoded character it can be shown that it can happen at all, as follows:
 * we hypothetically assume that a mode is encoded using 1 bit (instead of 4) and a character is encoded in BYTE mode using 2 bits (instead of 8). Using these values we now attempt to encode
 * the four characters "1234".
 * If we furthermore assume that in versions 1-9 the length field is 1 bit long, so that it can encode up to 2 characters, and that in versions 10-26 it is 2 bits long, so that it can encode
 * up to 4 characters, then it is more efficient to encode with versions 10-26 than with versions 1-9, as shown below:
 *
 * Number of length bits small version (1-9):   1
 * Number of length bits large version (10-26): 2
 * Number of bits per mode item:                1
 * Number of bits per character item:           2
 * BYTE(1,2),BYTE(3,4): 1+1+2+2 + 1+1+2+2 = 12 bits
 * BYTE(1,2,3,4):       1+2+2+2+2+2      = 11 bits
 *
 * If we however change the length field of the large encoding from 2 bits to 4 bits, so that it can potentially encode 16 items, then it is more efficient to encode using the small
 * encoding, as shown below:
 *
 * Number of length bits small version (1-9):   1
 * Number of length bits large version (10-26): 4
 * Number of bits per mode item:                1
 * Number of bits per character item:           2
 * BYTE(1,2),BYTE(3,4): 1+1+2+2 + 1+1+2+2 = 12 bits
 * BYTE(1,2,3,4):       1+4+2+2+2+2      = 13 bits
 *
 * But as mentioned, it is not clear to me whether this can ever happen with the actual values.
 *
 * ECI switching:
 *
 * In multi-language content the algorithm selects the most compact representation using ECI modes. For example,
 * text is more compactly represented using one ECI to UTF-8 rather than two ECIs to ISO-8859-6 and ISO-8859-1 if it contains proportionally more ASCII characters (since they are
 * represented as a one-byte sequence), as opposed to the case where there are proportionally more Arabic characters that require two bytes in UTF-8 and only one in ISO-8859-6.
 */
public class MinimalEncoder {
  final boolean printStats = false; // print memoization memory usage

  enum VersionSize {
    Small,
    Medium,
    Large;

    public String toString() {
      return "Small".equals(name()) ? "version 1-9" : "Medium".equals(name()) ? "version 10-26" : "version 27-40";
    }
  }

  String stringToEncode;
  Version version = null;
  CharsetEncoder[] encoders;

  // Average usage density should perhaps be monitored to see whether a hashtable would be preferable. The GS1 test example uses 67% of an array of 1727 entries. Encoding the URL
  // 'http://www.google.com' uses 27% of an array of size 756.
  // The dimension sizes of the array are:
  // - position-in-input: length of input in characters
  // - character-encoding: minimally 3 (ISO-8859-1, UTF-8 and UTF-16). The input is scanned up front and for example ISO-8859-6 is added if an Arabic character is encountered.
  // - version-class: exactly 3, representing versions 1-9, 10-26 and 27-40.
  // - Mode: exactly 4 (representing the modes KANJI, ALPHANUMERIC, NUMERIC and BYTE)
  ResultNode[][][][] memoizedResults; // [position-in-input][character-encoding][version-class][Mode]

  /** Encoding is optional (default ISO-8859-1) and version is optional (a minimal version is computed if not specified) */
  //TODO: Add possibility to force character encoding to cater for EncodeHintType.CHARACTER_SET.
  MinimalEncoder(String stringToEncode, Version version) {
this.stringToEncode = stringToEncode;
if (version != null) {
this.version = version;
}
    CharsetEncoder[] isoEncoders = new CharsetEncoder[15];
    isoEncoders[0] = StandardCharsets.ISO_8859_1.newEncoder();
    for (int i = 0; i < stringToEncode.length(); i++) {
      int cnt = 0;
      int j;
for (j = 0; j < 15; j++) {
if (isoEncoders[j] != null) {
cnt++;
if (isoEncoders[j].canEncode(stringToEncode.charAt(i))) {
break;
}
}
}
      if (cnt == 14) { // we need all; can stop looking further
        break;
      }
      if (j >= 15) { // no encoder found
        for (j = 0; j < 15; j++) {
          if (j != 11 && isoEncoders[j] == null) { // ISO-8859-12 doesn't exist
            try {
              CharsetEncoder ce = Charset.forName("ISO-8859-" + (j + 1)).newEncoder();
              if (ce.canEncode(stringToEncode.charAt(i))) {
                isoEncoders[j] = ce;
                break;
              }
            } catch (UnsupportedCharsetException e) { }
}
}
}
}
    int numberOfEncoders = 0;
    for (int j = 0; j < 15; j++) {
      if (isoEncoders[j] != null) {
        numberOfEncoders++;
      }
    }
    encoders = new CharsetEncoder[numberOfEncoders + 2];
    int index = 0;
    for (int j = 0; j < 15; j++) {
      if (isoEncoders[j] != null) {
        encoders[index++] = isoEncoders[j];
      }
    }
    encoders[index++] = StandardCharsets.UTF_8.newEncoder();
    encoders[index++] = StandardCharsets.UTF_16BE.newEncoder();
    memoizedResults = new ResultNode[stringToEncode.length()][encoders.length][3][4];
}
  public static void main(String[] args) throws Exception {
    // Invocation example:
    // java -ea MinimalEncoder A AB ABC ABCD ABCDE ABCDEF 1 12 123 1234 12345 123456 123A A1 A12 A123 A1234 AB1 AB12 AB123 AB1234 ABC1 ABC12 ABC1234 http://foo.com HTTP://FOO.COM 1001114670010%01201220%107211220%140045003267781 --unicodeEscapes '\u0150' --unicodeEscapes '\u015C' --unicodeEscapes '\u0150\u015C' --unicodeEscapes '\u0150\u015C\u0150' --unicodeEscapes 'abcdef\u0150ghij' --unicodeEscapes 'that particularly stands out to me is \u0625\u0650\u062C\u064E\u0651\u0627\u0635 (\u02BEijj\u0101\u1E63) \u201Cpear\u201D, suggested to have originated from Hebrew \u05D0\u05B7\u05D2\u05B8\u05BC\u05E1 (ag\u00E1s)'
    for (int i = 0; i < args.length; i++) {
      String input;
      if ("--unicodeEscapes".equals(args[i]) && i + 1 < args.length) {
        input = parseUnicodeEscapes(args[i + 1]);
        i++;
      } else if ("--nonascii".equals(args[i]) && i + 1 < args.length) {
        input = createNonASCII(Integer.parseInt(args[i + 1]));
        i++;
      } else {
        input = args[i];
      }
      ResultNode result = encode(input, null);
System.err.println("Minimal encoding of string \"" + input + "\" requires " + result.getSize() + " bits in version " + result.getVersion(ErrorCorrectionLevel.L) + " and is encoded as follows:" + result.toString());
//testDecode(result);
}
}
  static ResultNode encode(String stringToEncode, Version version) {
    return new MinimalEncoder(stringToEncode, version).encode();
  }

  ResultNode encode() {
    if (version == null) { // compute minimal encoding trying the three version sizes
      ResultNode[] results = {encode(0, 0, VersionSize.Small, null, 0),
                              encode(0, 0, VersionSize.Medium, null, 0),
                              encode(0, 0, VersionSize.Large, null, 0)};
      return addTerminatorIfNeeded(smallest(results));
    } else { // compute minimal encoding for a given version
      return addTerminatorIfNeeded(encode(0, 0, getVersionSize(version), null, 0));
    }
  }
  static String parseUnicodeEscapes(String s) {
    String result = "";
    for (int i = 0; i < s.length(); i++) {
      if (i + 5 < s.length() && s.charAt(i) == '\\' && s.charAt(i + 1) == 'u') {
        result += (char) Integer.parseInt(s.substring(i + 2, i + 2 + 4), 16);
        i += 5;
      } else {
        result += s.charAt(i);
      }
    }
    return result;
  }

  static String createNonASCII(int length) {
    String result = "";
    while (length-- > 0) {
      result += "\u0081";
    }
    return result;
  }
  static VersionSize getVersionSize(Version version) {
    return version.getVersionNumber() <= 9 ? VersionSize.Small : version.getVersionNumber() <= 26 ? VersionSize.Medium : VersionSize.Large;
  }

  static Version getVersion(VersionSize versionSize) {
    switch (versionSize) {
      case Small: return Version.getVersionForNumber(9);
      case Medium: return Version.getVersionForNumber(26);
      case Large:
      default: return Version.getVersionForNumber(40);
    }
  }

  static boolean isNumeric(char c) {
    return c >= '0' && c <= '9';
  }

  /* Probably can be implemented in a faster way */
  static boolean isDoubleByteKanji(char c) {
    return isOnlyDoubleByteKanji("" + c);
  }

  static boolean isAlphanumeric(char c) {
    return getAlphanumericCode(c) != -1;
  }
  /** Example: to encode alphanumerically at least 2 characters are needed (5.5 bits per character). Similarly, three digits are needed to encode numerically (3 1/3 bits per digit). */
  static int getEncodingGranularity(Mode mode) {
    switch (mode) {
      case KANJI: return 1;
      case ALPHANUMERIC: return 2;
      case NUMERIC: return 3;
      case BYTE: return 1;
      default:
        return 0;
    }
  }

  /** Example: to encode alphanumerically, 11 bits are used per 2 characters. Similarly, 10 bits are used to encode 3 numeric digits. */
  static int getBitsPerEncodingUnit(Mode mode) {
    switch (mode) {
      case KANJI: return 16;
      case ALPHANUMERIC: return 11;
      case NUMERIC: return 10;
      case BYTE: return 8;
      case ECI:
      default:
        return 0;
    }
  }

  /** Returns the maximum number of encodeable characters in the given mode for the given version. Example: in version 1, 2^10 digits or 2^8 bytes can be encoded. In versions 27-40 it is 2^14 digits and 2^16 bytes. */
  static int getMaximumNumberOfEncodeableCharacters(Version version, Mode mode) {
    int count = mode.getCharacterCountBits(version);
    return count == 0 ? 0 : 1 << count;
  }

  static int getMaximumNumberOfEncodeableCharacters(VersionSize versionSize, Mode mode) {
    return getMaximumNumberOfEncodeableCharacters(getVersion(versionSize), mode);
  }
  boolean canEncode(Mode mode, char c) {
    switch (mode) {
      case KANJI: return isDoubleByteKanji(c);
      case ALPHANUMERIC: return isAlphanumeric(c);
      case NUMERIC: return isNumeric(c);
      case BYTE: return true; // any character can be encoded as byte(s). Up to the caller to manage splitting into multiple bytes when String.getBytes(Charset) returns more than one byte.
      default:
        return false;
    }
  }

  static int getCompactedOrdinal(Mode mode) {
    if (mode == null) {
      return 0;
    }
    switch (mode) {
      case KANJI: return 0;
      case ALPHANUMERIC: return 1;
      case NUMERIC: return 2;
      case BYTE: return 3;
      default:
        assert false;
        return -1;
    }
  }
  static ResultNode smallest(ResultNode[] results) {
    ResultNode smallestResult = null;
    for (int i = 0; i < results.length; i++) {
      if (smallestResult == null || (results[i] != null && results[i].getSize() < smallestResult.getSize())) {
        smallestResult = results[i];
      }
    }
    return smallestResult;
  }

  static ResultNode smallest(Vector<ResultNode> results) {
    ResultNode smallestResult = null;
    for (int i = 0; i < results.size(); i++) {
      if (smallestResult == null || (results.get(i) != null && results.get(i).getSize() < smallestResult.getSize())) {
        smallestResult = results.get(i);
      }
    }
    return smallestResult;
  }
  ResultNode addTerminatorIfNeeded(ResultNode result) {
    ResultNode current = result;
    while (current.next != null) {
      current = current.next;
    }
    // Add TERMINATOR according to "8.4.8 Terminator"
    current.next = new ResultNode(Mode.TERMINATOR, result.version, true, stringToEncode.length(), result.charsetEncoderIndex, null);
    if (printStats) {
      int total = 0;
      int used = 0;
      for (int i = 0; i < memoizedResults.length; i++) {
        for (int j = 0; j < memoizedResults[i].length; j++) {
          for (int k = 0; k < memoizedResults[i][j].length; k++) {
            for (int l = 0; l < memoizedResults[i][j][k].length; l++) {
              total++;
              if (memoizedResults[i][j][k][l] != null) {
                used++;
              }
            }
          }
        }
      }
      System.err.println("INFO: total size=" + total + ", used=" + used + " (" + String.format("%2.2f", 100.0 * used / total) + "%)");
    }
    return result;
}
  /** Encode the string stringToEncode for the version size versionSize, starting at position position in mode mode. The function returns the number of bits used to encode the string.
   *  The number is minimal.
   *  When the function is called recursively without changing the mode, numberOfCharactersSinceLastModeChange needs to be incremented by one. */
  //TODO: change this method to be iterative
  ResultNode encode(int position, int charsetEncoderIndex, VersionSize versionSize, Mode mode, int numberOfCharactersSinceLastModeChange) {
    if (memoizedResults[position][charsetEncoderIndex][versionSize.ordinal()][getCompactedOrdinal(mode)] != null) {
      return memoizedResults[position][charsetEncoderIndex][versionSize.ordinal()][getCompactedOrdinal(mode)];
    }
    assert position < stringToEncode.length();
    if (mode != null) {
      // It is up to the caller to ensure that the number of processed characters is a multiple of the granularity in which characters in the current mode are packed.
      assert getEncodingGranularity(mode) == 0 || numberOfCharactersSinceLastModeChange % getEncodingGranularity(mode) == 0;
      // It is up to the caller to ensure that the number of processed characters doesn't exceed the maximum number of characters that can follow the current mode type in the current version.
      assert numberOfCharactersSinceLastModeChange <= getMaximumNumberOfEncodeableCharacters(versionSize, mode);
    }
    // compute results for KANJI, ALPHANUMERIC and NUMERIC
    final Mode[] modes = {Mode.KANJI, Mode.ALPHANUMERIC, Mode.NUMERIC};
    Vector<ResultNode> results = new Vector<ResultNode>();
    for (int i = 0; i < modes.length; i++) {
      Mode newMode = modes[i];
      int need = getEncodingGranularity(newMode);
      assert need > 0;
      if (position + need <= stringToEncode.length()) {
        boolean canEncode = true;
        for (int j = 0; j < need; j++) {
          if (!canEncode(newMode, stringToEncode.charAt(position + j))) {
            canEncode = false;
            break;
          }
        }
        if (canEncode) {
          boolean needNewModeToken = mode != newMode || numberOfCharactersSinceLastModeChange + need > getMaximumNumberOfEncodeableCharacters(versionSize, newMode);
          ResultNode next = position + need >= stringToEncode.length() ? null : encode(position + need, charsetEncoderIndex, versionSize, newMode, needNewModeToken ? need : numberOfCharactersSinceLastModeChange + need);
          results.add(new ResultNode(newMode, getVersion(versionSize), needNewModeToken, position, charsetEncoderIndex, next));
        }
      }
    }
    // compute results for BYTE
    Mode newMode = Mode.BYTE;
    for (int i = 0; i < encoders.length; i++) {
      if (encoders[i].canEncode(stringToEncode.charAt(position))) {
        int need = getBytesOfCharacter(position, i).length;
        boolean needECI = i != charsetEncoderIndex;
        boolean needNewModeToken = needECI || mode != newMode || numberOfCharactersSinceLastModeChange + need > getMaximumNumberOfEncodeableCharacters(versionSize, newMode);
        ResultNode next = position + 1 >= stringToEncode.length() ? null : encode(position + 1, i, versionSize, newMode, needNewModeToken ? need : numberOfCharactersSinceLastModeChange + need);
        next = new ResultNode(newMode, getVersion(versionSize), needNewModeToken, position, i, next);
        if (needECI) {
          next = new ResultNode(Mode.ECI, getVersion(versionSize), true, position, i, next);
        }
        results.add(next);
      }
    }
    // choose the smallest result
    ResultNode result = smallest(results);
    memoizedResults[position][charsetEncoderIndex][versionSize.ordinal()][getCompactedOrdinal(mode)] = result;
    return result;
}
  byte[] getBytesOfCharacter(int position, int charsetEncoderIndex) {
    //TODO: Is there a more efficient way for a single character?
    return stringToEncode.substring(position, position + 1).getBytes(encoders[charsetEncoderIndex].charset());
}
  class ResultNode {
    Mode mode;
    Version version;
    boolean declaresMode;
    int position;
    int charsetEncoderIndex;
    ResultNode next;

    ResultNode(Mode mode, Version version, boolean declaresMode, int position, int charsetEncoderIndex, ResultNode next) {
      assert mode != null;
this.mode = mode;
this.version = version;
this.declaresMode = declaresMode;
this.position = position;
this.charsetEncoderIndex = charsetEncoderIndex;
this.next = next;
}
    /** returns the size in bits */
    //TODO: change this method to be iterative
    int getSize() {
      int size = declaresMode ? 4 + mode.getCharacterCountBits(version) : 0;
      if (mode == Mode.ECI) {
        size += 8; // the ECI assignment numbers for ISO-8859-x, UTF-8 and UTF-16 are all 8 bits long
      } else if (mode == Mode.BYTE) {
        size += 8 * getBytesOfCharacter(position, charsetEncoderIndex).length;
      } else {
        size += getBitsPerEncodingUnit(mode);
      }
      if (next != null) {
        size += next.getSize();
      }
      return size;
}
    /** returns the length in encoding units */
    int getLength() {
      if (getBitsPerEncodingUnit(mode) == 0) {
        return 0;
      }
      assert declaresMode;
      int count = 1;
      ResultNode current = next;
      while (current != null && !current.declaresMode) {
        count++;
        current = current.next;
      }
      return count;
}
    //TODO: change this method to be iterative
    public void getBits(BitArray bits) throws WriterException {
      // append mode
      bits.appendBits(mode.getBits(), 4);
      if (mode == Mode.ECI) {
        String canonicalCharsetName = encoders[charsetEncoderIndex].charset().name();
        bits.appendBits(CharacterSetECI.getCharacterSetECIByName(canonicalCharsetName).getValue(), 8);
        if (next != null) {
          next.getBits(bits);
        }
      } else {
        int characterLength = getLength() * getEncodingGranularity(mode);
        if (characterLength > 0) {
          String canonicalCharsetName = encoders[charsetEncoderIndex].charset().name();
          String pieceToEncode = stringToEncode.substring(position, position + characterLength);
          // append length
          try {
            bits.appendBits(mode == Mode.BYTE ? pieceToEncode.getBytes(canonicalCharsetName).length : characterLength, mode.getCharacterCountBits(version));
          } catch (UnsupportedEncodingException uee) {
            throw new WriterException(uee);
          }
          // append data
          appendBytes(pieceToEncode, mode, bits, canonicalCharsetName);
          ResultNode current = next;
          while (current != null && !current.declaresMode) {
            current = current.next;
          }
          if (current != null) {
            current.getBits(bits);
          }
        } else {
          if (next != null) {
            next.getBits(bits);
          }
        }
}
}
    public Version getVersion(ErrorCorrectionLevel ecLevel) {
      int versionNumber = version.getVersionNumber();
      int lowerLimit;
      int upperLimit;
      switch (getVersionSize(version)) {
        case Small:
          lowerLimit = 1;
          upperLimit = 9;
          break;
        case Medium:
          lowerLimit = 10;
          upperLimit = 26;
          break;
        case Large:
        default:
          lowerLimit = 27;
          upperLimit = 40;
          break;
      }
      // increase version if needed
      while (versionNumber < upperLimit && !willFit(getSize(), Version.getVersionForNumber(versionNumber), ecLevel)) {
        versionNumber++;
      }
      // shrink version if possible
      while (versionNumber > lowerLimit && willFit(getSize(), Version.getVersionForNumber(versionNumber - 1), ecLevel)) {
        versionNumber--;
      }
      return Version.getVersionForNumber(versionNumber);
}
    //TODO: change this method to be iterative
    public String toString() {
      String result = "";
      if (declaresMode) {
        result += mode + "(";
      }
      if (mode == Mode.ECI) {
        result += encoders[charsetEncoderIndex].charset().displayName();
      } else {
        result += makePrintable(stringToEncode.substring(position, position + getEncodingGranularity(mode)));
      }
      if (next != null) {
        result += (next.declaresMode ? ")," : ",") + next.toString();
      } else {
        result += ")";
      }
      return result;
}
    String makePrintable(String s) {
      String result = "";
      for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) < 32 || s.charAt(i) > 126) {
          result += ".";
        } else {
          result += s.charAt(i);
        }
      }
      return result;
}
}
  /* static void testDecode(ResultNode rn) throws Exception {
       BitArray bits = new BitArray();
       rn.getBits(bits);
       int size = bits.getSize();
       byte[] bytes = new byte[size / 8 + 1];
       bits.toBytes(0, bytes, 0, size / 8);
       com.google.zxing.common.DecoderResult result = com.google.zxing.qrcode.decoder.DecodedBitStreamParser.decode(bytes, rn.getVersion(ErrorCorrectionLevel.L), null, new java.util.EnumMap<>(com.google.zxing.DecodeHintType.class));
     } */

  // Everything below this line is copied from com.google.zxing.qrcode.encoder.Encoder and can be removed if this class is put in the same package.

  // The original table is defined in table 5 of JISX0510:2004 (p.19).
  private static final int[] ALPHANUMERIC_TABLE = {
      -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0x00-0x0f
      -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0x10-0x1f
      36, -1, -1, -1, 37, 38, -1, -1, -1, -1, 39, 40, -1, 41, 42, 43, // 0x20-0x2f
       0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 44, -1, -1, -1, -1, -1, // 0x30-0x3f
      -1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, // 0x40-0x4f
      25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, -1, -1, -1, -1, -1, // 0x50-0x5f
  };
/**
 * @return the code point of the table used in alphanumeric mode or
 *  -1 if there is no corresponding code in the table.
 */
static int getAlphanumericCode(int code) {
  if (code < ALPHANUMERIC_TABLE.length) {
    return ALPHANUMERIC_TABLE[code];
  }
  return -1;
}
private static boolean isOnlyDoubleByteKanji(String content) {
  byte[] bytes;
  try {
    bytes = content.getBytes("Shift_JIS");
  } catch (UnsupportedEncodingException ignored) {
    return false;
  }
  int length = bytes.length;
  if (length % 2 != 0) {
    return false;
  }
  for (int i = 0; i < length; i += 2) {
    int byte1 = bytes[i] & 0xFF;
    if ((byte1 < 0x81 || byte1 > 0x9F) && (byte1 < 0xE0 || byte1 > 0xEB)) {
      return false;
    }
  }
  return true;
}
/*
 * Append "bytes" in "mode" mode (encoding) into "bits". On success, store the result in "bits".
 */
static void appendBytes(String content,
                        Mode mode,
                        BitArray bits,
                        String encoding) throws WriterException {
  switch (mode) {
    case NUMERIC:
      appendNumericBytes(content, bits);
      break;
    case ALPHANUMERIC:
      appendAlphanumericBytes(content, bits);
      break;
    case BYTE:
      append8BitBytes(content, bits, encoding);
      break;
    case KANJI:
      appendKanjiBytes(content, bits);
      break;
    default:
      throw new WriterException("Invalid mode: " + mode);
  }
}
static void appendNumericBytes(CharSequence content, BitArray bits) {
  int length = content.length();
  int i = 0;
  while (i < length) {
    int num1 = content.charAt(i) - '0';
    if (i + 2 < length) {
      // Encode three numeric letters in ten bits.
      int num2 = content.charAt(i + 1) - '0';
      int num3 = content.charAt(i + 2) - '0';
      bits.appendBits(num1 * 100 + num2 * 10 + num3, 10);
      i += 3;
    } else if (i + 1 < length) {
      // Encode two numeric letters in seven bits.
      int num2 = content.charAt(i + 1) - '0';
      bits.appendBits(num1 * 10 + num2, 7);
      i += 2;
    } else {
      // Encode one numeric letter in four bits.
      bits.appendBits(num1, 4);
      i++;
    }
  }
}
static void appendAlphanumericBytes(CharSequence content, BitArray bits) throws WriterException {
  int length = content.length();
  int i = 0;
  while (i < length) {
    int code1 = getAlphanumericCode(content.charAt(i));
    if (code1 == -1) {
      throw new WriterException();
    }
    if (i + 1 < length) {
      int code2 = getAlphanumericCode(content.charAt(i + 1));
      if (code2 == -1) {
        throw new WriterException();
      }
      // Encode two alphanumeric letters in 11 bits.
      bits.appendBits(code1 * 45 + code2, 11);
      i += 2;
    } else {
      // Encode one alphanumeric letter in six bits.
      bits.appendBits(code1, 6);
      i++;
    }
  }
}
static void append8BitBytes(String content, BitArray bits, String encoding)
    throws WriterException {
  byte[] bytes;
  try {
    bytes = content.getBytes(encoding);
  } catch (UnsupportedEncodingException uee) {
    throw new WriterException(uee);
  }
  for (byte b : bytes) {
    bits.appendBits(b, 8);
  }
}
static void appendKanjiBytes(String content, BitArray bits) throws WriterException {
  byte[] bytes;
  try {
    bytes = content.getBytes("Shift_JIS");
  } catch (UnsupportedEncodingException uee) {
    throw new WriterException(uee);
  }
  if (bytes.length % 2 != 0) {
    throw new WriterException("Kanji byte size not even");
  }
  int maxI = bytes.length - 1; // bytes.length must be even
  for (int i = 0; i < maxI; i += 2) {
    int byte1 = bytes[i] & 0xFF;
    int byte2 = bytes[i + 1] & 0xFF;
    int code = (byte1 << 8) | byte2;
    int subtracted = -1;
    if (code >= 0x8140 && code <= 0x9ffc) {
      subtracted = code - 0x8140;
    } else if (code >= 0xe040 && code <= 0xebbf) {
      subtracted = code - 0xc140;
    }
    if (subtracted == -1) {
      throw new WriterException("Invalid byte sequence");
    }
    int encoded = ((subtracted >> 8) * 0xc0) + (subtracted & 0xff);
    bits.appendBits(encoded, 13);
  }
}
/**
 * @return true if the number of input bits will fit in a code with the specified version and
 *  error correction level.
 */
private static boolean willFit(int numInputBits, Version version, ErrorCorrectionLevel ecLevel) {
  // In the following comments, we use numbers of Version 7-H.
  // numBytes = 196
  int numBytes = version.getTotalCodewords();
  // getNumECBytes = 130
  Version.ECBlocks ecBlocks = version.getECBlocksForLevel(ecLevel);
  int numEcBytes = ecBlocks.getTotalECCodewords();
  // getNumDataBytes = 196 - 130 = 66
  int numDataBytes = numBytes - numEcBytes;
  int totalInputBytes = (numInputBits + 7) / 8;
  return numDataBytes >= totalInputBytes;
}
}
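As an aside, the 13-bit Kanji packing in appendKanjiBytes can be checked in isolation. The following standalone sketch (class name invented for the illustration) applies the same range subtraction and shows that the valid double-byte Shift_JIS ranges map into 0..8191, i.e. fit exactly in 13 bits:

```java
public class KanjiPackingDemo {

  // Same mapping as appendKanjiBytes: subtract the range base, then fold
  // the two bytes into a single 13-bit value.
  static int pack(int code) {
    int subtracted;
    if (code >= 0x8140 && code <= 0x9FFC) {
      subtracted = code - 0x8140;
    } else if (code >= 0xE040 && code <= 0xEBBF) {
      subtracted = code - 0xC140;
    } else {
      throw new IllegalArgumentException("not a double-byte Kanji code");
    }
    return ((subtracted >> 8) * 0xC0) + (subtracted & 0xFF);
  }

  public static void main(String[] args) {
    System.out.println(pack(0x8140)); // 0    (lowest valid code)
    System.out.println(pack(0xEBBF)); // 8191 (highest valid code, == 2^13 - 1)
  }
}
```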
I'd have to see this in a pull request, not pasted code.
It sounds like a good idea - it does increase the complexity and runtime of encoding, but for a good reason. I think the question is how invasive your change is when integrated with the existing code.
Thanks for the quick response. If you agree, I will create the pull request first with the experimental code, so that you can stop the process right at the beginning if you find it too invasive. If that is OK with you, only then will I take care of the recursion issue, optimize the internal representation, and make whatever other changes you think should be done.
So is a pull request for this coming? I think it would solve my issue, which is that zxing can't encode a smart health card correctly. The standard for smart health cards is very specific. Basically, a smart health card looks like
shc:/numeric code for card
And the standard requires it to be encoded with two modes: BYTE for the shc:/ prefix and NUMERIC for the rest. It looks to me like the smart encoder above would figure this out.
Today zxing simply encodes the whole lot as bytes, which doesn't work for any shc reader.
Yes, it is implemented. You have to activate it by using the encoding hint EncodeHintType.QR_COMPACT since it is not activated by default. I tested an example string "shc:/56762909524320603460292437404460293829382983923928398" and got the following result:
Minimal encoding of string "shc:/56762909524320603460292437404460293829382983923928398" requires 243 bits in version 2 and is encoded as follows:
BYTE(shc:/),NUMERIC(56762909524320603460292437404460293829382983923928398)
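For reference, the 243-bit figure can be reproduced from the per-segment costs alone (4-bit mode indicator plus the character count indicator widths for versions 1-9: 8 bits for BYTE, 10 bits for NUMERIC). The class below is just an illustration, not part of the patch:

```java
public class ShcBitCount {

  // Mode indicator (4 bits) + character count indicator + payload, versions 1-9.
  static int byteModeBits(int chars) {
    return 4 + 8 + 8 * chars;
  }

  static int numericModeBits(int digits) {
    int payload = (digits / 3) * 10;   // three digits -> 10 bits
    switch (digits % 3) {
      case 2: payload += 7; break;     // two leftover digits -> 7 bits
      case 1: payload += 4; break;     // one leftover digit -> 4 bits
      default: break;
    }
    return 4 + 10 + payload;
  }

  public static void main(String[] args) {
    String digits = "56762909524320603460292437404460293829382983923928398";
    // BYTE(shc:/) = 52 bits, NUMERIC(53 digits) = 191 bits
    System.out.println(byteModeBits(5) + numericModeBits(digits.length())); // 243
  }
}
```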
Description:
Currently the data is encoded using a single mode, so that, for example, the presence of a single non-numeric character in an otherwise numeric input string makes the encoder use BYTE encoding in place of a combination of NUMERIC and ALPHANUMERIC or BYTE segments.
As a result, the GS1 string 010950110153003171407021012345a is currently encoded using 248 bits in BYTE mode requiring a version 4 code whereas it could be encoded in 158 bits fitting in a smaller version 3 code.
A good implementation of compaction could also prevent users from having to specify the encoding via EncodeHintType.CHARACTER_SET. Instead, the algorithm could find the shortest byte representation for multi-language text, possibly using multiple ECI switches. The examples below illustrate this.
Background:
I stumbled on this topic because I was asked to provide a fix that would allow placing more than one FNC1 in a GS1 QR-Code in order to make a particular scanner work.
It turns out that the request is invalid (unlike all other GS1 codes, GS1 QR-Code does not use FNC1 to delimit AIs), but a side effect of the work is that I implemented multi-mode encoding that minimizes the size.
Make a pull request?
My question is now whether a pull request with such code might be considered. In that case I would improve the code to avoid recursion proportional to the input length and to use a leaner internal representation of a solution (possibly no explicit representation at all, reconstructing the solution from the memoization array).
Algorithm proposed in the specification (ISO/IEC 18004):
Annex H of the QR-Code specification (titled "Optimization of bit stream length") contains a fast algorithm, introduced with the words "may form the basis of one possible algorithm to determine the shortest bit stream for any given input data".
The speed and memory requirements of this algorithm probably can't be improved, and the implementation looks very straightforward. Moreover, the results are likely sufficient for all practical cases, but they aren't guaranteed to be minimal in all cases.
As an example, consider the input "123A". The algorithm from Annex H looks ahead only three characters and hence encodes as NUMERIC(123),BYTE(A) (4+10+10+4+8+8 = 44 bits in versions 1-9) while it could encode more compactly as ALPHANUMERIC(12,3A) (4+9+11+11 = 35 bits in versions 1-9).
Similarly, the string ABCD is minimally encoded as ALPHANUMERIC(AB,CD), while ABCDE is minimally encoded as BYTE(A,B,C,D,E) (solution wouldn't be found with a lookahead of 2) and ABCDEFG is minimally encoded as ALPHANUMERIC(AB,CD,EF),BYTE(G).
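The bit counts in the "123A" example above can be verified with the per-mode cost formulas for versions 1-9 (the class name is invented for the illustration):

```java
public class SegmentCostDemo {

  // Bit cost of one segment in QR versions 1-9: 4-bit mode indicator,
  // character count indicator, then the packed payload.
  static int numeric(int digits) {
    return 4 + 10 + (digits / 3) * 10 + (digits % 3 == 2 ? 7 : digits % 3 == 1 ? 4 : 0);
  }

  static int alphanumeric(int chars) {
    return 4 + 9 + (chars / 2) * 11 + (chars % 2) * 6;
  }

  static int byteMode(int chars) {
    return 4 + 8 + 8 * chars;
  }

  public static void main(String[] args) {
    // Annex H result for "123A": NUMERIC(123) followed by BYTE(A)
    System.out.println(numeric(3) + byteMode(1)); // 44
    // Minimal result: one ALPHANUMERIC segment over all four characters
    System.out.println(alphanumeric(4));          // 35
  }
}
```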
The algorithm also doesn't consider switching ECIs to minimize BYTE encoded text.
Example: The string "ŐŜ" is minimally encoded using UTF-8 as ECI(UTF-8),BYTE(Ő,Ŝ), while the string "ŐŐŜŜ", built from the same two characters, is minimally encoded using the two ECIs ISO-8859-2 and ISO-8859-3 as ECI(ISO-8859-2),BYTE(Ő,Ő),ECI(ISO-8859-3),BYTE(Ŝ,Ŝ) rather than with a single ECI switch to UTF-8, because these characters have a double-byte representation in UTF-8.
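Assuming the usual 12-bit ECI header (4-bit mode indicator plus an 8-bit designator, valid for ECI assignment numbers up to 127) and version 1-9 count indicators, the trade-off works out as follows (illustrative class, not part of the patch):

```java
public class EciCostDemo {

  static final int ECI_HEADER_BITS = 4 + 8; // ECI mode indicator + 8-bit designator (ECI <= 127)

  // BYTE segment in versions 1-9: mode (4) + count (8) + 8 bits per payload byte.
  static int byteSegmentBits(int byteCount) {
    return 4 + 8 + 8 * byteCount;
  }

  public static void main(String[] args) {
    // "ŐŐŜŜ" in UTF-8: each character takes two bytes, so 8 payload bytes.
    int singleEci = ECI_HEADER_BITS + byteSegmentBits(8);
    // ISO-8859-2 covers Ő and ISO-8859-3 covers Ŝ, one byte per character.
    int twoEcis = ECI_HEADER_BITS + byteSegmentBits(2)
                + ECI_HEADER_BITS + byteSegmentBits(2);
    System.out.println(singleEci + " vs " + twoEcis); // 88 vs 80
  }
}
```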
Proposed alternative algorithm:
Alternatively, an exact search (Dijkstra's shortest-path algorithm, dynamic programming, or divide and conquer with memoization) yields mathematically minimal solutions, as the attached program demonstrates (it uses divide and conquer with memoization).
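To sketch the divide-and-conquer-with-memoization idea, here is a deliberately simplified toy that considers only NUMERIC and BYTE segments over a single-byte charset, with version 1-9 count indicators (this is not the attached program, just an illustration of the technique):

```java
import java.util.Arrays;

public class ToyMinimalEncoder {

  // Segment bit costs for QR versions 1-9 (mode indicator + count indicator + payload).
  static int numericCost(int digits) {
    return 4 + 10 + (digits / 3) * 10 + new int[] {0, 4, 7}[digits % 3];
  }

  static int byteCost(int chars) {
    return 4 + 8 + 8 * chars; // assumes a single-byte character set
  }

  // Minimal bits for s.substring(start); memoized on the start position.
  static int best(String s, int[] memo, int start) {
    if (start == s.length()) {
      return 0;
    }
    if (memo[start] >= 0) {
      return memo[start];
    }
    int min = Integer.MAX_VALUE;
    boolean allDigits = true;
    for (int end = start + 1; end <= s.length(); end++) {
      allDigits &= Character.isDigit(s.charAt(end - 1));
      int rest = best(s, memo, end);
      min = Math.min(min, byteCost(end - start) + rest);
      if (allDigits) {
        min = Math.min(min, numericCost(end - start) + rest);
      }
    }
    return memo[start] = min;
  }

  static int minimalBits(String s) {
    int[] memo = new int[s.length()];
    Arrays.fill(memo, -1);
    return best(s, memo, 0);
  }

  public static void main(String[] args) {
    // Reproduces the 243-bit smart health card result: BYTE(shc:/) + NUMERIC(53 digits).
    System.out.println(minimalBits("shc:/56762909524320603460292437404460293829382983923928398"));
  }
}
```

Reconstructing which segmentation achieved the minimum is then a matter of walking the memoization array again, which is the "leaner internal representation" mentioned above.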
In zxing, minimal encoding should probably be activated on demand by an encoding hint because no matter how efficiently this is implemented, it will not be as efficient as the current code.
Another reason may be that some scanners have problems reading the optimized codes. During my tests I observed scanning problems with a German application called bctester (it also has problems with some non-optimized codes) and an Android app called "Barcode Scanner". Reading the images back with Google zxing works fine, but with the default recognition built into my vanilla Android phone's app, codes using for example ISO-8859-6 encoding do not display the characters correctly (I didn't test it, but I expect it to fail the same way if that encoding is set via EncodeHintType.CHARACTER_SET using the current implementation).
The class MinimalEncoder.java:
The code can be found at the bottom.
Producing test output:
The class contains a main method for testing. It reads space-separated input strings from the command line, computes minimal representations, and prints a string representation of each.
Invocation example:
Example of using the class to populate a BitArray with a minimal solution:
In the actual patch, where this class is called from com.google.zxing.qrcode.encoder.Encoder, there is an additional twist, since the total size of the minimal encoding may not fit in the version assumed when computing the solution.
The function Encoder.willFit() is used to check this, and in the following loop up to two additional calls to the minimal encoder are made to find a fitting version:
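For illustration, a simplified self-contained variant of that check, with data-codeword counts for error correction level L, versions 1-5, taken from the spec's capacity tables, could look like this:

```java
public class VersionFitDemo {

  // Data codewords per version at error correction level L (ISO/IEC 18004
  // capacity tables); versions 1-5 only for this sketch.
  static final int[] DATA_CODEWORDS_L = {19, 34, 55, 80, 108};

  static boolean willFit(int numInputBits, int versionNumber) {
    int totalInputBytes = (numInputBits + 7) / 8;
    return DATA_CODEWORDS_L[versionNumber - 1] >= totalInputBytes;
  }

  static int smallestFittingVersion(int numInputBits) {
    for (int version = 1; version <= DATA_CODEWORDS_L.length; version++) {
      if (willFit(numInputBits, version)) {
        return version;
      }
    }
    throw new IllegalArgumentException("does not fit in versions 1-5 at level L");
  }

  public static void main(String[] args) {
    // The 243-bit smart health card encoding from above fits a version 2 code at level L.
    System.out.println(smallestFittingVersion(243)); // 2
  }
}
```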
Best regards,
Alex
Source of MinimalEncoder.java