feat(spec/java): add strip flag in meta string encoding spec (#1565)
## What does this PR do?

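Adds a strip flag to the meta string encoding spec and updates the Java implementation to match. The flag is one bit prepended to `LOWER_SPECIAL` and `LOWER_UPPER_DIGIT_SPECIAL` encoded data; `1` tells the decoder to strip the last decoded char, which can be an artifact of padding bits in the last byte.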

## Related issues

#1540

## Does this PR introduce any user-facing change?


- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?


## Benchmark

chaokunyang committed Apr 24, 2024
1 parent d3a7876 commit ba451c5
Showing 8 changed files with 81 additions and 133 deletions.
10 changes: 5 additions & 5 deletions docs/specification/java_serialization_spec.md
@@ -223,11 +223,11 @@ Meta string is mainly used to encode meta strings such as class name and field n

String binary encoding algorithm:

| Algorithm | Pattern | Description |
|---------------------------|--------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9[c1,c2]` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `c1,c2`: `0b111110~0b111111`, `c1,c2` should be two of `._$` |
| UTF-8 | any chars | UTF-8 encoding |
| Algorithm | Pattern | Description |
|---------------------------|---------------|-------------|
| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`; one flag bit is prepended at the start to indicate whether to strip the last decoded char, since the last byte may contain up to 7 padding bits (1 means strip the last char) |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111`; one flag bit is prepended at the start to indicate whether to strip the last decoded char, since the last byte may contain up to 7 padding bits (1 means strip the last char) |
| UTF-8 | any chars | UTF-8 encoding |
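
The `LOWER_SPECIAL` row can be illustrated with a small round trip. This is a minimal sketch, not the Fury implementation: the class `LowerSpecialSketch` and its method names are hypothetical, and it assumes the flag occupies the most significant bit of the first byte and that padding bits are zero.

```java
import java.util.Arrays;

// Hypothetical sketch class for illustration; not part of Fury.
final class LowerSpecialSketch {
  // Maps a char in `a-z._$|` to its 5-bit value (0..29).
  static int charToValue(char c) {
    if (c >= 'a' && c <= 'z') {
      return c - 'a';
    }
    switch (c) {
      case '.': return 26;
      case '_': return 27;
      case '$': return 28;
      case '|': return 29;
      default: throw new IllegalArgumentException("Unsupported char: " + c);
    }
  }

  // Inverse of charToValue.
  static char valueToChar(int v) {
    if (v >= 0 && v <= 25) {
      return (char) ('a' + v);
    }
    switch (v) {
      case 26: return '.';
      case 27: return '_';
      case 28: return '$';
      case 29: return '|';
      default: throw new IllegalArgumentException("Invalid value: " + v);
    }
  }

  static byte[] encode(String s) {
    int totalBits = s.length() * 5 + 1; // +1 for the leading strip flag
    byte[] bytes = new byte[(totalBits + 7) / 8];
    int bit = 1; // bit 0 (MSB of byte 0) is reserved for the flag
    for (char c : s.toCharArray()) {
      int v = charToValue(c);
      for (int i = 4; i >= 0; i--, bit++) {
        if ((v & (1 << i)) != 0) {
          bytes[bit / 8] |= (byte) (1 << (7 - bit % 8));
        }
      }
    }
    // With 5 or more padding bits the decoder would read one extra char,
    // so the flag asks it to drop that char.
    if (bytes.length * 8 >= totalBits + 5) {
      bytes[0] |= (byte) 0x80;
    }
    return bytes;
  }

  static String decode(byte[] data) {
    boolean strip = (data[0] & 0x80) != 0;
    int totalBits = data.length * 8;
    StringBuilder sb = new StringBuilder();
    for (int bit = 1; bit + 5 <= totalBits; bit += 5) {
      int v = 0;
      for (int i = 0; i < 5; i++) {
        int b = bit + i;
        v = (v << 1) | ((data[b / 8] >> (7 - b % 8)) & 1);
      }
      sb.append(valueToChar(v));
    }
    if (strip) {
      sb.deleteCharAt(sb.length() - 1);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    byte[] encoded = encode("ab");
    System.out.println(Arrays.toString(encoded)); // [-128, 32]
    System.out.println(decode(encoded)); // ab
  }
}
```

For `"ab"` the payload is 1 + 10 = 11 bits in 2 bytes, which leaves 5 padding bits. Without the flag the decoder would read those zeros as a trailing `a`.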

Encoding flags:

10 changes: 5 additions & 5 deletions docs/specification/xlang_serialization_spec.md
@@ -338,11 +338,11 @@ Meta string is mainly used to encode meta strings such as field names.

String binary encoding algorithm:

| Algorithm | Pattern | Description |
|---------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111` |
| UTF-8 | any chars | UTF-8 encoding |
| Algorithm | Pattern | Description |
|---------------------------|---------------|-------------|
| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`; one flag bit is prepended at the start to indicate whether to strip the last decoded char, since the last byte may contain up to 7 padding bits (1 means strip the last char) |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111`; one flag bit is prepended at the start to indicate whether to strip the last decoded char, since the last byte may contain up to 7 padding bits (1 means strip the last char) |
| UTF-8 | any chars | UTF-8 encoding |
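
Whether the flag is set depends only on the char count and the bits per char. The sketch below works through that arithmetic; the class `StripFlagMath` and its methods are hypothetical names, and the condition mirrors the one used in `encodeGeneric` in this commit.

```java
// Hypothetical helper for illustration; not part of Fury.
final class StripFlagMath {
  // Number of bytes needed for numChars chars plus the leading flag bit.
  static int byteLength(int numChars, int bitsPerChar) {
    int totalBits = numChars * bitsPerChar + 1;
    return (totalBits + 7) / 8;
  }

  // True when the padding in the last byte is wide enough to be read as one more char.
  static boolean needsStrip(int numChars, int bitsPerChar) {
    int totalBits = numChars * bitsPerChar + 1;
    return byteLength(numChars, bitsPerChar) * 8 >= totalBits + bitsPerChar;
  }

  public static void main(String[] args) {
    for (int bitsPerChar : new int[] {5, 6}) {
      for (int n = 1; n <= 8; n++) {
        System.out.printf(
            "bitsPerChar=%d chars=%d bytes=%d strip=%b%n",
            bitsPerChar, n, byteLength(n, bitsPerChar), needsStrip(n, bitsPerChar));
      }
    }
  }
}
```

With 5 bits per char the flag is set for 2, 5 and 8 chars; with 6 bits per char it is set for 4 and 8 chars, the cases where the padding is at least one char wide.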

Encoding flags:

40 changes: 15 additions & 25 deletions java/fury-core/src/main/java/org/apache/fury/meta/MetaString.java
@@ -21,6 +21,7 @@

import java.util.Arrays;
import java.util.Objects;
import org.apache.fury.util.Preconditions;

/**
* Represents a string with metadata that describes its encoding. It supports different encodings
@@ -61,31 +62,27 @@ public static Encoding fromInt(int value) {
private final char specialChar1;
private final char specialChar2;
private final byte[] bytes;
private final int numChars;
private final int numBits;
private final boolean stripLastChar;

/**
* Constructs a MetaString with the specified encoding and data.
*
* @param encoding The type of encoding used for the string data.
* @param bytes The encoded string data as a byte array.
* @param numBits The number of bits used for encoding.
*/
public MetaString(
String string,
Encoding encoding,
char specialChar1,
char specialChar2,
byte[] bytes,
int numChars,
int numBits) {
String string, Encoding encoding, char specialChar1, char specialChar2, byte[] bytes) {
this.string = string;
this.encoding = encoding;
this.specialChar1 = specialChar1;
this.specialChar2 = specialChar2;
this.bytes = bytes;
this.numChars = numChars;
this.numBits = numBits;
if (encoding != Encoding.UTF_8) {
Preconditions.checkArgument(bytes.length > 0);
// The strip flag is the most significant bit of the first byte, as written by MetaStringEncoder.
this.stripLastChar = (bytes[0] & 0x80) != 0;
} else {
this.stripLastChar = false;
}
}

public String getString() {
@@ -108,12 +105,8 @@ public byte[] getBytes() {
return bytes;
}

public int getNumChars() {
return numChars;
}

public int getNumBits() {
return numBits;
public boolean stripLastChar() {
return stripLastChar;
}

@Override
@@ -127,15 +120,14 @@ public boolean equals(Object o) {
MetaString that = (MetaString) o;
return specialChar1 == that.specialChar1
&& specialChar2 == that.specialChar2
&& numChars == that.numChars
&& numBits == that.numBits
&& stripLastChar == that.stripLastChar
&& encoding == that.encoding
&& Arrays.equals(bytes, that.bytes);
}

@Override
public int hashCode() {
int result = Objects.hash(encoding, specialChar1, specialChar2, numChars, numBits);
int result = Objects.hash(encoding, specialChar1, specialChar2, stripLastChar);
result = 31 * result + Arrays.hashCode(bytes);
return result;
}
@@ -153,10 +145,8 @@ public String toString() {
+ specialChar2
+ ", bytes="
+ Arrays.toString(bytes)
+ ", numChars="
+ numChars
+ ", numBits="
+ numBits
+ ", stripLastChar="
+ stripLastChar
+ '}';
}
}

@@ -45,19 +45,18 @@ public MetaStringDecoder(char specialChar1, char specialChar2) {
*
* @param encodedData encoded data using passed <code>encoding</code>.
* @param encoding encoding the passed data.
* @param numBits total bits for encoded data.
* @return Decoded string.
*/
public String decode(byte[] encodedData, Encoding encoding, int numBits) {
public String decode(byte[] encodedData, Encoding encoding) {
switch (encoding) {
case LOWER_SPECIAL:
return decodeLowerSpecial(encodedData, numBits);
return decodeLowerSpecial(encodedData);
case LOWER_UPPER_DIGIT_SPECIAL:
return decodeLowerUpperDigitSpecial(encodedData, numBits);
return decodeLowerUpperDigitSpecial(encodedData);
case FIRST_TO_LOWER_SPECIAL:
return decodeRepFirstLowerSpecial(encodedData, numBits);
return decodeRepFirstLowerSpecial(encodedData);
case ALL_TO_LOWER_SPECIAL:
return decodeRepAllToLowerSpecial(encodedData, numBits);
return decodeRepAllToLowerSpecial(encodedData);
case UTF_8:
return new String(encodedData, StandardCharsets.UTF_8);
default:
@@ -66,30 +65,36 @@ public String decode(byte[] encodedData, Encoding encoding, int numBits) {
}

/** Decoding method for {@link Encoding#LOWER_SPECIAL}. */
private String decodeLowerSpecial(byte[] data, int numBits) {
private String decodeLowerSpecial(byte[] data) {
StringBuilder decoded = new StringBuilder();
int bitIndex = 0;
int bitMask = 0b11111; // 5 bits for mask
while (bitIndex + 5 <= numBits) {
int totalBits = data.length * 8; // Total number of bits in the data
boolean stripLastChar = (data[0] & 0x80) != 0; // Check the first bit of the first byte
int bitMask = 0b11111; // 5 bits for the mask
int bitIndex = 1; // Start from the second bit
while (bitIndex + 5 <= totalBits) {
int byteIndex = bitIndex / 8;
int intraByteIndex = bitIndex % 8;
// Extract the 5-bit character value across byte boundaries if needed
int charValue =
((data[byteIndex] & 0xFF) << 8)
| (byteIndex + 1 < data.length ? (data[byteIndex + 1] & 0xFF) : 0);
charValue = ((byte) ((charValue >> (11 - intraByteIndex)) & bitMask));
charValue = (byte) ((charValue >> (11 - intraByteIndex)) & bitMask);
bitIndex += 5;
decoded.append(decodeLowerSpecialChar(charValue));
}

if (stripLastChar) {
decoded.deleteCharAt(decoded.length() - 1);
}
return decoded.toString();
}

/** Decoding method for {@link Encoding#LOWER_UPPER_DIGIT_SPECIAL}. */
private String decodeLowerUpperDigitSpecial(byte[] data, int numBits) {
private String decodeLowerUpperDigitSpecial(byte[] data) {
StringBuilder decoded = new StringBuilder();
int bitIndex = 0;
int bitIndex = 1;
boolean stripLastChar = (data[0] & 0x80) != 0; // Check the first bit of the first byte
int bitMask = 0b111111; // 6 bits for mask
int numBits = data.length * 8;
while (bitIndex + 6 <= numBits) {
int byteIndex = bitIndex / 8;
int intraByteIndex = bitIndex % 8;
@@ -102,6 +107,9 @@ private String decodeLowerUpperDigitSpecial(byte[] data, int numBits) {
bitIndex += 6;
decoded.append(decodeLowerUpperDigitSpecialChar(charValue));
}
if (stripLastChar) {
decoded.deleteCharAt(decoded.length() - 1);
}
return decoded.toString();
}

@@ -140,13 +148,13 @@ private char decodeLowerUpperDigitSpecialChar(int charValue) {
}
}

private String decodeRepFirstLowerSpecial(byte[] data, int numBits) {
String str = decodeLowerSpecial(data, numBits);
private String decodeRepFirstLowerSpecial(byte[] data) {
String str = decodeLowerSpecial(data);
return StringUtils.capitalize(str);
}

private String decodeRepAllToLowerSpecial(byte[] data, int numBits) {
String str = decodeLowerSpecial(data, numBits);
private String decodeRepAllToLowerSpecial(byte[] data) {
String str = decodeLowerSpecial(data);
StringBuilder builder = new StringBuilder();
char[] chars = str.toCharArray();
for (int i = 0; i < chars.length; i++) {

@@ -48,8 +48,7 @@ public MetaStringEncoder(char specialChar1, char specialChar2) {
*/
public MetaString encode(String input) {
if (input.isEmpty()) {
return new MetaString(
input, Encoding.LOWER_SPECIAL, specialChar1, specialChar2, new byte[0], 0, 0);
return new MetaString(input, Encoding.UTF_8, specialChar1, specialChar2, new byte[0]);
}
Encoding encoding = computeEncoding(input);
return encode(input, encoding);
@@ -66,53 +65,27 @@ public MetaString encode(String input, Encoding encoding) {
Preconditions.checkArgument(
input.length() < Short.MAX_VALUE, "Meta string must be shorter than 32767 chars");
if (input.isEmpty()) {
return new MetaString(
input, Encoding.LOWER_SPECIAL, specialChar1, specialChar2, new byte[0], 0, 0);
return new MetaString(input, Encoding.UTF_8, specialChar1, specialChar2, new byte[0]);
}
int length = input.length();
byte[] bytes;
switch (encoding) {
case LOWER_SPECIAL:
return new MetaString(
input,
encoding,
specialChar1,
specialChar2,
encodeLowerSpecial(input),
length,
length * 5);
bytes = encodeLowerSpecial(input);
return new MetaString(input, encoding, specialChar1, specialChar2, bytes);
case LOWER_UPPER_DIGIT_SPECIAL:
return new MetaString(
input,
encoding,
specialChar1,
specialChar2,
encodeLowerUpperDigitSpecial(input),
length,
length * 6);
bytes = encodeLowerUpperDigitSpecial(input);
return new MetaString(input, encoding, specialChar1, specialChar2, bytes);
case FIRST_TO_LOWER_SPECIAL:
return new MetaString(
input,
encoding,
specialChar1,
specialChar2,
encodeFirstToLowerSpecial(input),
length,
length * 5);
bytes = encodeFirstToLowerSpecial(input);
return new MetaString(input, encoding, specialChar1, specialChar2, bytes);
case ALL_TO_LOWER_SPECIAL:
char[] chars = input.toCharArray();
int upperCount = countUppers(chars);
return new MetaString(
input,
encoding,
specialChar1,
specialChar2,
encodeAllToLowerSpecial(chars, upperCount),
length,
(upperCount + length) * 5);
bytes = encodeAllToLowerSpecial(chars, upperCount);
return new MetaString(input, encoding, specialChar1, specialChar2, bytes);
default:
byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
return new MetaString(
input, Encoding.UTF_8, specialChar1, specialChar2, bytes, bytes.length * 8, 0);
bytes = input.getBytes(StandardCharsets.UTF_8);
return new MetaString(input, Encoding.UTF_8, specialChar1, specialChar2, bytes);
}
}

@@ -238,10 +211,10 @@ private byte[] encodeGeneric(String input, int bitsPerChar) {
}

private byte[] encodeGeneric(char[] chars, int bitsPerChar) {
int totalBits = chars.length * bitsPerChar;
int totalBits = chars.length * bitsPerChar + 1;
int byteLength = (totalBits + 7) / 8; // Calculate number of needed bytes
byte[] bytes = new byte[byteLength];
int currentBit = 0;
int currentBit = 1;
for (char c : chars) {
int value =
(bitsPerChar == 5) ? charToValueLowerSpecial(c) : charToValueLowerUpperDigitSpecial(c);
@@ -256,7 +229,10 @@ private byte[] encodeGeneric(char[] chars, int bitsPerChar) {
currentBit++;
}
}

boolean stripLastChar = bytes.length * 8 >= totalBits + bitsPerChar;
if (stripLastChar) {
bytes[0] = (byte) (bytes[0] | 0x80);
}
return bytes;
}


@@ -27,7 +27,6 @@

@Internal
final class MetaStringBytes {
static final int STRIP_LAST_CHAR = 0b1000;
static final short DEFAULT_DYNAMIC_WRITE_STRING_ID = -1;

final byte[] bytes;
@@ -57,25 +56,14 @@ public MetaStringBytes(MetaString metaString) {
}
hashCode &= 0xffffffffffffff00L;
int header = metaString.getEncoding().getValue();
String decoded =
new MetaStringDecoder(metaString.getSpecialChar1(), metaString.getSpecialChar2())
.decode(bytes, metaString.getEncoding(), bytes.length * 8);
if (decoded.length() > metaString.getString().length()) {
header |= STRIP_LAST_CHAR;
}
this.hashCode = hashCode | header;
}

public String decode(char specialChar1, char specialChar2) {
int header = (int) (hashCode & 0xff);
int encodingFlags = header & 0b111;
MetaString.Encoding encoding = MetaString.Encoding.values()[encodingFlags];
String str =
new MetaStringDecoder(specialChar1, specialChar2).decode(bytes, encoding, bytes.length * 8);
if ((header & STRIP_LAST_CHAR) != 0) {
str = str.substring(0, str.length() - 1);
}
return str;
return new MetaStringDecoder(specialChar1, specialChar2).decode(bytes, encoding);
}

@Override

@@ -31,8 +31,6 @@
* share common immutable datastructure globally across multiple fury.
*/
public final class MetaStringResolver {
public static final byte USE_STRING_VALUE = 0;
public static final byte USE_STRING_ID = 1;
private static final int initialCapacity = 8;
// use a lower load factor to minimize hash collision
private static final float furyMapLoadFactor = 0.25f;