Java中的稀疏矩阵/数组

我正在开发一个用Java编写的项目,该项目要求我建立一个非常大的2-D稀疏数组。非常稀疏,如果有所作为。无论如何:此应用程序最关键的方面是时间方面的效率(假定内存负载,尽管没有那么无限的限制,以至于我无法使用标准的2D数组-关键范围在两个维度上都在数十亿之内)。

在数组中的千亿个单元中,将有数十万个单元包含一个对象。我需要能够非常快速地修改单元格内容。

无论如何:有人知道为此目的特别好的图书馆吗?它必须是伯克利,LGPL或类似的许可证(没有GPL,因为该产品不能完全开源)。或者,如果只有一种非常简单的方法来制作自制的稀疏数组对象,那也可以。

我正在考虑MTJ,但尚未听到有关其质量的任何意见。

回答:

使用散列图构建的稀疏数组对于频繁读取的数据效率很低。最有效的实现方式是使用Trie,该Trie允许访问分布有段的单个向量。

Trie可以通过仅执行只读两个数组索引来获取元素存储的有效位置,或知道基础存储中是否不存在元素,从而计算表中是否存在元素。

它也可以为稀疏数组的默认值在后备存储中提供默认位置,因此你不需要对返回的索引进行任何测试,因为Trie保证所有可能的源索引都将至少映射到默认值在后备存储中的位置(你经常会在其中存储零,空字符串或空对象)。

存在支持快速更新Tries的实现,并通过一个“ compact()”操作在多个操作结束时优化后备存储的大小。尝试比哈希映射要快得多,因为它们不需要任何复杂的哈希函数,并且不需要处理读取冲突(对于哈希映射,读取和写入都具有冲突,这需要循环以跳至下一个候选职位,并对其进行测试以比较有效来源索引…)

另外,Java Hashmaps只能在对象上建立索引,并且为每个散列的源索引创建一个Integer对象(每次读取都需要创建该对象,而不仅仅是写入)在内存操作方面是昂贵的,因为它强调了垃圾收集器。

我真的希望JRE包括IntegerTrieMap 作为慢速HashMap 的默认实现或LongTrieMap 作为慢速HashMap 的默认实现…但这是仍然不是这样。

你可能想知道什么是特里?

它只是一小部分整数数组(比矩阵的整个坐标范围小),可以将坐标映射到向量中的整数位置。

例如,假设你想要一个1024 * 1024矩阵,其中仅包含一些非零值。与其将矩阵存储在包含1024 * 1024个元素(超过100万个)的数组中,不如将其拆分为大小为16 * 16的子范围,而只需要64 * 64个这样的子范围即可。

在这种情况下,Trie索引将仅包含64 * 64的整数(4096),并且将至少有16 * 16的数据元素(包含默认零或稀疏矩阵中最常见的子范围)。

并且用于存储值的向量对于彼此相等的子范围将仅包含1个副本(其中大多数为零,它们将由相同的子范围表示)。

因此matrix[i][j],你可以使用类似以下语法,而不是使用类似的语法:

trie.values[trie.subrangePositions[(i & ~15) + (j >> 4)] +

((i & 15) << 4) + (j & 15)]

使用trie对象的访问方法将可以更方便地处理它。

这是一个内置在注释类中的示例(我希望它可以编译成OK(简化);如果有要纠正的错误,请通知我):

/**

* Implement a sparse matrix. Currently limited to a static size

* (<code>SIZE_I</code>, <code>SIZE_I</code>).

*/

public class DoubleTrie {

/* Matrix logical options */

public static final int SIZE_I = 1024;

public static final int SIZE_J = 1024;

public static final double DEFAULT_VALUE = 0.0;

/* Internal splitting options */

private static final int SUBRANGEBITS_I = 4;

private static final int SUBRANGEBITS_J = 4;

/* Internal derived splitting constants */

private static final int SUBRANGE_I =

1 << SUBRANGEBITS_I;

private static final int SUBRANGE_J =

1 << SUBRANGEBITS_J;

private static final int SUBRANGEMASK_I =

SUBRANGE_I - 1;

private static final int SUBRANGEMASK_J =

SUBRANGE_J - 1;

private static final int SUBRANGE_POSITIONS =

SUBRANGE_I * SUBRANGE_J;

/* Internal derived default values for constructors */

private static final int SUBRANGES_I =

(SIZE_I + SUBRANGE_I - 1) / SUBRANGE_I;

private static final int SUBRANGES_J =

(SIZE_J + SUBRANGE_J - 1) / SUBRANGE_J;

private static final int SUBRANGES =

SUBRANGES_I * SUBRANGES_J;

private static final int DEFAULT_POSITIONS[] =

new int[SUBRANGES](0);

private static final double DEFAULT_VALUES[] =

new double[SUBRANGE_POSITIONS](DEFAULT_VALUE);

/* Internal fast computations of the splitting subrange and offset. */

private static final int subrangeOf(

final int i, final int j) {

return (i >> SUBRANGEBITS_I) * SUBRANGE_J +

(j >> SUBRANGEBITS_J);

}

private static final int positionOffsetOf(

final int i, final int j) {

return (i & SUBRANGEMASK_I) * MAX_J +

(j & SUBRANGEMASK_J);

}

/**

* Utility missing in java.lang.System for arrays of comparable

* component types, including all native types like double here.

*/

public static final int arraycompare(

final double[] values1, final int position1,

final double[] values2, final int position2,

final int length) {

if (position1 >= 0 && position2 >= 0 && length >= 0) {

while (length-- > 0) {

double value1, value2;

if ((value1 = values1[position1 + length]) !=

(value2 = values2[position2 + length])) {

/* Note: NaN values are different from everything including

* all Nan values; they are are also neigher lower than nor

* greater than everything including NaN. Note that the two

* infinite values, as well as denormal values, are exactly

* ordered and comparable with <, <=, ==, >=, >=, !=. Note

* that in comments below, infinite is considered "defined".

*/

if (value1 < value2)

return -1; /* defined < defined. */

if (value1 > value2)

return 1; /* defined > defined. */

if (value1 == value2)

return 0; /* defined == defined. */

/* One or both are NaN. */

if (value1 == value1) /* Is not a NaN? */

return -1; /* defined < NaN. */

if (value2 == value2) /* Is not a NaN? */

return 1; /* NaN > defined. */

/* Otherwise, both are NaN: check their precise bits in

* range 0x7FF0000000000001L..0x7FFFFFFFFFFFFFFFL

* including the canonical 0x7FF8000000000000L, or in

* range 0xFFF0000000000001L..0xFFFFFFFFFFFFFFFFL.

* Needed for sort stability only (NaNs are otherwise

* unordered).

*/

long raw1, raw2;

if ((raw1 = Double.doubleToRawLongBits(value1)) !=

(raw2 = Double.doubleToRawLongBits(value2)))

return raw1 < raw2 ? -1 : 1;

/* Otherwise the NaN are strictly equal, continue. */

}

}

return 0;

}

throw new ArrayIndexOutOfBoundsException(

"The positions and length can't be negative");

}

/**

* Utility shortcut for comparing ranges in the same array.

*/

public static final int arraycompare(

final double[] values,

final int position1, final int position2,

final int length) {

return arraycompare(values, position1, values, position2, length);

}

/**

* Utility missing in java.lang.System for arrays of equalizable

* component types, including all native types like double here.

*/

public static final boolean arrayequals(

final double[] values1, final int position1,

final double[] values2, final int position2,

final int length) {

return arraycompare(values1, position1, values2, position2, length) ==

0;

}

/**

* Utility shortcut for identifying ranges in the same array.

*/

public static final boolean arrayequals(

final double[] values,

final int position1, final int position2,

final int length) {

return arrayequals(values, position1, values, position2, length);

}

/**

* Utility shortcut for copying ranges in the same array.

*/

public static final void arraycopy(

final double[] values,

final int srcPosition, final int dstPosition,

final int length) {

arraycopy(values, srcPosition, values, dstPosition, length);

}

/**

* Utility shortcut for resizing an array, preserving values at start.

*/

public static final double[] arraysetlength(

double[] values,

final int newLength) {

final int oldLength =

values.length < newLength ? values.length : newLength;

System.arraycopy(values, 0, values = new double[newLength], 0,

oldLength);

return values;

}

/* Internal instance members. */

private double values[];

private int subrangePositions[];

private bool isSharedValues;

private bool isSharedSubrangePositions;

/* Internal method. */

private final reset(

final double[] values,

final int[] subrangePositions) {

this.isSharedValues =

(this.values = values) == DEFAULT_VALUES;

this.isSharedsubrangePositions =

(this.subrangePositions = subrangePositions) ==

DEFAULT_POSITIONS;

}

/**

* Reset the matrix to fill it with the same initial value.

*

* @param initialValue The value to set in all cell positions.

*/

public reset(final double initialValue = DEFAULT_VALUE) {

reset(

(initialValue == DEFAULT_VALUE) ? DEFAULT_VALUES :

new double[SUBRANGE_POSITIONS](initialValue),

DEFAULT_POSITIONS);

}

/**

* Default constructor, using single default value.

*

* @param initialValue Alternate default value to initialize all

* positions in the matrix.

*/

public DoubleTrie(final double initialValue = DEFAULT_VALUE) {

this.reset(initialValue);

}

/**

* This is a useful preinitialized instance containing the

* DEFAULT_VALUE in all cells.

*/

public static DoubleTrie DEFAULT_INSTANCE = new DoubleTrie();

/**

* Copy constructor. Note that the source trie may be immutable

* or not; but this constructor will create a new mutable trie

* even if the new trie initially shares some storage with its

* source when that source also uses shared storage.

*/

public DoubleTrie(final DoubleTrie source) {

this.values = (this.isSharedValues =

source.isSharedValues) ?

source.values :

source.values.clone();

this.subrangePositions = (this.isSharedSubrangePositions =

source.isSharedSubrangePositions) ?

source.subrangePositions :

source.subrangePositions.clone());

}

/**

* Fast indexed getter.

*

* @param i Row of position to set in the matrix.

* @param j Column of position to set in the matrix.

* @return The value stored in matrix at that position.

*/

public double getAt(final int i, final int j) {

return values[subrangePositions[subrangeOf(i, j)] +

positionOffsetOf(i, j)];

}

/**

* Fast indexed setter.

*

* @param i Row of position to set in the sparsed matrix.

* @param j Column of position to set in the sparsed matrix.

* @param value The value to set at this position.

* @return The passed value.

* Note: this does not compact the sparsed matric after setting.

* @see compact(void)

*/

public double setAt(final int i, final int i, final double value) {

final int subrange = subrangeOf(i, j);

final int positionOffset = positionOffsetOf(i, j);

// Fast check to see if the assignment will change something.

int subrangePosition, valuePosition;

if (Double.compare(

values[valuePosition =

(subrangePosition = subrangePositions[subrange]) +

positionOffset],

value) != 0) {

/* So we'll need to perform an effective assignment in values.

* Check if the current subrange to assign is shared of not.

* Note that we also include the DEFAULT_VALUES which may be

* shared by several other (not tested) trie instances,

* including those instanciated by the copy contructor. */

if (isSharedValues) {

values = values.clone();

isSharedValues = false;

}

/* Scan all other subranges to check if the position in values

* to assign is shared by another subrange. */

for (int otherSubrange = subrangePositions.length;

--otherSubrange >= 0; ) {

if (otherSubrange != subrange)

continue; /* Ignore the target subrange. */

/* Note: the following test of range is safe with future

* interleaving of common subranges (TODO in compact()),

* even though, for now, subranges are sharing positions

* only between their common start and end position, so we

* could as well only perform the simpler test <code>

* (otherSubrangePosition == subrangePosition)</code>,

* instead of testing the two bounds of the positions

* interval of the other subrange. */

int otherSubrangePosition;

if ((otherSubrangePosition =

subrangePositions[otherSubrange]) >=

valuePosition &&

otherSubrangePosition + SUBRANGE_POSITIONS <

valuePosition) {

/* The target position is shared by some other

* subrange, we need to make it unique by cloning the

* subrange to a larger values vector, copying all the

* current subrange values at end of the new vector,

* before assigning the new value. This will require

* changing the position of the current subrange, but

* before doing that, we first need to check if the

* subrangePositions array itself is also shared

* between instances (including the DEFAULT_POSITIONS

* that should be preserved, and possible arrays

* shared by an external factory contructor whose

* source trie was declared immutable in a derived

* class). */

if (isSharedSubrangePositions) {

subrangePositions = subrangePositions.clone();

isSharedSubrangePositions = false;

}

/* TODO: no attempt is made to allocate less than a

* fully independant subrange, using possible

* interleaving: this would require scanning all

* other existing values to find a match for the

* modified subrange of values; but this could

* potentially leave positions (in the current subrange

* of values) unreferenced by any subrange, after the

* change of position for the current subrange. This

* scanning could be prohibitively long for each

* assignement, and for now it's assumed that compact()

* will be used later, after those assignements. */

values = setlengh(

values,

(subrangePositions[subrange] =

subrangePositions = values.length) +

SUBRANGE_POSITIONS);

valuePosition = subrangePositions + positionOffset;

break;

}

}

/* Now perform the effective assignment of the value. */

values[valuePosition] = value;

}

}

return value;

}

/**

* Compact the storage of common subranges.

* TODO: This is a simple implementation without interleaving, which

* would offer a better data compression. However, interleaving with its

* O(N²) complexity where N is the total length of values, should

* be attempted only after this basic compression whose complexity is

* O(n²) with n being SUBRANGE_POSITIIONS times smaller than N.

*/

public void compact() {

final int oldValuesLength = values.length;

int newValuesLength = 0;

for (int oldPosition = 0;

oldPosition < oldValuesLength;

oldPosition += SUBRANGE_POSITIONS) {

int oldPosition = positions[subrange];

bool commonSubrange = false;

/* Scan values for possible common subranges. */

for (int newPosition = newValuesLength;

(newPosition -= SUBRANGE_POSITIONS) >= 0; )

if (arrayequals(values, newPosition, oldPosition,

SUBRANGE_POSITIONS)) {

commonSubrange = true;

/* Update the subrangePositions|] with all matching

* positions from oldPosition to newPosition. There may

* be several index to change, if the trie has already

* been compacted() before, and later reassigned. */

for (subrange = subrangePositions.length;

--subrange >= 0; )

if (subrangePositions[subrange] == oldPosition)

subrangePositions[subrange] = newPosition;

break;

}

if (!commonSubrange) {

/* Move down the non-common values, if some previous

* subranges have been compressed when they were common.

*/

if (!commonSubrange && oldPosition != newValuesLength) {

arraycopy(values, oldPosition, newValuesLength,

SUBRANGE_POSITIONS);

/* Advance compressed values to preserve these new ones. */

newValuesLength += SUBRANGE_POSITIONS;

}

}

}

/* Check the number of compressed values. */

if (newValuesLength < oldValuesLength) {

values = values.arraysetlength(newValuesLength);

isSharedValues = false;

}

}

}

注意:此代码不完整,因为它只能处理单个矩阵大小,并且其压缩器仅限于仅检测公共子范围而不进行交错。

另外,代码不会根据矩阵大小确定将矩阵划分为子范围(用于x或y坐标)的最佳宽度或高度。它仅使用相同的静态子范围大小16(对于两个坐标),但可以方便地使用其他任何2的小数幂(但是如果不使用2的幂,则会降低int indexOf(int, int)和int offsetOf(int, int)内部方法的速度),对于两个坐标和向上理想情况下,该compact()方法应该能够确定最佳的拟合尺寸。

如果这些分裂子区域的大小可以改变,那么将有必要添加实例成员对这些小范围的大小,而不是静态的SUBRANGE_POSITIONS,并且使静态方法int subrangeOf(int i, int j)int positionOffsetOf(int i, int j)成非静态的; 和初始化数组DEFAULT_POSITIONS,DEFAULT_VALUES则需要将其删除或重新定义。

如果要支持交织,则基本上是从将现有值分成两个大小相同的值开始(两者都是最小子范围大小的倍数,第一个子集可能比第二个子集多一个子范围),以及你将在所有连续位置扫描较大的一个,以找到匹配的交错;那么你将尝试匹配这些值。然后,将子集分成两半(也是最小子范围大小的倍数),以递归循环,然后再次扫描以匹配这些子集(这会将子集的数量乘以2:与当前值的大小相比,subrangePositions索引的大小值得该值,以查看其是否提供有效的压缩(如果没有,则在此处停止:你直接从交错压缩过程中找到了最佳子范围大小)。在这种情况下; 在压实期间,子范围的大小将是可变的。

但是这段代码显示了如何分配非零值并data为其他(非零)子范围重新分配数组,然后当存在重复项时如何优化(compact()使用该setAt(int i, int j, double value)方法执行分配后)此数据的存储可以在数据中统一的子范围,并在subrangePositions数组中的同一位置重新索引。

无论如何,特里的所有原理都在这里实现:

使用单个向量而不是双索引数组(每个数组单独分配)来表示矩阵总是更快(并且在内存中更紧凑,意味着更好的局部性)。改进在double getAt(int, int)方法中可见!

你节省了大量空间,但是在分配值时,可能需要一些时间重新分配新的子范围。因此,子范围不应太小,否则重新设置对于设置矩阵而言会过于频繁。

通过检测公共子范围,可以将初始的大矩阵自动转换为更紧凑的矩阵。然后,典型的实现将包含上述方法compact()。但是,如果get()访问非常快而set()相当快,则如果有很多要压缩的常见子范围(例如,用自身减去大型非稀疏随机填充矩阵时),compact()可能会非常慢。 ,或将其乘以零:在这种情况下,通过实例化一个新的并丢弃旧的来重置Trie会更简单,更快。

公用子范围在数据中使用公用存储,因此此共享数据必须是只读的。如果必须更改单个值而不更改矩阵的其余部分,则必须首先确保在subrangePositions索引中仅对其进行一次引用。否则,你将需要在values向量的任意位置(通常在末尾)分配一个新的子范围,然后将该新子范围的位置存储到subrangePositions索引中。

请注意,通用Colt库虽然很好,但在处理稀疏矩阵时却不如它好,因为它使用了哈希(或行压缩)技术,尽管它是一个出色的优化,这既是节省空间和节省时间的,特别是对于最常见的getAt()操作。

即使是此处描述的setAt()操作也可以节省大量时间(此处实现的方式,即在设置后无需自动压缩,仍然可以根据需求和估计时间来实现,其中压缩仍将节省大量存储空间时间价格):节省的时间与子范围内的单元格数量成正比,而节省的空间则与每个子范围内的单元格数量成反比。如果要使用子范围大小,则最好妥协,例如每个子范围的像元数是2D矩阵中像元总数的平方根(在使用3D矩阵时,它将是立方根)。

在Colt稀疏矩阵实现中使用的哈希技术的不便之处在于,它们增加了很多存储开销,并且由于可能的冲突而导致访问时间变慢。尝试可以避免所有冲突,然后可以保证在最坏的情况下将线性O(n)时间节省为O(1)时间,其中(n)是可能发生冲突的次数(在稀疏矩阵的情况下,可能是对于非稀疏(即完整矩阵),最大为矩阵中非默认值像元的数量,即最大为矩阵大小的总数乘以与哈希填充因子成比例的因子)。

在Colt中使用的RC(行压缩)技术距离Tries较近,但这是另一种代价,这里使用的压缩技术对最频繁的只读get()操作具有非常慢的访问时间,并且非常慢setAt()操作的压缩。此外,所使用的压缩不是正交的,这与Tries的这种保留正交性的表示方式不同。对于相关的查看操作(例如跨步,换位(基于整数循环模块化操作的跨步操作),细分(通常包括子视图,包括子视图)),也将尝试保留这种正交性。

我只是希望柯尔特将来会更新,以使用Tries实现另一种实现(即TrieSparseMatrix而不是HashSparseMatrix和RCSparseMatrix)。这些想法在本文中。

Trove实现(基于int-> int映射)也基于类似于Colt的HashedSparseMatrix的哈希技术,即它们具有相同的不便之处。尝试会快很多,而且会占用适度的额外空间(但是可以对最终的矩阵/ trie使用最终的compact()ion操作,在延迟的时间内优化并变得比Trove和Colt更好)。

注意:此Trie实现绑定到特定的本机类型(此处为double)。这是自愿的,因为使用装箱类型的通用实现具有巨大的空间开销(并且访问时间要慢得多)。在这里,它仅使用双精度的原生一维数组,而不是通用Vector。但是,当然也可以为Tries派生通用实现…不幸的是,Java仍然不允许编写具有本机类型所有优点的真正通用类,除非编写多个实现(对于通用Object类型或每个通用类型)。本机类型),并通过类型工厂为所有这些操作提供服务。该语言应该能够自动实例化本机实现并自动构建工厂(目前,即使在Java 7中也不是这种情况,这是在其中。

以上是 Java中的稀疏矩阵/数组 的全部内容, 来源链接: utcz.com/qa/419899.html

关于【Z时代】

UTC,指协调世界时,又称世界统一时间、世界标准时间,Zulu Time指祖鲁时间,UTCZ由此而来。 她提供标准北京时间查询校准服务,同时也是一个集办公软件、常用工具、科技产品、智能电子设备、游戏 周边、IT技术、生活常识等综合型百科知识分享科普宣传网站。

联系我们

商务合作:itysh8@gmail.com

联系站长:itysh8@gmail.com

投诉建议:aiasus2010@gmail.com

手机站点二维码

扫码手机访问

回到顶部