
java.lang.ArrayIndexOutOfBoundsException #3

Open

ludwig-nc opened this issue Nov 21, 2018 · 4 comments

@ludwig-nc
I am running on Spark 2.1.0 and could compile the library without problems. I can run the examples and a smaller subset of my problem (10,000 data points), but when I increase the problem size (100,000 data points) I get the following error. Any ideas? Do the data point IDs need to be contiguous from 0 to len(data_points)? I used hashed values for the IDs, and it almost looks like an array element is being accessed by ID.

java.lang.ArrayIndexOutOfBoundsException: 33567921
at org.apache.spark.graphx.impl.EdgePartition.aggregateMessagesEdgeScan(EdgePartition.scala:395)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$anonfun$apply$3.apply(GraphImpl.scala:237)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$anonfun$apply$3.apply(GraphImpl.scala:207)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

@viirya
Owner

viirya commented Dec 23, 2018

What do you mean by hashed values for the IDs?

@ludwig-nc
Author

The IDs for my elements are strings, but the framework uses Longs for IDs. I map each string to a Long by taking the first 8 bytes of its MD5 hash as an integer. This results in ID numbers that are non-contiguous, which may or may not be the cause of the problem. In short, does the framework assume that the IDs are contiguous from 0 to N, where N is the number of data points?
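
For concreteness, a minimal sketch of that mapping (the helper name `stringToLongId` is hypothetical; this is just one straightforward way to read the first 8 bytes of an MD5 digest as a Long):

```scala
import java.nio.ByteBuffer
import java.security.MessageDigest

// Hypothetical helper: map a string ID to a Long by taking the first
// 8 bytes of its MD5 digest, as described above.
def stringToLongId(s: String): Long = {
  val digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"))
  // The resulting value spans the full Long range and can be negative,
  // so it cannot safely be used as a direct array index.
  ByteBuffer.wrap(digest, 0, 8).getLong
}
```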

@540667387

Did the author ever find out the reason? I think I made the same mistake.

@viirya
Owner

viirya commented May 30, 2021

Hmm, I don't think this requires contiguous IDs. As you can see, the input similarities are an RDD[(Long, Long, Double)], and the similarity matrix can be sparse.

Could you provide a complete stack trace?
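
For illustration, a minimal sketch of such an input, assuming a live SparkContext named `sc` (the IDs and weights here are made up):

```scala
import org.apache.spark.rdd.RDD

// Sparse similarity entries with arbitrary, non-contiguous Long IDs:
// only the pairs that actually have a similarity value are present.
val similarities: RDD[(Long, Long, Double)] = sc.parallelize(Seq(
  (33567921L, 42L, 0.9),
  (42L, 7000000000L, 0.3)
))
```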
