⬆️ [server] Upgrade to the latest everything #856

Closed
shankari opened this issue Feb 7, 2023 · 46 comments

Comments

@shankari
Contributor

shankari commented Feb 7, 2023

This is tracking the server upgrade to the most recent version of everything.
This will involve upgrading miniconda and all the dependencies.
The versions are all defined in files in setup.

After upgrade, make sure that all the unit tests pass, and some basic end-to-end tests from the phone to the server still work
If you wanted to clean up the setup directory to remove obsolete files (that have not been edited for ~ 3 years) that would be great as well.

@shankari
Contributor Author

shankari commented Feb 7, 2023

Looks like the last full upgrade was
e-mission/e-mission-server#761

The big part of the work took ~ 2 days but the final commit was closer to a week later.
There were probably gaps in the middle when I worked on other things as well though.

@swastis10

swastis10 commented Feb 15, 2023

While upgrading PyMongo from 3.11.0 to 4.3.3, I have been getting these errors:
ValueError: cannot encode native uuid.UUID with UuidRepresentation.UNSPECIFIED. UUIDs can be manually converted to bson.Binary instances using bson.Binary.from_uuid() or a different UuidRepresentation can be configured. See the documentation for UuidRepresentation for more information.

The solution for it can be found here: https://pymongo.readthedocs.io/en/stable/examples/uuid.html#handling-uuid-data-example

We are using the 'python_legacy' UUID representation here instead of 'standard'. The 'standard' representation is meant for new applications, or for applications that are encoding and/or decoding UUIDs in MongoDB for the first time. 'python_legacy' is meant for existing applications that already use the Python driver to encode and/or decode UUIDs.

Ours is a python application that has already been storing data in databases for many months. It is NOT "encoding or decoding for the first time". If we used standard, although the unit tests might run, it would break in production.

@swastis10

After handling UUID representation using 'python_legacy', numerous test cases have started failing. One of the failing test case errors is:
FAIL: testGetTimeRangeForTrip (emission.tests.storageTests.TestTripQueries.TestTripQueries)
Traceback (most recent call last):
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/storageTests/TestTripQueries.py", line 56, in testGetTimeRangeForTrip
    new_trip = self.create_fake_trip()
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/storageTests/TestTripQueries.py", line 53, in create_fake_trip
    return etsa.createNewTripLike(self, esda.RAW_TRIP_KEY, ecwrt.Rawtrip)
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/storageTests/analysis_ts_common.py", line 32, in createNewTripLike
    utest.assertEqual(new_trip_entry.user_id, utest.testUserId)
AssertionError: UUID('3f205663-671b-3333-91d8-347d74a88f83') != Binary(b'? Vcg\x1b33\x91\xd84}t\xa8\x8f\x83', 3)

As we can see in the last line of the error, one user_id is a Binary and the other one is not. Therefore, I believe I am not handling the UUID representation somewhere.

@shankari
Contributor Author

@swastis10 have you tried deleting and recreating your database? If there are any pending binary entries from your prior experiment with setting it to STANDARD, they may be messing up the lookups. If you do that and the tests still fail, I agree that you need to look through all locations and convert them

@swastis10

ERROR: testGeoWithNegativeValues (storageTests.TestMongoGeoJSONQueries.TestGeoJSON)
Traceback (most recent call last):
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/storageTests/TestMongoGeoJSONQueries.py", line 20, in tearDown
    get_test_db().remove()
  File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 3213, in __call__
    raise TypeError(
TypeError: 'Collection' object is not callable. If you meant to call the 'remove' method on a 'Collection' object it is failing because no such method exists.

The solution can be found at:
https://pymongo.readthedocs.io/en/stable/migrate-to-pymongo4.html#collection-remove-is-removed

PyMongo 4.x no longer has Collection.remove(), so I will change these calls to delete_one() or delete_many().

@swastis10

FAIL: testResetToStart (emission.tests.pipelineTests.TestPipelineReset.TestPipelineReset)
Load data for both days
Traceback (most recent call last):
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/pipelineTests/TestPipelineReset.py", line 144, in testResetToStart
    self.compare_result(ad.AttrDict({'result': api_result}).result,
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/pipelineTests/TestPipelineReset.py", line 82, in compare_result
    self.assertAlmostEqual(rt.properties.distance, et.properties.distance, places=2)
AssertionError: 78703.08791302175 != 78703.13870911207 within 2 places (0.05079609031963628 difference)

Approach:
Comparing the testcase with the master branch and looking for discrepancies at each step.

@swastis10

Update on the testGetTimeRangeForTrip failure above, where one user_id was a UUID and the other was a Binary:

Using MongoClient(url, uuidRepresentation='pythonLegacy') in get_database.py fixed the issue. I was earlier using Binary.from_uuid to encode the UUID, which is not required because the driver takes care of the conversion on its own.
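
As a sanity check on the assertion message, the two sides of the failed comparison carry the same 16 bytes; only the wrapper type differs (a native uuid.UUID versus a BSON Binary with subtype 3, the legacy UUID subtype). The sketch below uses only the standard library and does not call PyMongo; the variable names are illustrative.

```python
import uuid

test_user_id = uuid.UUID("3f205663-671b-3333-91d8-347d74a88f83")

# Payload copied from the Binary(...) value in the failing assertion.
binary_payload = b"? Vcg\x1b33\x91\xd84}t\xa8\x8f\x83"

# The raw bytes are identical, so the mismatch is purely a type mismatch.
same_bytes = test_user_id.bytes == binary_payload

# Rebuilding a UUID from the payload gives back the original value.
round_tripped = uuid.UUID(bytes=binary_payload)

print(same_bytes, round_tripped)
```

This is consistent with the fix: once the client is configured with a UUID representation, the driver converts in both directions and the test compares UUID to UUID.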

@shankari
Contributor Author

@swastis10

AssertionError: 78703.08791302175 != 78703.13870911207 within 2 places (0.05079609031963628 difference)

This seems like it is very close and is basically related to arithmetic.
Are there any known arithmetic changes in the most recent versions of python?
e.g. during the python2 -> python3 migration, the meaning of / changed

@swastis10

@shankari Yes! In Python 2.x, if you divide one integer by another, you get an integer result, so 5/2 = 2 instead of 2.5. You get floor division, not true division (Python – Changing the Division Operator). In Python 3, true division is the default, but in Python 2.x you need to make one of the numbers a float to get a float back, so 5.0/2 = 2.5.
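
A small illustration of the Python 3 behaviour described above (the variable names are only for this example):

```python
true_division = 5 / 2     # "/" is always true division in Python 3
floor_division = 5 // 2   # "//" gives the floor, like "/" on ints in Python 2
float_operand = 5.0 / 2   # a float operand gives a float in both versions

print(true_division, floor_division, float_operand)
```

Since both the old and the new environment in this upgrade are Python 3, this particular change cannot explain the distance difference.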

@shankari
Contributor Author

shankari commented Feb 17, 2023

@swastis10 yes, I had to deal with that during the python2 -> python3 migration (which was more complex than this BTW). But my question was whether there was any change in the recent versions of python aka the ones affected by this upgrade?
Note that we are migrating from 3.x to 3.x at this point.

@swastis10

FAIL: testUserInputRealData (emission.tests.storageTests.TestTripQueries.TestTripQueries) Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/storageTests/TestTripQueries.py", line 208, in testUserInputRealData self.assertEqual(len(ct_df), 4) AssertionError: 0 != 4

ERROR:root:check start: [0.0, 0.03622336951451534, 1.1045714738089647, 1.1045722793516248, 1.104573084938604, 1.1045738904760256, 1.1045746960577667, 1.1045755015899497, 1.1045763071664523, 1.1045771126933965] != [0.0, 0.03718177767684137, 1.1045714738089647, 1.1045722793516248, 1.104573084938604, 1.1045738904760256, 1.1045746960577667, 1.1045755015899497, 1.1045763071664523, 1.1045771126933965]
ERROR:root:check end: [0.08014813391947762, 0.08014814035214789, 0.08014814681254091, 0.08014815324840159, 0.06428355605243694, 0.7701516234477649, 0.02431805135307263, 0.0, 0.27697211707197156, 0.2861801010260156] != [0.08014813391947762, 0.08014814035214789, 0.08014814681254091, 0.08014815324840159, 0.06428355605243694, 0.7701516234477649, 0.02431805135307263, 0.0, 0.27697211707197156, 0.2861801010260156]
ERROR:root:Cleaning and resampling failed for user 489ce4dc-cc49-431a-ad26-53d371dcd5a0
Traceback (most recent call last):
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 82, in clean_and_resample

On line 1141 of clean_and_resample, the output of master is:
fix_squished_place: after recomputing for validation, with_speeds_df.head =
                         _id            ts  ...     speed  from_points_speed
0   63eed158d88a0f6f5204b310  1.481592e+09  ...  0.000000           0.000000
1   63eed157d88a0f6f5204b28e  1.481592e+09  ...  0.036223           0.036223
2   63eed157d88a0f6f5204b28f  1.481592e+09  ...  1.104616           1.104616
3   63eed157d88a0f6f5204b290  1.481592e+09  ...  1.104616           1.104616
4   63eed157d88a0f6f5204b291  1.481592e+09  ...  1.104617           1.104617
[5 rows x 8 columns]
fix_squished_place: after recomputing for validation, with_speeds_df.tail =
                         _id            ts  ...     speed  from_points_speed
66  63eed157d88a0f6f5204b2cf  1.481594e+09  ...  0.770152           0.770152
67  63eed157d88a0f6f5204b2d0  1.481594e+09  ...  0.024039           0.024039
68  63eed157d88a0f6f5204b2d1  1.481594e+09  ...  0.000000           0.000000
69  63eed157d88a0f6f5204b2d2  1.481594e+09  ...  0.277076           0.277076
70  63eed157d88a0f6f5204b2d3  1.481594e+09  ...  0.286180           0.286180

The output of my branch is:
fix_squished_place: after recomputing for validation, with_speeds_df.head =
                         _id            ts  ...     speed  from_points_speed
0   63eed6ed6ce89356fae757e9  1.481592e+09  ...  0.000000           0.000000
1   63eed6ec6ce89356fae75767  1.481592e+09  ...  0.036223           0.037182
2   63eed6ec6ce89356fae75768  1.481592e+09  ...  1.104571           1.104571
3   63eed6ec6ce89356fae75769  1.481592e+09  ...  1.104572           1.104572
4   63eed6ec6ce89356fae7576a  1.481592e+09  ...  1.104573           1.104573
[5 rows x 8 columns]
fix_squished_place: after recomputing for validation, with_speeds_df.tail =
                         _id            ts  ...     speed  from_points_speed
66  63eed6ec6ce89356fae757a8  1.481594e+09  ...  0.770152           0.770152
67  63eed6ec6ce89356fae757a9  1.481594e+09  ...  0.024318           0.024318
68  63eed6ec6ce89356fae757aa  1.481594e+09  ...  0.000000           0.000000
69  63eed6ec6ce89356fae757ab  1.481594e+09  ...  0.276972           0.276972
70  63eed6ec6ce89356fae757ac  1.481594e+09  ...  0.286180           0.286180

@shankari
Contributor Author

shankari commented Feb 17, 2023

@swastis10 I'm squinting really hard, but it clearly looks like some kind of arithmetic issue - e.g.

0.036223           0.036223
0.036223           0.037182

or 

1.104616           1.104616
1.104571           1.104571

or

0.277076           0.277076
0.276972           0.276972

BTW, thanks for formatting to make that easier to see. Note also that the new values are always lower by around 0.0001
There can be two causes for this: python or pandas (since this is in the context of a pandas dataframe).

I would suggest two approaches going forward:

@swastis10

swastis10 commented Feb 22, 2023

I am upgrading from Python 3.7.12 to 3.9.16.
Going through the documentation, I do not see any arithmetic changes between Python 3.7 and 3.9.
In pandas, too, I do not see any such changes between v1.1.0 and v1.5.3.

loc_df just before add_dist_heading_speed is already different, which means this is not a pandas issue. Lowering the pandas version to v1.1.0 also does not fix the problem.

I put log statements in the code and noticed that cleaned_start_place_data.location.coordinates in _fix_squished_place_mismatch has only 6 decimal places, whereas in master it has 7 decimal places. When the distance between these points is calculated in calDistance, the result is now different, and hence our test case fails.

I traced back through the code and realised that get_timeline_from_dt in timeline.py generates this cleaned_start_place_data using GeoJson.

After lowering the version of GeoJson to 2.4.1, the test case starts passing.

This default precision is an intentional change in GeoJson v2.5 - jazzband/geojson#135

Proposed solutions:

  1. We can provide a precision value wherever we are creating GeoJSON objects, e.g. DEFAULT_PRECISION = 6 (https://github.com/rogerlew/geojson/blob/master/geojson/geometry.py)
  2. Round off the latitude/longitude points to 6 decimal places
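
To illustrate how rounding coordinates to 6 decimal places can shift a computed distance, here is a rough sketch. It uses a standard haversine formula as a stand-in for calDistance (whose actual formula is not shown in this thread), an assumed mean Earth radius, and a hypothetical second point; the first point is a coordinate pair that appears in the test data quoted in this issue.

```python
import math

EARTH_RADIUS_M = 6371000  # assumed mean Earth radius in meters


def haversine_m(p1, p2):
    """Great-circle distance in meters between two [lon, lat] points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p1[0], p1[1], p2[0], p2[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def truncate_precision(point, precision=6):
    """Round every coordinate to the given number of decimal places."""
    return [round(c, precision) for c in point]


start = [-122.26827, 37.8700362]   # 7 decimal places in the latitude
end = [-122.2593921, 37.8725514]   # hypothetical second point

full = haversine_m(start, end)
rounded = haversine_m(truncate_precision(start), truncate_precision(end))

print(full, rounded, abs(full - rounded))
```

The per-segment difference is tiny, but it is non-zero, and summed over many points in a trip it could plausibly add up to a drift of the size seen in the failing distance assertions.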

@swastis10

swastis10 commented Feb 22, 2023

ERROR: testLocalDateReadWrite (storageTests.TestLocalDateQueries.TestLocalDateQueries) Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/storageTests/TestLocalDateQueries.py", line 56, in testLocalDateReadWrite self.assertGreaterEqual(ret_entry.metadata.write_ts, start_ts) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/unittest/case.py", line 1239, in assertGreaterEqual if not a >= b: TypeError: '>=' not supported between instances of 'float' and 'method'

Caused by the upgrade of the Arrow dependency.
Fix:
v0.13.1 - arrow.now().timestamp
v1.2.3 - arrow.now().int_timestamp
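
The failure mode can be reproduced without Arrow. The sketch below uses two made-up stand-in classes: one where timestamp is a property (as in the old Arrow version used here) and one where it is a method with a separate int_timestamp property (as in the new version). The class names and values are hypothetical.

```python
class OldStyleTime:
    """Stand-in for the old behaviour: `timestamp` is a property."""

    @property
    def timestamp(self):
        return 1663621200


class NewStyleTime:
    """Stand-in for the new behaviour: `timestamp` is a method."""

    def timestamp(self):
        return 1663621200.0

    @property
    def int_timestamp(self):
        return 1663621200


write_ts = 1663621300.0

# Old style: attribute access already yields a number.
old_ok = write_ts >= OldStyleTime().timestamp

# New style without parentheses: we compare a float with a bound method.
try:
    write_ts >= NewStyleTime().timestamp
    error_message = None
except TypeError as e:
    error_message = str(e)

# Either of the two fixes yields a number again.
fixed_with_call = write_ts >= NewStyleTime().timestamp()
fixed_with_property = write_ts >= NewStyleTime().int_timestamp

print(old_ok, error_message, fixed_with_call, fixed_with_property)
```

The captured message matches the TypeError in the traceback above, which is a quick way to recognise other call sites that still need the same change.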

@swastis10

swastis10 commented Feb 22, 2023

ERROR: testPandasMergeBehavior (analysisTests.intakeTests.TestFilterAccuracy.TestFilterAccuracy) Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/analysisTests/intakeTests/TestFilterAccuracy.py", line 168, in testPandasMergeBehavior merged_left_idx = df_a.merge(df_b, on="ts", how="inner", left_index=True) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/frame.py", line 10093, in merge return merge( File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 110, in merge op = _MergeOperation( File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 685, in __init__ self.left_on, self.right_on = self._validate_left_right_on(left_on, right_on) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 1454, in _validate_left_right_on raise MergeError( pandas.errors.MergeError: Can only pass argument "on" OR "left_index" and "right_index", not a combination of both.

Caused by the pandas update.
From https://pandas.pydata.org/docs/reference/api/pandas.merge.html:

on : label or list
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

In the new version of pandas, we cannot use left_index/right_index together with on. Instead, we can specify the column on which we want to merge the dataframes and set the indexes using functions like set_index or reset_index.

@swastis10

ERROR: testAug10MultiSyncEndDetected (analysisTests.intakeTests.TestPipelineRealData.TestPipelineRealData) Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/analysisTests/intakeTests/TestPipelineRealData.py", line 396, in testAug10MultiSyncEndDetected logging.debug("ts_1030 = %s, converted back = %s" % (ts_1030, arrow.get(ts_1030).to("America/Los_Angeles"))) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/arrow/api.py", line 91, in get return _factory.get(*args, **kwargs) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/arrow/factory.py", line 267, in get raise TypeError(f"Cannot parse single argument of type {type(arg)!r}.") TypeError: Cannot parse single argument of type <class 'method'>.

Caused by the Arrow dependency upgrade.

@swastis10

ERROR: testGroupedByOneLocalDayMultiTzGoingEast (emission.tests.analysisTests.resultTests.TestTimeGrouping.TestTimeGrouping) Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/analysisTests/resultTests/TestTimeGrouping.py", line 497, in testGroupedByOneLocalDayMultiTzGoingEast self._fillModeDistanceDuration(test_section_list) File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/analysisTests/resultTests/TestTimeGrouping.py", line 268, in _fillModeDistanceDuration dw.sensed_mode = ecwm.PredictedModeTypes.BICYCLING File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/core/wrapper/wrapperbase.py", line 99, in __setattr__ raise AttributeError("property %s is not defined for %s" % (key, self.__class__.__name__)) AttributeError: property sensed_mode is not defined for Confirmedtrip

@swastis10

swastis10 commented Feb 22, 2023

FAIL: testJumpSmoothingSectionsStraddle (emission.tests.analysisTests.intakeTests.TestPipelineRealData.TestPipelineRealData) Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/analysisTests/intakeTests/TestPipelineRealData.py", line 332, in testJumpSmoothingSectionsStraddle self.compare_result(ad.AttrDict({'result': api_result}).result, File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/analysisTests/intakeTests/TestPipelineRealData.py", line 91, in compare_result self.assertEqual(len(result), len(expect)) AssertionError: 0 != 7

Fix -
arrow-py/arrow#612

Timestamp strings are no longer supported in the arrow.get() method without a format string: arrow.get("1565358758"). This change was made to support the ISO 8601 basic format and to address bugs

@swastis10

swastis10 commented Feb 22, 2023

ERROR: testResetToStart (pipelineTests.TestPipelineReset.TestPipelineReset) Load data for both days Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/pipelineTests/TestPipelineReset.py", line 156, in testResetToStart self.assertEqual(api_result, []) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/unittest/case.py", line 837, in assertEqual assertion_func(first, second, msg=msg) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/unittest/case.py", line 1043, in assertListEqual self.assertSequenceEqual(list1, list2, msg, seq_type=list) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/unittest/case.py", line 1020, in assertSequenceEqual difflib.ndiff(pprint.pformat(seq1).splitlines(), File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 58, in pformat return PrettyPrinter(indent=indent, width=width, depth=depth, File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 153, in pformat self._format(object, sio, 0, 0, {}, 0) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 170, in _format rep = self._repr(object, context, level) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 431, in _repr repr, readable, recursive = self.format(object, context.copy(), File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 444, in format return _safe_repr(object, context, maxlevels, level, self._sort_dicts) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 587, in _safe_repr orepr, oreadable, orecur = _safe_repr(o, context, maxlevels, level, sort_dicts) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 596, in _safe_repr rep = repr(object) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/geojson/base.py", line 27, in __repr__ return 
geojson.dumps(self, sort_keys=True) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/geojson/codec.py", line 31, in dumps return json.dumps(to_mapping(obj), File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/__init__.py", line 234, in dumps return cls( File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/encoder.py", line 199, in encode chunks = self.iterencode(o, _one_shot=True) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/encoder.py", line 257, in iterencode return _iterencode(o, 0) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/geojson/codec.py", line 14, in default return geojson.factory.GeoJSON.to_instance(obj) # NOQA File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/geojson/base.py", line 101, in to_instance mapping = to_mapping(ob) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/geojson/mapping.py", line 41, in to_mapping return json.loads(json.dumps(obj)) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/__init__.py", line 231, in dumps return _default_encoder.encode(obj) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/encoder.py", line 199, in encode chunks = self.iterencode(o, _one_shot=True) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/encoder.py", line 257, in iterencode return _iterencode(o, 0) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/encoder.py", line 179, in default raise TypeError(f'Object of type {o.__class__.__name__} ' TypeError: Object of type ObjectId is not JSON serializable

@swastis10

FAIL: testGetSingleStateResets (pipelineTests.TestPipelineReset.TestPipelineReset) Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/pipelineTests/TestPipelineReset.py", line 717, in testGetSingleStateResets self.assertEqual(reset_ts_df.reset_ts.to_list(), AssertionError: Lists differ: [1663621200.0, 1663707600.0, 1663794000.0] != [<bound method Arrow.timestamp of <Arrow [[166 chars]0]>>] First differing element 0: 1663621200.0 <bound method Arrow.timestamp of <Arrow [2022-09-19T21:00:00+00:00]>> [1663621200.0, 1663707600.0, 1663794000.0] [<bound method Arrow.timestamp of <Arrow [2022-09-19T21:00:00+00:00]>>, <bound method Arrow.timestamp of <Arrow [2022-09-20T21:00:00+00:00]>>, <bound method Arrow.timestamp of <Arrow [2022-09-21T21:00:00+00:00]>>]
Caused by the upgrade of the Arrow dependency.
Fix:
v0.13.1 - arrow.now().timestamp
v1.2.3 - arrow.now().timestamp()

@shankari
Contributor Author

shankari commented Feb 22, 2023

wrt #856 (comment) (the geojson precision issue), I vote for whatever the long-term plans for the geojson community are. I looked at the related issue jazzband/geojson#135, and it looks like the long-term plans have changed.

The original plan was to have the truncated precision be the long term.
However, based on interoperability with mercantile etc, the plan is to revert the change.
There is a patch pending which hasn't been merged yet.

So we should work around this by using the suggested global monkey patch
jazzband/geojson#135 (comment)

We should add a link to this comment in the commit message
and add a comment just before the monkey patch indicating that it should be removed in the next server update, by which time hopefully the new geojson version will incorporate the long-term fix jazzband/geojson#177

@shankari
Contributor Author

TypeError: Object of type ObjectId is not JSON serializable

happens when we try to serialize a document read from MongoDB directly (it will have a binary objectID and a binary UUID) without using the default parameter to json.dump(s)

import json
import bson.json_util as bju
json.dump(..., default=bju.default)
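
To show how the default parameter works without needing a database, here is a standard-library sketch. The fallback function is a hypothetical stand-in for bson.json_util.default and only handles uuid.UUID; the document contents and the "$uuid" output key are made up for illustration.

```python
import json
import uuid

doc = {
    "user_id": uuid.UUID("3f205663-671b-3333-91d8-347d74a88f83"),
    "distance": 78703.09,
}

# Without `default`, json cannot serialize the UUID value.
try:
    json.dumps(doc)
    plain_error = None
except TypeError as e:
    plain_error = str(e)


def fallback(obj):
    """Called by json for any object it does not know how to serialize."""
    if isinstance(obj, uuid.UUID):
        return {"$uuid": str(obj)}
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")


serialized = json.dumps(doc, default=fallback)

print(plain_error)
print(serialized)
```

With documents read from MongoDB, passing bju.default plays the role of fallback here and also covers ObjectId values.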

@swastis10

swastis10 commented Feb 22, 2023

ERROR: testAutoResetReal (pipelineTests.TestPipelineReset.TestPipelineReset) Load data for both days Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/pipelineTests/TestPipelineReset.py", line 873, in testAutoResetReal epr.auto_reset(dry_run=False, only_calc=False) File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/pipeline/reset.py", line 357, in auto_reset reset_user_to_ts(invalid_state['user_id'], invalid_state['reset_ts'], dry_run) File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/pipeline/reset.py", line 79, in reset_user_to_ts reset_last_place(last_cleaned_place, is_dry_run) File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/pipeline/reset.py", line 171, in reset_last_place result = edb.get_analysis_timeseries_db().update(match_query, reset_query) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 3213, in __call__ raise TypeError( TypeError: 'Collection' object is not callable. If you meant to call the 'update' method on a 'Collection' object it is failing because no such method exists.

Fix:

Removed pymongo.collection.Collection.update(). Use update_one() to update a single document or update_many() to update multiple documents

@swastis10

FAIL: testGroupedByOneLocalDayMultiTzGoingEast (analysisTests.resultTests.TestTimeGrouping.TestTimeGrouping) Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/analysisTests/resultTests/TestTimeGrouping.py", line 532, in testGroupedByOneLocalDayMultiTzGoingEast self.assertEqual(summary_ld_last, AssertionError: 1462327200.0 != <bound method Arrow.timestamp of <Arrow [2016-05-04T03:00:00+01:00]>>

Caused by the upgrade of the Arrow dependency.
Fix:
v0.13.1 - arrow.now().timestamp
v1.2.3 - arrow.now().timestamp()

@swastis10

FAIL: testResetToPastWithCrash (pipelineTests.TestPipelineReset.TestPipelineReset)
Load data for one day
Traceback (most recent call last):
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/pipelineTests/TestPipelineReset.py", line 609, in testResetToPastWithCrash
    self.assertNotIn("exit_ts", last_raw_place_first_day["data"])
AssertionError: 'exit_ts' unexpectedly found in {'source': 'DwellSegmentationTimeFilter', 'enter_ts': 1469233296, 'enter_local_dt': {'hour': 17, 'month': 7, 'second': 36, 'weekday': 4, 'year': 2016, 'timezone': 'America/Los_Angeles', 'day': 22, 'minute': 21}, 'enter_fmt_time': '2016-07-22T17:21:36-07:00', 'ending_trip': ObjectId('63f6cdf7bc9b57d0ea35a25c'), 'location': {'type': 'Point', 'coordinates': [-122.26827, 37.8700362]}, 'exit_ts': 1469233301, 'exit_local_dt': {'hour': 17, 'month': 7, 'second': 36, 'weekday': 4, 'year': 2016, 'timezone': 'America/Los_Angeles', 'day': 22, 'minute': 21}, 'exit_fmt_time': '2016-07-22T17:21:36-07:00', 'starting_trip': ObjectId('63f6cdf7bc9b57d0ea35a260'), 'duration': 5}

@shankari
Contributor Author

self.assertNotIn("exit_ts", last_raw_place_first_day["data"])
AssertionError: 'exit_ts' unexpectedly found in {'source': 'DwellSegmentationTimeFilter',

This seems similar to the confirmed_trip issue in #856 (comment) in that there is some weirdness in running the pipeline

@swastis10

ERROR: testResetToStart (pipelineTests.TestPipelineReset.TestPipelineReset) Load data for both days Traceback (most recent call last): File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/pipelineTests/TestPipelineReset.py", line 156, in testResetToStart self.assertEqual(api_result, []) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/unittest/case.py", line 837, in assertEqual assertion_func(first, second, msg=msg) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/unittest/case.py", line 1043, in assertListEqual self.assertSequenceEqual(list1, list2, msg, seq_type=list) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/unittest/case.py", line 1020, in assertSequenceEqual difflib.ndiff(pprint.pformat(seq1).splitlines(), File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 58, in pformat return PrettyPrinter(indent=indent, width=width, depth=depth, File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 153, in pformat self._format(object, sio, 0, 0, {}, 0) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 170, in _format rep = self._repr(object, context, level) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 431, in _repr repr, readable, recursive = self.format(object, context.copy(), File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 444, in format return _safe_repr(object, context, maxlevels, level, self._sort_dicts) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 587, in _safe_repr orepr, oreadable, orecur = _safe_repr(o, context, maxlevels, level, sort_dicts) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/pprint.py", line 596, in _safe_repr rep = repr(object) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/geojson/base.py", line 27, in __repr__ return 
geojson.dumps(self, sort_keys=True) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/geojson/codec.py", line 31, in dumps return json.dumps(to_mapping(obj), File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/__init__.py", line 234, in dumps return cls( File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/encoder.py", line 199, in encode chunks = self.iterencode(o, _one_shot=True) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/encoder.py", line 257, in iterencode return _iterencode(o, 0) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/geojson/codec.py", line 14, in default return geojson.factory.GeoJSON.to_instance(obj) # NOQA File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/geojson/base.py", line 101, in to_instance mapping = to_mapping(ob) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/geojson/mapping.py", line 41, in to_mapping return json.loads(json.dumps(obj)) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/__init__.py", line 231, in dumps return _default_encoder.encode(obj) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/encoder.py", line 199, in encode chunks = self.iterencode(o, _one_shot=True) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/encoder.py", line 257, in iterencode return _iterencode(o, 0) File "/Users/ssharma/miniconda-23.1.0/envs/emission/lib/python3.9/json/encoder.py", line 179, in default raise TypeError(f'Object of type {o.__class__.__name__} ' TypeError: Object of type ObjectId is not JSON serializable

The testResetTwiceHack test in TestPipelineReset.py is now trying to print a JSON array that used to be empty, which is why it is now giving us a serialization error.

@shankari
Contributor Author

shankari commented Feb 23, 2023

this error is because of this line

self.assertEqual(api_result, [])

Before @swastis10's changes, the api_result was indeed [], so the assertion succeeded and nothing was printed out.
With @swastis10's changes, the api_result is not [], so the assertion fails. Python then tries to print out the non-empty array and fails again, because the array has been read from the database and contains UUID and ObjectId objects, which cannot be JSON serialized.

The fix for this is to use the default parameter of json.dumps. However, the error message from the assertEqual failure is generated automatically. So how do we make it use default while serializing for printing, so that when it fails, it fails with a more intelligible message that we can debug?

The answer is to use the msg parameter to assertEqual. You would do something like

self.assertEqual(api_result, [], msg=f"api_result={json.dumps(api_result, default=bju.default)}, expected []")
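A minimal, standard-library-only sketch of how such a message could be built. `fallback_default` is a hypothetical stand-in for `bju.default` (it only handles `uuid.UUID`), and `describe` and `failure_message` are illustrative names, not part of the server code. The sketch uses plain dicts, so it does not exercise the geojson `__repr__` path seen in the traceback above.

```python
import json
import unittest
import uuid


def fallback_default(obj):
    """Hypothetical stand-in for bson.json_util.default (handles only UUIDs)."""
    if isinstance(obj, uuid.UUID):
        return {"$uuid": obj.hex}
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")


def describe(api_result):
    # json.dumps calls `default` for every object it cannot encode natively
    return f"api_result={json.dumps(api_result, default=fallback_default)}, expected []"


def failure_message(api_result):
    """Run the assertion and return the failure text, or None if it passes."""
    case = unittest.TestCase()
    try:
        case.assertEqual(api_result, [], msg=describe(api_result))
    except AssertionError as e:
        return str(e)
    return None


if __name__ == "__main__":
    sample = [{"user_id": uuid.UUID("aa9fdec9-2944-446c-8ee2-50d79b3044d3")}]
    print(failure_message(sample))
```

With unittest's default settings the custom message is appended to the standard one, so the serialized result shows up alongside the usual diff.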

@swastis10

FAIL: testNormalizeWithACursor (emission.tests.pipelineTests.TestPipelineReset.TestPipelineReset)
Traceback (most recent call last):
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/pipelineTests/TestPipelineReset.py", line 788, in testNormalizeWithACursor
    self.assertEqual(len(df_from_cursor), len(invalid_states_mixed) - 1)
AssertionError: 8 != 7

This error is originating from TestPipelineReset.py

        # This is actually incorrect because we saved everything, so we should read back everything
        # but it is the current behavior, so let's flag if it changes
        self.assertEqual(len(df_from_cursor), len(invalid_states_mixed) - 1)
        # This is the expected behavior in all cases, but let's make sure that it stays as we move through versions of pandas
        self.assertEqual(len(df_from_list), len(invalid_states_mixed))

I understand this error is occurring due to the pandas upgrade. Previously, df_from_cursor had one fewer record than we had saved into the document db. This behaviour has now been fixed in pandas. Therefore, if I do:

        self.assertEqual(len(df_from_cursor), len(invalid_states_mixed))

The test case passes, because the cursor now returns everything we saved in the database.

@swastis10

FAIL: testFilterAccuracyWithPartialFiltered (emission.tests.analysisTests.intakeTests.TestFilterAccuracy.TestFilterAccuracy)
Traceback (most recent call last):
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/analysisTests/intakeTests/TestFilterAccuracy.py", line 161, in testFilterAccuracyWithPartialFiltered
    self.assertEqual(len(filtered_points_df), 703)
AssertionError: 203 != 703

This is happening due to a change in the pandas merge functionality in the newer version.
Now, we can only pass the argument "on" OR "left_index" and "right_index", not a combination of both. Therefore, we need to change the implementation of merge in filter_accuracy.py L:103

swastis10 added a commit to swastis10/e-mission-server that referenced this issue Mar 6, 2023
We are using 'python_legacy' instead of 'standard' UUID representation here because standard UUID representation should be used by new applications or applications that are encoding and/or decoding UUIDs in MongoDB for the first time. Python_legacy should be used by already existing applications using python drivers to encode and/or decode.

Ours is a python application that has already been storing data in databases for many months. It is NOT "encoding or decoding for the first time". If we used standard, although the unit tests might run, it would break in production.

Contains fix for the issue:
e-mission/e-mission-docs#856 (comment)
swastis10 added a commit to swastis10/e-mission-server that referenced this issue Mar 6, 2023
For more details- jazzband/geojson#135
It should be removed in the next server update, by which time hopefully the new geojson version will incorporate the long-term fix
See - jazzband/geojson#177
Contains fix for the issue:
e-mission/e-mission-docs#856 (comment)
@swastis10

it should be removed in the next server update, by which time hopefully the new geojson version will incorporate the long-term fix jazzband/geojson#177

Yes, @shankari, it does. While debugging and comparing the issue with master, I noticed that many coordinates were being truncated to 7 decimal places, whereas in master we had up to 12 significant places. I bumped the precision up to 15 places, which fixed the issue.
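To illustrate why the precision setting matters, here is a rough sketch that rounds a coordinate pair the way a precision limit would. `round_coords` is a hypothetical helper written for this comment, not the geojson library's API, and the sample coordinates are made up; the library's actual rounding logic may differ.

```python
def round_coords(coords, precision):
    """Hypothetical helper: round every coordinate to `precision` decimal places."""
    return [round(c, precision) for c in coords]


# Made-up coordinates with more than 7 decimal places
original = [-122.0866051234567, 37.3910011987654]

truncated = round_coords(original, 7)    # digits beyond the 7th place are lost
preserved = round_coords(original, 15)   # enough places to keep the value intact

print(truncated)
print(preserved == original)
```

Any comparison against values stored at full precision will fail once the extra digits are dropped, which matches the symptom described above.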

swastis10 added a commit to swastis10/e-mission-server that referenced this issue Mar 6, 2023
We can only pass argument "on" OR "left_index" and "right_index", not a combination of both in pd.merge()
Fix for the issue:
e-mission/e-mission-docs#856 (comment)
@swastis10

FAIL: testIncrementalRun (emission.tests.modellingTests.TestRunGreedyIncrementalModel.TestRunGreedyModel)
incremental trip models train from Confirmedtrip entries at most
Traceback (most recent call last):
  File "/Users/ssharma/Desktop/GIT/e-mission-server/emission/tests/modellingTests/TestRunGreedyIncrementalModel.py", line 63, in setUp
    self.fail()
AssertionError: None

Failure reason: In this test case, we bulk insert data into the db and then try to read it back. For some reason, we are not able to find the written entries in the db.

In builtin_timeseries.py, when we use _get_entries_for_timeseries method to fetch the data:

        if key_list is None or len(key_list) > 0:
            ts_query = self._get_query(key_list, time_query, geo_query,
                                extra_query_list)
            ts_db_cursor = tsdb.find(ts_query)
            ts_db_count = tsdb.count_documents(ts_query)

where ts_query = {'user_id': UUID('aa9fdec9-2944-446c-8ee2-50d79b3044d3'), '$or': [{'metadata.key': 'analysis/confirmed_trip'}]}

The query still returns nothing (ts_db_count is 0) even though I can see 6 entries for 'metadata.key': 'analysis/confirmed_trip' in the database.

If I write

        if key_list is None or len(key_list) > 0:
            ts_query = self._get_query(key_list, time_query, geo_query,
                                extra_query_list)
            ts_db_cursor = tsdb.find({'metadata.key': 'analysis/confirmed_trip'})
            ts_db_count = tsdb.count_documents(ts_query)

The test case passes.

@shankari upon going through the logs I found a discrepancy:

2023-03-06 14:49:45,321:DEBUG:curr_query = {'user_id': UUID('aa9fdec9-2944-446c-8ee2-50d79b3044d3'), '$or': [{'metadata.key': 'analysis/confirmed_trip'}]}, sort_key = metadata.write_ts
2023-03-06 14:49:45,321:DEBUG:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/confirmed_trip']
2023-03-06 14:49:45,321:DEBUG:finished querying values for [], count = 0
with tsquery 0
with user_id field 0
without user_id field 6

The number of results with and without the user_id field is different.

@shankari
Contributor Author

shankari commented Mar 7, 2023

@swastis10 that's weird. Again, can I see the code that generated those log statements?
I know we wrote it together, but it is easier if I can see it while I'm looking at the output.
Also, can you print out the confirmed trips that you get when you query without the user_id field and see what their user_ids are?

@swastis10

@shankari
Code to generate the log statements:
print(f"with tsquery {len(list(tsdb.find(ts_query)))}")
print(f"with user_id field {len(list(tsdb.find({'user_id': uuid.UUID('aa9fdec9-2944-446c-8ee2-50d79b3044d3'), '$or': [{'metadata.key': 'analysis/confirmed_trip'}]})))}")
print(f"without user_id field {len(list(tsdb.find({'$or': [{'metadata.key': 'analysis/confirmed_trip'}]})))}")

Logs:
with tsquery 0 with user_id field 0 without user_id field 6 2023-03-07 09:49:42,740:DEBUG:finished querying values for ['analysis/confirmed_trip'], count = 6 ENTRIES [{'_id': ObjectId('5fd8e69ac61669a9ebad0241'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.276295, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.276295-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466437275.856, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 41, 'second': 15, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T08:41:15.856000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0826931, 37.3914184]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d349789'), 'start_ts': 1466436483.395, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 28, 'second': 3, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T08:28:03.395000-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0857861, 37.3898049]}, 'duration': 792.4609999656677, 'distance': 1047.1630675866315, 'start_place': ObjectId('5fd8e664baff4ef23d349860'), 'end_place': ObjectId('5fd8e664baff4ef23d349861'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497af'), 'user_input': {'mode_confirm': 'walk', 'purpose_confirm': 'library'}}}, {'_id': ObjectId('5fd8e69ac61669a9ebad0242'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.488737, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 
'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.488737-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466438022.959, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 53, 'second': 42, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T08:53:42.959000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0866181, 37.3910231]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d34978b'), 'start_ts': 1466437438.6453953, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 43, 'second': 58, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T08:43:58.645395-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0826931, 37.3914184]}, 'duration': 584.3136048316956, 'distance': 886.4937093667857, 'start_place': ObjectId('5fd8e664baff4ef23d349861'), 'end_place': ObjectId('5fd8e664baff4ef23d349862'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497ce'), 'user_input': {}}}, {'_id': ObjectId('5fd8e69ac61669a9ebad0243'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.7204192, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.720419-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466461966.379, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 32, 'second': 46, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T15:32:46.379000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0830016, 37.3901637]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d34978d'), 'start_ts': 1466461623.1195338, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 27, 'second': 3, 
'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T15:27:03.119534-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0866181, 37.3910231]}, 'duration': 343.25946617126465, 'distance': 610.2234223038181, 'start_place': ObjectId('5fd8e664baff4ef23d349862'), 'end_place': ObjectId('5fd8e664baff4ef23d349863'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497e6'), 'user_input': {}}}, {'_id': ObjectId('5fd8e69ac61669a9ebad0244'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.942955, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.942955-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466462452.708, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 40, 'second': 52, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T15:40:52.708000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.086605, 37.3910011]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d34978f'), 'start_ts': 1466462052.158904, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 34, 'second': 12, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T15:34:12.158904-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0830016, 37.3901637]}, 'duration': 400.54909586906433, 'distance': 405.97685486691756, 'start_place': ObjectId('5fd8e664baff4ef23d349863'), 'end_place': ObjectId('5fd8e664baff4ef23d349864'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497f6'), 'user_input': {'mode_confirm': 'walk', 'purpose_confirm': 'home'}}}, {'_id': ObjectId('5fd8e69ac61669a9ebad0245'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 
'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050276.1554408, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 56, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:56.155441-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466463835.713, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 16, 'minute': 3, 'second': 55, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T16:03:55.713000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.1081974, 37.4168828]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d349791'), 'start_ts': 1466462970.2807262, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 49, 'second': 30, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T15:49:30.280726-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.086605, 37.3910011]}, 'duration': 865.4322738647461, 'distance': 4521.417177464177, 'start_place': ObjectId('5fd8e664baff4ef23d349864'), 'end_place': ObjectId('5fd8e664baff4ef23d349865'), 'cleaned_trip': ObjectId('5fd8e664baff4ef23d349808'), 'user_input': {'mode_confirm': 'shared_ride', 'purpose_confirm': 'karate'}}}, {'_id': ObjectId('5fd8e69ac61669a9ebad0246'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050276.546149, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 56, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:56.546149-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466467959.767, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 17, 'minute': 12, 'second': 39, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 
'end_fmt_time': '2016-06-20T17:12:39.767000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0864051, 37.3907649]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d349795'), 'start_ts': 1466466584.0461695, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 16, 'minute': 49, 'second': 44, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T16:49:44.046170-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.1081974, 37.4168828]}, 'duration': 1375.7208304405212, 'distance': 5136.824369981995, 'start_place': ObjectId('5fd8e664baff4ef23d349865'), 'end_place': ObjectId('5fd8e664baff4ef23d349867'), 'cleaned_trip': ObjectId('5fd8e664baff4ef23d34982c'), 'user_input': {'mode_confirm': 'shared_ride', 'purpose_confirm': 'home'}}}]

@shankari
Contributor Author

shankari commented Mar 7, 2023

@swastis10 the reason is right there in the logs 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4). You have forgotten python_legacy either during read or during write.

@swastis10

swastis10 commented Mar 7, 2023

@swastis10 the reason is right there in the logs 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4). You have forgotten python_legacy either during read or during write.

@shankari
When we load data from a test file and put those entries into the database, I see the user_id values stored as Binary, whereas the query uses a native UUID for user_id rather than a Binary value.

Potential reason - bson.json_util deserializes the $uuid as Binary subtype 4 and not subtype 3.

Solution - I can try to deserialize '$uuid' in the document using UUID representation.

bson.json_util.LEGACY_JSON_OPTIONS: bson.json_util.JSONOptions = JSONOptions(strict_number_long=False, datetime_representation=0, strict_uuid=False, json_mode=0, document_class=dict, tz_aware=False, uuid_representation=UuidRepresentation.UNSPECIFIED, unicode_decode_error_handler='strict', tzinfo=None, type_registry=TypeRegistry(type_codecs=[], fallback_encoder=None), datetime_conversion=DatetimeConversion.DATETIME)
JSONOptions for encoding to PyMongo’s legacy JSON format.
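The mismatch described above can be sketched with the standard library alone. `FakeBinary` is a hypothetical stand-in for `bson.Binary` (raw bytes plus a subtype), and the sketch assumes the query side encodes user_id with python_legacy (subtype 3) while the entries loaded from the file carry subtype 4, which is my reading of the logs rather than something verified against the server code.

```python
import uuid
from typing import NamedTuple

# Binary subtype numbers from the BSON specification
OLD_UUID_SUBTYPE = 3   # used by the python_legacy representation
UUID_SUBTYPE = 4       # used by the standard representation


class FakeBinary(NamedTuple):
    """Hypothetical stand-in for bson.Binary: raw bytes plus a subtype."""
    data: bytes
    subtype: int


def encode_uuid(value: uuid.UUID, subtype: int) -> FakeBinary:
    # Both representations discussed here use the same byte order (uuid.bytes)
    return FakeBinary(value.bytes, subtype)


user_id = uuid.UUID("aa9fdec9-2944-446c-8ee2-50d79b3044d3")

# The value shown in the logs for the entries loaded from the test file
stored = FakeBinary(b"\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3", 4)

as_standard = encode_uuid(user_id, UUID_SUBTYPE)
as_legacy = encode_uuid(user_id, OLD_UUID_SUBTYPE)

print(stored == as_standard)           # True: same bytes, same subtype
print(stored == as_legacy)             # False: same bytes, different subtype
print(stored.data == as_legacy.data)   # True: only the subtype differs
```

The bytes are identical in both cases, so a query that differs only in subtype would find nothing, which is consistent with the 0 versus 6 counts above.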

Code Snippet-
# load in trips from a test file source
input_file = 'emission/tests/data/real_examples/shankari_2016-06-20.expected_confirmed_trips'
with open(input_file, 'r') as f:
    trips_json = json.loads(f.read(), object_hook=bju.object_hook)
print("TRIP_JSON", trips_json)

Logs-
[{'_id': ObjectId('5fd8e69ac61669a9ebad0241'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.276295, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.276295-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466437275.856, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 41, 'second': 15, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T08:41:15.856000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0826931, 37.3914184]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d349789'), 'start_ts': 1466436483.395, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 28, 'second': 3, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T08:28:03.395000-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0857861, 37.3898049]}, 'duration': 792.4609999656677, 'distance': 1047.1630675866315, 'start_place': ObjectId('5fd8e664baff4ef23d349860'), 'end_place': ObjectId('5fd8e664baff4ef23d349861'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497af'), 'user_input': {'mode_confirm': 'walk', 'purpose_confirm': 'library'}}}, {'_id': ObjectId('5fd8e69ac61669a9ebad0242'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.488737, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.488737-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466438022.959, 'end_local_dt': {'year': 
2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 53, 'second': 42, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T08:53:42.959000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0866181, 37.3910231]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d34978b'), 'start_ts': 1466437438.6453953, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 43, 'second': 58, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T08:43:58.645395-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0826931, 37.3914184]}, 'duration': 584.3136048316956, 'distance': 886.4937093667857, 'start_place': ObjectId('5fd8e664baff4ef23d349861'), 'end_place': ObjectId('5fd8e664baff4ef23d349862'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497ce'), 'user_input': {}}}, {'_id': ObjectId('5fd8e69ac61669a9ebad0243'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.7204192, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.720419-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466461966.379, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 32, 'second': 46, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T15:32:46.379000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0830016, 37.3901637]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d34978d'), 'start_ts': 1466461623.1195338, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 27, 'second': 3, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T15:27:03.119534-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0866181, 37.3910231]}, 
'duration': 343.25946617126465, 'distance': 610.2234223038181, 'start_place': ObjectId('5fd8e664baff4ef23d349862'), 'end_place': ObjectId('5fd8e664baff4ef23d349863'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497e6'), 'user_input': {}}}, {'_id': ObjectId('5fd8e69ac61669a9ebad0244'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.942955, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.942955-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466462452.708, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 40, 'second': 52, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T15:40:52.708000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.086605, 37.3910011]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d34978f'), 'start_ts': 1466462052.158904, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 34, 'second': 12, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T15:34:12.158904-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0830016, 37.3901637]}, 'duration': 400.54909586906433, 'distance': 405.97685486691756, 'start_place': ObjectId('5fd8e664baff4ef23d349863'), 'end_place': ObjectId('5fd8e664baff4ef23d349864'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497f6'), 'user_input': {'mode_confirm': 'walk', 'purpose_confirm': 'home'}}}, {'_id': ObjectId('5fd8e69ac61669a9ebad0245'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050276.1554408, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 
'hour': 8, 'minute': 37, 'second': 56, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:56.155441-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466463835.713, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 16, 'minute': 3, 'second': 55, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T16:03:55.713000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.1081974, 37.4168828]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d349791'), 'start_ts': 1466462970.2807262, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 49, 'second': 30, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T15:49:30.280726-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.086605, 37.3910011]}, 'duration': 865.4322738647461, 'distance': 4521.417177464177, 'start_place': ObjectId('5fd8e664baff4ef23d349864'), 'end_place': ObjectId('5fd8e664baff4ef23d349865'), 'cleaned_trip': ObjectId('5fd8e664baff4ef23d349808'), 'user_input': {'mode_confirm': 'shared_ride', 'purpose_confirm': 'karate'}}}, {'_id': ObjectId('5fd8e69ac61669a9ebad0246'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050276.546149, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 56, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:56.546149-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466467959.767, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 17, 'minute': 12, 'second': 39, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T17:12:39.767000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0864051, 37.3907649]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d349795'), 
'start_ts': 1466466584.0461695, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 16, 'minute': 49, 'second': 44, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T16:49:44.046170-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.1081974, 37.4168828]}, 'duration': 1375.7208304405212, 'distance': 5136.824369981995, 'start_place': ObjectId('5fd8e664baff4ef23d349865'), 'end_place': ObjectId('5fd8e664baff4ef23d349867'), 'cleaned_trip': ObjectId('5fd8e664baff4ef23d34982c'), 'user_input': {'mode_confirm': 'shared_ride', 'purpose_confirm': 'home'}}}] 2023-03-07 14:53:32,235:DEBUG:loaded 6 trips from emission/tests/data/real_examples/shankari_2016-06-20.expected_confirmed_trips GLIST [Entry({'_id': ObjectId('5fd8e69ac61669a9ebad0241'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.276295, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.276295-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466437275.856, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 41, 'second': 15, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T08:41:15.856000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0826931, 37.3914184]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d349789'), 'start_ts': 1466436483.395, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 28, 'second': 3, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T08:28:03.395000-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0857861, 37.3898049]}, 'duration': 792.4609999656677, 'distance': 1047.1630675866315, 'start_place': 
ObjectId('5fd8e664baff4ef23d349860'), 'end_place': ObjectId('5fd8e664baff4ef23d349861'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497af'), 'user_input': {'mode_confirm': 'walk', 'purpose_confirm': 'library'}}}), Entry({'_id': ObjectId('5fd8e69ac61669a9ebad0242'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.488737, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.488737-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466438022.959, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 53, 'second': 42, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T08:53:42.959000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0866181, 37.3910231]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d34978b'), 'start_ts': 1466437438.6453953, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 8, 'minute': 43, 'second': 58, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T08:43:58.645395-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0826931, 37.3914184]}, 'duration': 584.3136048316956, 'distance': 886.4937093667857, 'start_place': ObjectId('5fd8e664baff4ef23d349861'), 'end_place': ObjectId('5fd8e664baff4ef23d349862'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497ce'), 'user_input': {}}}), Entry({'_id': ObjectId('5fd8e69ac61669a9ebad0243'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.7204192, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 
'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.720419-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466461966.379, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 32, 'second': 46, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T15:32:46.379000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0830016, 37.3901637]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d34978d'), 'start_ts': 1466461623.1195338, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 27, 'second': 3, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T15:27:03.119534-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0866181, 37.3910231]}, 'duration': 343.25946617126465, 'distance': 610.2234223038181, 'start_place': ObjectId('5fd8e664baff4ef23d349862'), 'end_place': ObjectId('5fd8e664baff4ef23d349863'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497e6'), 'user_input': {}}}), Entry({'_id': ObjectId('5fd8e69ac61669a9ebad0244'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050275.942955, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 55, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:55.942955-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466462452.708, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 40, 'second': 52, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T15:40:52.708000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.086605, 37.3910011]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d34978f'), 'start_ts': 1466462052.158904, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 34, 'second': 
12, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T15:34:12.158904-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.0830016, 37.3901637]}, 'duration': 400.54909586906433, 'distance': 405.97685486691756, 'start_place': ObjectId('5fd8e664baff4ef23d349863'), 'end_place': ObjectId('5fd8e664baff4ef23d349864'), 'cleaned_trip': ObjectId('5fd8e663baff4ef23d3497f6'), 'user_input': {'mode_confirm': 'walk', 'purpose_confirm': 'home'}}}), Entry({'_id': ObjectId('5fd8e69ac61669a9ebad0245'), 'user_id': Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050276.1554408, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 56, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:56.155441-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466463835.713, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 16, 'minute': 3, 'second': 55, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T16:03:55.713000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.1081974, 37.4168828]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d349791'), 'start_ts': 1466462970.2807262, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 15, 'minute': 49, 'second': 30, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T15:49:30.280726-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.086605, 37.3910011]}, 'duration': 865.4322738647461, 'distance': 4521.417177464177, 'start_place': ObjectId('5fd8e664baff4ef23d349864'), 'end_place': ObjectId('5fd8e664baff4ef23d349865'), 'cleaned_trip': ObjectId('5fd8e664baff4ef23d349808'), 'user_input': {'mode_confirm': 'shared_ride', 'purpose_confirm': 'karate'}}}), Entry({'_id': ObjectId('5fd8e69ac61669a9ebad0246'), 'user_id': 
Binary(b'\xaa\x9f\xde\xc9)DDl\x8e\xe2P\xd7\x9b0D\xd3', 4), 'metadata': {'key': 'analysis/confirmed_trip', 'platform': 'server', 'write_ts': 1608050276.546149, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2020, 'month': 12, 'day': 15, 'hour': 8, 'minute': 37, 'second': 56, 'weekday': 1, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2020-12-15T08:37:56.546149-08:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1466467959.767, 'end_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 17, 'minute': 12, 'second': 39, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2016-06-20T17:12:39.767000-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.0864051, 37.3907649]}, 'raw_trip': ObjectId('5fd8e662baff4ef23d349795'), 'start_ts': 1466466584.0461695, 'start_local_dt': {'year': 2016, 'month': 6, 'day': 20, 'hour': 16, 'minute': 49, 'second': 44, 'weekday': 0, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2016-06-20T16:49:44.046170-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.1081974, 37.4168828]}, 'duration': 1375.7208304405212, 'distance': 5136.824369981995, 'start_place': ObjectId('5fd8e664baff4ef23d349865'), 'end_place': ObjectId('5fd8e664baff4ef23d349867'), 'cleaned_trip': ObjectId('5fd8e664baff4ef23d34982c'), 'user_input': {'mode_confirm': 'shared_ride', 'purpose_confirm': 'home'}}})]

swastis10 added a commit to swastis10/e-mission-server that referenced this issue Mar 8, 2023
BSON binary subtype 3 is a legacy UUID format. By default, JSON is deserialized as subtype 4. With this fix, it will deserialize $uuid as binary subtype 3

Fix for - e-mission/e-mission-docs#856 (comment)
@shankari
Contributor Author

shankari commented Mar 9, 2023

@swastis10 this sounds like a good reason and solution. Have you tried it yet?

Potential reason - bson.json_util serializes the $uuid as subtype 4 (Binary) and not subtype 3.
Solution - I can try to deserialize '$uuid' in the document using UUID representation.

Also, if this is the problem, then why aren't we running into similar issues in
emission/tests/analysisTests/intakeTests/TestPipelineRealData.py

It loads files from emission/tests/data/real_examples which also has the same format -
e.g. emission/tests/data/real_examples/shankari_2016-01-12 has

        "_id": {
            "$oid": "56952dd85771abda98a1df92"
        },
        "user_id": {
            "$uuid": "0763de67f61e3f5d90e7518e69793954"
        },

@swastis10

TestPipelineRealData.py

@shankari
Yes! This change fixes the issue.

I took a look at emission/tests/analysisTests/intakeTests/TestPipelineRealData.py and found that the loaded data does contain Binary UUIDs, but the test cases do not break because we never compare user_id anywhere. What we ultimately compare is

ground_truth = json.load(gfp, object_hook=bju.object_hook)
print("GROUND TRUTH", ad.AttrDict(ground_truth).data)

self.compare_result(ad.AttrDict({'result': api_result}).result,
                    ad.AttrDict(ground_truth).data)

where ground_truth.data does not contain any user_id. The pipelines are run using self.testUUID, which contains a Binary subtype 3 UUID.

Potential fix: to avoid this confusion between Binary subtypes 3 and 4 in the future, in emission/tests/analysisTests/intakeTests/TestPipelineRealData.py I can replace

ground_truth = json.load(gfp, object_hook=bju.object_hook)

with

trips_json = bju.loads(f.read(), json_options=bju.LEGACY_JSON_OPTIONS.with_options(strict_uuid=False, uuid_representation=UuidRepresentation.PYTHON_LEGACY))

@shankari
Contributor Author

shankari commented Mar 9, 2023

@swastis10 yes, please make the change as well
I'll review after that
All tests are finally passing!

swastis10 added a commit to swastis10/e-mission-server that referenced this issue Mar 9, 2023
BSON binary subtype 3 is a legacy UUID format. By default, JSON is deserialized as subtype 4. With this fix, it will deserialize $uuid as binary subtype 3
See Issue - e-mission/e-mission-docs#856 (comment)
@shankari
Contributor Author

While deploying in the NREL environment, DB connection fails with

/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pymongo/common.py:813: UserWarning: Unknown option ssl_ca_certs

pymongo.errors.ServerSelectionTimeoutError: ... [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129), Timeout: 30s,>

@shankari shankari reopened this Mar 10, 2023
@shankari
Contributor Author

shankari commented Mar 10, 2023

Fixed by passing the CA bundle through the tlsCAFile option instead of ssl_ca_certs, which PyMongo 4.x no longer recognizes.
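
As a rough illustration of the rename, the sketch below rewrites PyMongo 3.x style SSL keyword names to their PyMongo 4.x TLS equivalents. The helper `translate_tls_options`, the mapping table, and the certificate path are all hypothetical and are not part of the e-mission code. Only three renames are listed here, so the table is not a complete migration list.

```python
# Hypothetical mapping from removed PyMongo 3.x option names to the
# PyMongo 4.x names that replace them (not exhaustive).
LEGACY_TO_TLS = {
    "ssl_ca_certs": "tlsCAFile",
    "ssl_crlfile": "tlsCRLFile",
    "ssl_pem_passphrase": "tlsCertificateKeyFilePassword",
}


def translate_tls_options(options):
    """Return a copy of `options` with legacy SSL keys renamed.

    Keys that are not in the mapping are passed through unchanged.
    """
    return {LEGACY_TO_TLS.get(key, key): value
            for key, value in options.items()}


# Placeholder path; a real deployment would point at its own CA bundle.
old_options = {"ssl_ca_certs": "/path/to/ca-bundle.pem", "retryWrites": True}
new_options = translate_tls_options(old_options)
print(new_options)
```

The translated dictionary could then be passed as keyword arguments to `pymongo.MongoClient`, for example `MongoClient(url, **new_options)`.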

Next error is

Traceback (most recent call last):
File "/usr/src/app/emission/net/api/bottle.py", line 997, in _handle
out = route.call(**args)
File "/usr/src/app/emission/net/api/bottle.py", line 2004, in wrapper
json_response = dumps(rv)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/json_util.py", line 438, in dumps
return json.dumps(_json_convert(obj, json_options), *args, **kwargs)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/json_util.py", line 477, in _json_convert
return SON(((k, _json_convert(v, json_options)) for k, v in obj.items()))
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/son.py", line 65, in __init__
self.update(data)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/son.py", line 156, in update
for k, v in other:
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/json_util.py", line 477, in <genexpr>
return SON(((k, _json_convert(v, json_options)) for k, v in obj.items()))
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/json_util.py", line 479, in _json_convert
return list((_json_convert(v, json_options) for v in obj))
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/json_util.py", line 479, in <genexpr>
return list((_json_convert(v, json_options) for v in obj))
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/json_util.py", line 477, in _json_convert
return SON(((k, _json_convert(v, json_options)) for k, v in obj.items()))
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/son.py", line 65, in __init__
self.update(data)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/son.py", line 156, in update
for k, v in other:
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/json_util.py", line 477, in <genexpr>
return SON(((k, _json_convert(v, json_options)) for k, v in obj.items()))
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/json_util.py", line 481, in _json_convert
return default(obj, json_options)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/json_util.py", line 881, in default
binval = Binary.from_uuid(obj, uuid_representation=json_options.uuid_representation)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/bson/binary.py", line 267, in from_uuid
raise ValueError(
ValueError: cannot encode native uuid.UUID with UuidRepresentation.UNSPECIFIED. UUIDs can be manually converted to bson.Binary instances using bson.Binary.from_uuid() or a different UuidRepresentation can be configured. See the documentation for UuidRepresentation for more information.

@swastis10 can you fix this? You should test by connecting the phone and server and running it end to end

@shankari
Contributor Author

here's a related issue on the bottle repo:
bottlepy/bottle#287

you should also be able to just test using curl

swastis10 added a commit to swastis10/e-mission-server that referenced this issue Mar 11, 2023
We cannot encode native uuid.UUID with UuidRepresentation.UNSPECIFIED therefore defining uuid_representation as PYTHON_LEGACY fixes the issue
Contains fix for - e-mission/e-mission-docs#856 (comment)
swastis10 added a commit to swastis10/e-mission-server that referenced this issue Mar 11, 2023
We cannot encode native uuid.UUID with UuidRepresentation.UNSPECIFIED therefore defining uuid_representation as PYTHON_LEGACY fixes the issue.

Contains fix for - e-mission/e-mission-docs#856 (comment)
@shankari
Contributor Author

The server upgrade is done, closing issue

shankari added a commit to shankari/e-mission-server that referenced this issue Apr 13, 2023
shankari added a commit to shankari/e-mission-server that referenced this issue May 8, 2023
…sentation to LEGACY

- Created a simple class (`json_wrappers.py`) which encapsulates the correct
  settings that we need for serializing and deserializing the JSON
  representations with the correct UUID representation
- Changed all the references in the `emission` directory to use the new wrapper functions
    - Including cleaning up prior hardcoded attempts at cleanup, such as
        e-mission@3b456e7#diff-cfddece09bbf132974a13acdfb77be4e585a4bff39999f559dc8200c1ffbe78d

Additional Context:
- e-mission/e-mission-docs#856 (comment)
- https://pymongo.readthedocs.io/en/stable/examples/uuid.html
- e-mission@9c683cf
- e-mission@6ac02a2
- e-mission@edd8b77

As of 4.3.3, the LEGACY_JSON_OPTIONS also has an UNSPECIFIED UUID representation

>  bson.json_util.LEGACY_JSON_OPTIONS: bson.json_util.JSONOptions =
>  JSONOptions(strict_number_long=False, datetime_representation=0,
>  strict_uuid=False, json_mode=0, document_class=dict, tz_aware=False,
>  uuid_representation=UuidRepresentation.UNSPECIFIED,
>  unicode_decode_error_handler='strict', tzinfo=None,
>  type_registry=TypeRegistry(type_codecs=[], fallback_encoder=None),
>  datetime_conversion=DatetimeConversion.DATETIME)

Testing done:
- Ensured that there were no imports of json_utils
```
$ find emission -name \*.py | xargs grep json_utils
$
```

- Ensured that all `bju` prefixes were replaced, other than in the wrapper itself

```
$ find emission -name \*.py | xargs grep bju
emission/storage/json_wrappers.py:import bson.json_util as bju
emission/storage/json_wrappers.py:wrapped_object_hook = lambda s: bju.object_hook(s,
emission/storage/json_wrappers.py:    json_options = bju.RELAXED_JSON_OPTIONS.with_options(
emission/storage/json_wrappers.py:wrapped_default = lambda o: bju.default(o, json_options = bju.LEGACY_JSON_OPTIONS)
emission/storage/json_wrappers.py:wrapped_dumps = lambda o: bju.dumps(o, json_options = bju.LEGACY_JSON_OPTIONS.with_options(
emission/storage/json_wrappers.py:wrapped_loads = lambda s: bju.loads(s)
```

- Ensured that all `esj` imports used the `wrapped` version of the name

```
$ find emission -name \*.py | xargs grep esj | egrep -v "import|esj.wrapped" | wc -l
       0
```
shankari added a commit to shankari/e-mission-server that referenced this issue May 8, 2023
…sentation to LEGACY

Changed all the references in the `bin` scripts to also use the new wrappers to
serialize/deserialize UUID objects.

This is a follow-on to
e-mission@51a6881
which also has the additional context

Additional Context:
- e-mission/e-mission-docs#856 (comment)
- https://pymongo.readthedocs.io/en/stable/examples/uuid.html
- 9c683cf
- 6ac02a2
- edd8b77

As of 4.3.3, the LEGACY_JSON_OPTIONS also has an UNSPECIFIED UUID representation

>  bson.json_util.LEGACY_JSON_OPTIONS: bson.json_util.JSONOptions =
>  JSONOptions(strict_number_long=False, datetime_representation=0,
>  strict_uuid=False, json_mode=0, document_class=dict, tz_aware=False,
>  uuid_representation=UuidRepresentation.UNSPECIFIED,
>  unicode_decode_error_handler='strict', tzinfo=None,
>  type_registry=TypeRegistry(type_codecs=[], fallback_encoder=None),
>  datetime_conversion=DatetimeConversion.DATETIME)

Testing done:
- Ensured that there were no imports of json_utils

```
$ find bin -name \*.py | xargs grep json_utils
$
```

- Ensured that all `bju` prefixes were replaced

```
$ find bin -name \*.py | xargs grep bju
$
```

- Ensured that all `esj` imports used the `wrapped` version of the name

```
$ find bin -name \*.py | xargs grep esj | egrep -v "import|esj.wrapped"
$
```