GAIA DR3 Catalog Beta 0.10 RC3
Delta Ori  [developer] 27 May @ 9:31am
Technical Documentation
Overview:

The conversion pipeline has ten major steps and nine scripts. Here I will go over each one in detail, describing what they do and how.



Step 1: TAP Pull Requests

The first step in the process is to actually collect the needed data from the DR3 tables, which come from four sources.

First is the main Gaia DR3 table[vizier.cds.unistra.fr]. This includes the Source ID (the common key linking all parts of the pipeline), coordinate information, apparent magnitude, proper motion, and RUWE. Of these, RUWE is the one that most people probably aren't familiar with. In very basic terms, it's a number that tells you how confident the Gaia team is that the astrometric solution for that star is accurate: the higher the number, the less confident they are.

The second important table is the estimated distance catalog[vizier.cds.unistra.fr] compiled by Bailer-Jones et al. The only value taken from this table is "rpgeo", which is the estimated(!) distance of each star in the catalog from Earth. The "estimated" part is important. These are not definite values, but with the information available in the data release, they're the most accurate it's going to get. Some stars in the catalog already have more accurate measurements available, but these numbers exclusively use info from Gaia rather than external sources, so their accuracy is limited.

The third table is the astrophysical parameters catalog[vizier.cds.unistra.fr] produced by the Apsis (Astrophysical Parameters Inference System) processing chain, which I won't be getting into; just know that it gives us helpful numbers. This is where most of the values used for classification come from, including effective temperature, radius, luminosity, mass, surface gravity, and metallicity. Of particular note is the SpType-ELS field, which gives a preliminary spectral classification for some stars.

The fourth table is the Apsis Supplemental[vizier.cds.unistra.fr] table, which gives essentially the same info as the main Apsis table, but calculated from a wider array of models rather than just the one the Gaia team chose to use in the main table.

These tables all crossmatch their Source IDs with the Bailer-Jones catalog, which lets me pull specific distance-range slices that I progressively extend outward.

To pull these tables, I use TOPCAT[www.star.bris.ac.uk] with the following three ADQL queries (note that these are formatted specifically for Vizier, which I prefer to use for its user-friendly UX):

Query 1 (main Gaia DR3 table + distances):

SELECT gaia.Source, gaia.RA_ICRS, gaia.DE_ICRS, gaia.Gmag, gaia.RUWE,
       gaia.pmRA, gaia.pmDE, bj.rpgeo
FROM "I/355/gaiadr3" AS gaia
JOIN "I/352/gedr3dis" AS bj ON gaia.Source = bj.Source
WHERE bj.rpgeo > [Starting Dist] AND bj.rpgeo < [Ending Dist]
  AND gaia.RA_ICRS IS NOT NULL AND gaia.DE_ICRS IS NOT NULL
  AND gaia.logg IS NOT NULL AND gaia.Gmag IS NOT NULL
  AND gaia.RUWE IS NOT NULL
  AND gaia.pmRA IS NOT NULL AND gaia.pmDE IS NOT NULL

Query 2 (Apsis astrophysical parameters):

SELECT bj.Source, bj.rpgeo, apsis.Teff, apsis.logg, apsis.Rad, apsis."[Fe/H]",
       apsis.GMAG, apsis."Rad-Flame", apsis."Lum-Flame", apsis."Mass-Flame",
       apsis."SpType-ELS", apsis."Teff-S", apsis."logg-S", apsis."[M/H]-S",
       apsis."Teff-HS", apsis."logg-HS"
FROM "I/352/gedr3dis" AS bj
JOIN "I/355/paramp" AS apsis ON bj.Source = apsis.Source
WHERE bj.rpgeo > [Starting Dist] AND bj.rpgeo < [Ending Dist]
  AND apsis.Teff IS NOT NULL AND apsis.logg IS NOT NULL
  AND apsis."Lum-Flame" IS NOT NULL
  AND apsis.Pstar > 0.5 AND apsis.Pbin < 0.01
  AND apsis.PGal < 0.01 AND apsis.PQSO < 0.01

Query 3 (Apsis Supplemental):

SELECT bj.Source, bj.rpgeo, apsup.Teff, apsup.logg, apsup."[M/H]", apsup.Rad,
       apsup."Teff-Phx", apsup."logg-Phx", apsup."Rad-Phx", apsup."[M/H]-Phx",
       apsup."Teff-OB", apsup."logg-OB", apsup."Rad-OB", apsup."[M/H]-OB",
       apsup."Teff-A", apsup."logg-A", apsup."Rad-A", apsup."GMAG-A",
       apsup.GMAG, apsup."GMAG-Phx", apsup."GMAG-OB",
       apsup."Teff-ANN", apsup."logg-ANN", apsup."[M/H]-ANN",
       apsup."Rad-Flame", apsup."Lum-Flame", apsup."Mass-Flame"
FROM "I/352/gedr3dis" AS bj
JOIN "I/355/paramsup" AS apsup ON bj.Source = apsup.Source
WHERE bj.rpgeo > [Starting Dist] AND bj.rpgeo < [Ending Dist]
  AND ( apsup.Teff IS NOT NULL OR apsup."Teff-Phx" IS NOT NULL
        OR apsup."Teff-OB" IS NOT NULL OR apsup."Teff-ANN" IS NOT NULL )
  AND ( apsup."logg" IS NOT NULL OR apsup."logg-Phx" IS NOT NULL
        OR apsup."logg-OB" IS NOT NULL OR apsup."logg-ANN" IS NOT NULL )
  AND ( apsup.Rad IS NOT NULL OR apsup."Rad-Phx" IS NOT NULL
        OR apsup."Rad-OB" IS NOT NULL OR apsup."Rad-Flame" IS NOT NULL )
  AND ( apsup.GMAG IS NOT NULL OR apsup."GMAG-Phx" IS NOT NULL
        OR apsup."GMAG-OB" IS NOT NULL OR apsup."GMAG-A" IS NOT NULL )
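The [Starting Dist] and [Ending Dist] placeholders get swapped out per range slice before each pull. As a minimal sketch of that substitution (the slice width and the truncated template here are illustrative, not the exact values I use):

# Hypothetical sketch of filling in the distance placeholders for one slice.
# The 25 pc default width and the truncated template are illustrative only.
QUERY_TEMPLATE = """
SELECT gaia.Source, gaia.RA_ICRS, gaia.DE_ICRS, gaia.Gmag, gaia.RUWE,
       gaia.pmRA, gaia.pmDE, bj.rpgeo
FROM "I/355/gaiadr3" AS gaia
JOIN "I/352/gedr3dis" AS bj ON gaia.Source = bj.Source
WHERE bj.rpgeo > {start} AND bj.rpgeo < {end}
  -- (rest of the WHERE clause omitted for brevity)
"""

def build_slice_query(start_pc: float, width_pc: float = 25.0) -> str:
    """Return the main-table query for one distance slice, in parsecs."""
    return QUERY_TEMPLATE.format(start=start_pc, end=start_pc + width_pc)

print(build_slice_query(100.0))  # pulls the 100-125 pc slice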

Update as of 05/30/2025: Gaia Main query updated to include proper motion values.



Step 2: Merging, Calculating, Converting:
1_Merge_And_Convert_V15_134_119.py

This is the largest of the nine scripts used in the process, and the first one to require an additional import to allow for RAM usage tracking and worker auto-throttling to avoid crashing. Here's how it works:

  1. Load all of the .csv files downloaded by TOPCAT, and process them one triplet at a time, ignoring any in-progress range slices that don't have all three files available yet.
  2. Immediately remove anything with missing proper motion values.
  3. Immediately remove anything with a RUWE value higher than 1.2.
  4. Convert all coordinates to Epoch J2000 for better filtering.
  5. Out of the various models available, determine which one to take the used value from: first see whether any group of at least three models reaches a rough consensus, then check whether the Gaia team's chosen value from Apsis has anything close that supports it (since we assume they know what they're doing), and lastly fall back to a single model chosen based on the known strengths listed in its documentation.
  6. Use the selected values (or any that have only one source) to calculate any other stats useful for classifying a star, if any of those values are missing from the input tables.
  7. Apply bolometric correction to the absolute magnitude value.
  8. First pass: check whether the stars are valid via an extensive set of filters, and discard any with missing values or implausible combinations.
  9. Check whether SpType-ELS is present and/or makes sense; use it as a base if so, or calculate a base class if not.
  10. Use the available values to check whether a star fits within the limits, probable limits, and strict limits of each luminosity class, with each block a star passes awarding exponentially more points (a toy sketch follows after this list).
  11. Assign a luminosity class based on which class's limits block has the highest score.
  12. If any star returns physically impossible values, determine the outlier and re-run the star through classification, blocking the bad value from being used again.
  13. Use an extremely extensive set of logic filters to try to remove any stars that don't make sense.
  14. Merge the processed tables into one.
  15. Convert the updated table to the SpaceEngine bulk catalog format and export it.
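
To make the scoring idea in steps 10 and 11 concrete, here's a toy sketch. The limit values, the choice of logg as the only parameter, and the 4**tier weighting are all made up for the example; the real script checks far more parameters per block:

# Toy sketch of the limits-block scoring (steps 10-11 above).
# Limits, the single-parameter check, and the weighting are illustrative.
LUM_CLASS_LIMITS = {
    # class: logg ranges, loosest block first, strictest last
    "V":   [(3.5, 5.5), (4.0, 5.0), (4.2, 4.7)],
    "III": [(1.0, 3.8), (1.5, 3.5), (2.0, 3.0)],
    "I":   [(-1.0, 2.0), (0.0, 1.5), (0.2, 1.0)],
}

def classify_luminosity(logg: float) -> str:
    scores = {}
    for lum_class, blocks in LUM_CLASS_LIMITS.items():
        score = 0
        for tier, (lo, hi) in enumerate(blocks):
            if lo <= logg <= hi:
                score += 4 ** tier  # stricter blocks are worth exponentially more
        scores[lum_class] = score
    return max(scores, key=scores.get)  # highest-scoring class wins

print(classify_luminosity(4.4))  # -> "V" (main sequence)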

Update as of 05/28/2025: Added Epoch conversion.
Update as of 05/30/2025: Added Proper Motion filter.
Update as of 06/11/2025: Updated slice processing logic.
Update as of 06/21/2025: Added Recompute step for invalid stars.
Update as of 08/18/2025: Added Bolometric Correction step and Updated Classification steps.



Step 3: Compile Source IDs:
2_Extract_Source_IDs_V1_1_1.py

As it says on the box: this simply pulls all of the Gaia DR3 Source IDs from the stars in the merged catalog for later crossmatching.
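
The core of it amounts to little more than this (file and column names are illustrative):

# Essentially the whole job: read the merged catalog, dump the IDs.
# File names and the column label are illustrative, not the exact ones used.
import pandas as pd

merged = pd.read_csv("Merged_Catalog.csv", usecols=["Source"])
merged["Source"].drop_duplicates().to_csv("Source_IDs.csv", index=False)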



Step 4: Crossmatch HIP IDs:
3_Fetch_And_Link_HIP_Crossmatch_V4_3_9.py

Yet another simple one. In the main Gaia DR3 table from earlier, they include this handy little field that links each Source ID with a HIPPARCOS catalog ID if applicable. This script uses an astroquery TapPlus query to pull those (all of them; HIPPARCOS is actually kinda small compared to Gaia) and logs them in a separate file. Despite the comparatively small size, this is still over 100,000 rows, so I do it in chunks to avoid getting timed out by the servers.

SELECT TOP {CHUNK_SIZE} Source, HIP FROM "I/355/gaiadr3" WHERE HIP IS NOT NULL AND Source > {last_max_id}
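
A rough sketch of the chunked pull, assuming astroquery's standard TapPlus interface (the endpoint URL and chunk size are placeholders, and the ORDER BY is added here to make the keyset paging deterministic):

# Sketch of the chunked HIP crossmatch pull via astroquery's TapPlus.
# Endpoint URL and CHUNK_SIZE are placeholders.
import pandas as pd
from astroquery.utils.tap.core import TapPlus

CHUNK_SIZE = 10000
tap = TapPlus(url="https://tapvizier.cds.unistra.fr/TAPVizieR/tap")

last_max_id = 0
chunks = []
while True:
    query = (f'SELECT TOP {CHUNK_SIZE} Source, HIP FROM "I/355/gaiadr3" '
             f'WHERE HIP IS NOT NULL AND Source > {last_max_id} '
             f'ORDER BY Source')
    result = tap.launch_job(query).get_results()
    if len(result) == 0:
        break  # no rows left; every HIP-linked Source has been fetched
    chunks.append(result.to_pandas())
    last_max_id = int(result["Source"].max())

pd.concat(chunks).to_csv("HIP_Crossmatch.csv", index=False)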



Step 5: Extract SpaceEngine Catalogs:
4_Parse_Sc_Catalogs_V3_7_4.py

The next script is pretty simple: it just pulls the necessary fields from SpaceEngine's .sc files to build a combined .csv file in the bulk format, alongside any catalogs that already ship as .csv, like the HIPPARCOS.csv file.
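
In spirit, the extraction looks something like this toy version. It assumes a flat Key Value layout inside each Star "..." { ... } block; the real .sc format has more structure, and the field names here are just examples:

# Toy .sc field extractor. Assumes flat 'Key Value' pairs inside each
# Star block; the actual format is richer, and field names are examples.
import re

STAR_BLOCK = re.compile(r'Star\s+"([^"]+)"\s*\{([^}]*)\}', re.S)
FIELD = re.compile(r'^\s*(RA|Dec|Dist)\s+(\S+)', re.M)

def parse_sc(text: str):
    for name, body in STAR_BLOCK.findall(text):
        yield {"Name": name, **{k: float(v) for k, v in FIELD.findall(body)}}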



Step 6: Building A Treehouse KDTree:
5_Build_Reference_KDTree_V1_1_2.py

So... about that 16-hour filtering script run time (more on that in Step 7). Basically, I decided at that point to use a KDTree to cache anything that's already been filtered once so I don't have to do it again. This script converts every star in the SpaceEngine and addon catalogs to cartesian coordinates and maps them all into the tree, so that loading them doesn't take an entire day. As long as nothing in the folder changes, it's all good. If I were to modify anything, though, then the entirety of every catalog in the filtering reference folder would need to be re-scanned.
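
The core idea, sketched with scipy (column names and file handling simplified):

# Sketch of building the cached reference tree. Column names illustrative.
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def to_cartesian(ra_deg, dec_deg, dist_pc):
    """Convert RA/Dec (degrees) and distance (parsecs) to x, y, z."""
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.column_stack((dist_pc * np.cos(dec) * np.cos(ra),
                            dist_pc * np.cos(dec) * np.sin(ra),
                            dist_pc * np.sin(dec)))

ref = pd.read_csv("reference_catalog.csv")
tree = cKDTree(to_cartesian(ref["RA"], ref["Dec"], ref["Dist"]))
# cKDTree instances can be pickled, so the built tree can be cached to disk
# and reused as long as the reference folder hasn't changed.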



Step 7: Prefiltering:
6_Combined_Prefilter_V1_0_0.py

This one searches through that new catalog made from the extracted .sc files, as well as the Gaia DR2 addon catalog, to see if any of the Source IDs match, and logs the matches in a separate file. (It could use multiple DR2 catalogs, but more stars to filter out increases the run time of a later filtering script, to the point that it once took sixteen hours to complete, so I decided to just use the 1M catalog.) This used to be its own script, but it got rolled into this one recently.

Since we now have all these nifty little files lying around containing crossmatched IDs, it's a super basic task to just remove all of them from the main merged catalog file before filtering for duplicates.
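
In pandas terms, that removal boils down to something like this (file and column names are illustrative):

# The ID removal, in essence. File and column names are illustrative.
import pandas as pd

merged = pd.read_csv("Merged_Catalog.csv")
matched_ids = pd.read_csv("Crossmatched_IDs.csv")["Source"]
merged[~merged["Source"].isin(matched_ids)].to_csv("Prefiltered.csv",
                                                   index=False)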



Step 8: Manual Sky Filtering:
7_Filter_XMatch_V1_1_1.py

Here is the only manual step in the process. Basically, I take the same reference catalogs as the next step, but run a manual Match Tables pass in TOPCAT using the Sky algorithm with an error margin of 1.5 arcseconds. It's not a huge deal and can be skipped if I want to let the full pipeline run overnight or something, but it's an additional measure to make sure no duplicates slip through.



Step 9: The Actual Big Duplicate Filter:
8_Filter_Conflicting_Stars_V6_9_17.py

So, now we come back to the question of why this step is even necessary. The answer is that the Gaia DR3 catalog is too new. See, they use this thing in astronomy called an Epoch, and the simple explanation is that it tells you when the measurements in a catalog were taken. I haven't been able to 100% confirm that SpaceEngine has a unified Epoch, but the vast majority of the catalogs it pulls from are all J2000, with the exception of HIPPARCOS, which seems to have some slightly-off coordinates that line up closest with the original HIP catalog's J1991.25. This is not the case for Gaia DR3, which is in J2016, and unfortunately, stars don't tend to stand still.
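
For reference, rewinding a position from J2016.0 to J2000.0 using proper motion is just a linear shift. A simplified sketch that ignores radial motion and parallax (note that Gaia's pmRA is already the mu_alpha*cos(dec) form):

# Simplified J2016.0 -> J2000.0 rewind using proper motion alone.
# ra/dec in degrees; pm_ra (= mu_alpha * cos(dec), Gaia's pmRA) and
# pm_dec in mas/yr. Ignores radial velocity and parallax effects.
import numpy as np

def rewind_to_j2000(ra, dec, pm_ra, pm_dec, dt_yr=-16.0):
    mas_to_deg = 1.0 / 3.6e6
    dec_new = dec + pm_dec * dt_yr * mas_to_deg
    ra_new = ra + pm_ra * dt_yr * mas_to_deg / np.cos(np.radians(dec))
    return ra_new, dec_new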

In combination with the earlier-mentioned estimated distances, this is the primary reason SE's automatic duplicate merging doesn't work: the stars in this pack have moved so far from their positions as listed in the SE catalogs that the system no longer considers them to be the same star. This necessitates a conceptually simple but extremely computationally expensive script. This is the other one that requires RAM tracking.

  1. Check if any stars in an Epoch-matched HIPPARCOS catalog are within two arcseconds of a star in the DR3 catalog, in case they didn't have HIP values to use in the prefilter, and delete them if so.
  2. Check if any stars in the reference catalog are within half a parsec of a star in the DR3 catalog; if yes, delete it.
  3. If no stars are in range, check literally every single star inside a 36-arcsecond cone and delete the DR3 star if there's even a single hit. This might seem like overkill, but it was the best I could come up with after finding a star so ridiculously out of place that it sat nearly 30 parsecs from its reference position.

As you can see, this is very silly: it does nothing for a star whose motion is perpendicular to the cone's axis and fast enough to carry it outside the cone, and I would welcome suggestions on how to fix all of this.
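
Here's a rough sketch of checks 2 and 3 from the list above, reusing the cached reference tree from Step 6 (the radii match the list; everything else is simplified, and ref_unit is assumed to be precomputed unit vectors of the reference stars):

# Sketch of duplicate checks 2 and 3. ref_tree is the scipy cKDTree from
# the Step 6 cache; ref_unit holds unit vectors of the reference stars.
import numpy as np

def is_duplicate(star_xyz, ref_tree, ref_unit):
    # Check 2: any reference star within half a parsec in 3D?
    if ref_tree.query_ball_point(star_xyz, r=0.5):
        return True
    # Check 3: any reference star inside a 36-arcsecond cone, at any distance?
    u = star_xyz / np.linalg.norm(star_xyz)
    cos_limit = np.cos(np.radians(36.0 / 3600.0))
    return bool(np.any(ref_unit @ u > cos_limit))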

Current list of reference catalogs as of 06/01/2025:
  • SE_SC_Extracted.csv
  • HIPJ2000.csv
  • HIPPARCOS.csv
  • Orion-Nebula.csv
  • Pleiades.csv
  • 1M-11mag.csv
  • White_Dwarfs_GaiaDR2.csv

Update as of 06/01/2025: Added J2000 HIP locational filter.
Update as of 06/04/2025: Added an explanation of SE's Epoch(s).



Step 10: Put It All Together:
9_Generate_Final_SE_Pack_V3_9_2.py

All that's left is to cut the file down to size so there aren't hundreds of millions of stars to destroy the computer of anyone who tries to run it. That sounds simple, but actually requires a bit of nuance: if the file just gets cut off after a certain number of rows, then depending on how the ordering ended up, it might contain only stars within a certain distance range, stars of a certain spectral type, or stars of a certain magnitude range.

The solution is simple: just randomize the entire file first, while artificially pushing low-population star types through to avoid the entire pack being the same type of star. Of course, being able to limit to a certain range is useful, so I do that first, with an option to leave the range unlimited. This allows future packs to range from extremely dense populations near Sol to extremely sparse populations spanning the entire galaxy (once I get there; I'm getting through about a 25-parsec range slice a day at most). Alternatively, if anyone did feel like incurring the wrath of their computer's machine-spirit, there could also be a hypothetical pack of literally the entire Gaia DR3 catalog, though I'm pretty sure the program would just refuse to load at that point.
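
Conceptually, the randomized cut with the rarity boost looks like this sketch (the inverse-frequency weighting, column name, and target size are illustrative, not the exact scheme I use):

# Sketch of the randomized cut with a boost for rare spectral types.
# The inverse-frequency weighting and names are illustrative only.
import pandas as pd

def cut_catalog(df: pd.DataFrame, target_rows: int) -> pd.DataFrame:
    counts = df["SpecClass"].value_counts()
    # Rarer classes get proportionally higher sampling weight, so the
    # pack isn't entirely one common type of star.
    weights = 1.0 / df["SpecClass"].map(counts)
    return df.sample(n=min(target_rows, len(df)), weights=weights)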

Update as of 06/21/2025: Added artificial population inflation clause.
Delta Ori  [developer] 11 Jul @ 2:26pm 
Concerning O-Class Stars

TL;DR:
There will never be any O-Class stars unless I do a lot of work and use some additional catalogs to implement them.



So basically, I've discovered an issue with the Gaia DR3 values: even the absolute highest temperature from the OB library for a star with an SpType-ELS of O is still only in the upper B-class temperature range. This means that it is literally impossible for any O-class stars to be classified, as they will never meet the temperature requirements.

(EDIT: I've since actually found six with O-type temperatures, but their data was too messy and they all got filtered.)

Not sure why their temperatures are so badly underestimated, but it seems to just be some sort of limitation with Apsis or something.

In order to actually get any in, I will need to crossmatch some Gaia IDs with another catalog (most likely the Alma catalogue of OB stars, Pantaleoni+ 2021) to make sure I'm only selecting stars that are actually O-Class. From there, I'll need to calculate appropriate temperatures for them using other available parameters. At that point I'll either have to figure out a way to splice those back into the main catalog, or just have them be a separate dedicated O-Class star addon pack.

Concerning White Dwarfs

These are in a similar boat, in that they're hypothetically in the catalog, but because of how extreme they are, the data's too noisy to use. Like the O-types, they'll need to be crossmatched, though I'm not sure what I'm going to use for that one yet.