Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

elf.parse: added lazy_parse function #254

Merged
merged 1 commit into from Feb 1, 2021
Merged

Conversation

jessehui
Copy link
Contributor

@jessehui jessehui commented Jan 26, 2021

We are developing a libOS kernel (https://github.com/occlum/occlum) and would like to load ELF as lazy as possible. However, now goblin does all the read and parse in memory so we must read the whole ELF file to kernel memory (we can't mmap the file to memory due to some constraint).

Thus, I came up with this method to load lazily:
read ELF header only -> parse header -> get program header table -> only read in PT_LOAD and PT_INTERP segments

And by adding these two functions, I can achieve this. The implementation can be found here. I would like to see this get merged to master so that people with similar requirements can do this. But I am not sure if it is a common need. If it is, I can spend some time to reduce the repeated code.

Comments are welcome. Thank you.

@m4b
Copy link
Owner

m4b commented Jan 28, 2021

Cool! Yea this seems like a great compromise to get a lazy parsing approach, I like it! Reducing some of the boilerplate into common function would be good, as you said, but i understand that this is just an initial sketch/PR?

What api were you thinking for filling in the rest of the structures, e.g., section headers, etc.?

And thanks for the PR! :)

@jessehui
Copy link
Contributor Author

Thank you for your reply. Glad you are interested. I push a new version to reduce the repeated code, make API more general and add a unit test which can also be an example.

What api were you thinking for filling in the rest of the structures, e.g., section headers, etc.?

With goblin's well-designed APIs for each part (program headers, section headers ...), user can read in the contents they need and parse the specific part with corresponding parse function. And then, they can put it to the dummy Elf struct returned from lazy_parse. I think the unit test can be seen as an example (here).
We will do the parsing similar for the libOS. Just we only care about PT_INTERP and PT_LOAD segments.

You may start the review. Thank you!

Copy link
Owner

@m4b m4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks actually really great! Want some clarification on the new pub functions, and also the new parse_elf_hdr, otherwise I think we can merge this very soon, and thanks again for the PR! I didn't think a lazy elf parser would be so easy;

Oh it might also be good to add a new example showing how to parse out portions in a lazy manner, etc. the test is good, but example filling in other parts of structs, just as proof of concept would be good (i assume the user has all the facilities to do this at the moment?)

The only other concern i have is the "semantics" of a lazy elf parse; i assume it's safe to use, and no unusual things will happen if the user creates an elf struct in the lazy manner (i.e., no other public methods on the struct will panic if called and it was constructed in the lazy manner, yes?)

src/elf/mod.rs Outdated
@@ -202,22 +202,45 @@ if_sylvan! {
pub fn is_object_file(&self) -> bool {
self.header.e_type == header::ET_REL
}

pub fn parse_elf_hdr(bytes: &'a [u8]) -> error::Result<Header> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a pub fn (or a function at all actually?)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to parse_header then since elf is redundant a bit since user would write Elf::parse_elf_hdr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe this function is very likely to be called directly by users who would like to load in a lazy manner. As elf header contains much information and user can just read in the header part with system IO and parse only the header to get the info and can then decide whether to read in other parts.

Renamed. Thanks for your review.

src/elf/mod.rs Outdated
@@ -346,7 +369,7 @@ if_sylvan! {
}
}

fn gnu_hash_len(bytes: &[u8], offset: usize, ctx: Ctx) -> error::Result<usize> {
pub fn gnu_hash_len(bytes: &[u8], offset: usize, ctx: Ctx) -> error::Result<usize> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this need to be pub now? I'd like to not increase api scope unless absolutely necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was writing the test case and was thinking this could be called by users if they want to do the same as the test case.
However, I don't think it is necessary. It is a truly low-level function. Maybe we can mark this pub when there are strong demands. I am okay to keep it private for now.

Changed back to private functions. Done

src/elf/mod.rs Outdated
@@ -379,7 +402,7 @@ if_sylvan! {
}
}

fn hash_len(bytes: &[u8], offset: usize, machine: u16, ctx: Ctx) -> error::Result<usize> {
pub fn hash_len(bytes: &[u8], offset: usize, machine: u16, ctx: Ctx) -> error::Result<usize> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also feels very lowlevel as a pub function; on argument is a u16 machine value, which doesn't seem great for user to think about; if this is needed for some reason, i think we can provide something better here (still not sure why it's needed exactly, hoping you can fill that in)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -388,6 +411,38 @@ if_sylvan! {
};
Ok(nchain)
}

struct Misc {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review.

@vfsfitvnm
Copy link

I second this.
In my specific use case, I just need to parse the elf file header in order to get e_ident[EI_CLASS] and e_machine.
It would be a lot nicer if there was a straightforward way to do that (https://github.com/m4b/goblin/pull/254/files#diff-696bbf0d8b3f75177e0fa931a45c3e85057e2078598ee1cf8040393730b30633R206), instead of include scroll as a direct dependency.

Copy link
Contributor Author

@jessehui jessehui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@m4b Thank you for your review. I changed gnu_hash_len and hash_len back to private functions. I can add an example later.

src/elf/mod.rs Outdated
@@ -202,22 +202,45 @@ if_sylvan! {
pub fn is_object_file(&self) -> bool {
self.header.e_type == header::ET_REL
}

pub fn parse_elf_hdr(bytes: &'a [u8]) -> error::Result<Header> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe this function is very likely to be called directly by users who would like to load in a lazy manner. As elf header contains much information and user can just read in the header part with system IO and parse only the header to get the info and can then decide whether to read in other parts.

Renamed. Thanks for your review.

src/elf/mod.rs Outdated
@@ -346,7 +369,7 @@ if_sylvan! {
}
}

fn gnu_hash_len(bytes: &[u8], offset: usize, ctx: Ctx) -> error::Result<usize> {
pub fn gnu_hash_len(bytes: &[u8], offset: usize, ctx: Ctx) -> error::Result<usize> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was writing the test case and was thinking this could be called by users if they want to do the same as the test case.
However, I don't think it is necessary. It is a truly low-level function. Maybe we can mark this pub when there are strong demands. I am okay to keep it private for now.

Changed back to private functions. Done

src/elf/mod.rs Outdated
@@ -379,7 +402,7 @@ if_sylvan! {
}
}

fn hash_len(bytes: &[u8], offset: usize, machine: u16, ctx: Ctx) -> error::Result<usize> {
pub fn hash_len(bytes: &[u8], offset: usize, machine: u16, ctx: Ctx) -> error::Result<usize> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -388,6 +411,38 @@ if_sylvan! {
};
Ok(nchain)
}

struct Misc {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review.

@jessehui jessehui force-pushed the lazy_parse branch 2 times, most recently from 1ae2c00 to 2c2794a Compare February 1, 2021 04:18
Copy link
Owner

@m4b m4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once doc comment and renamed I think this is ready to merge, thank you for the PR! This is great stuff.
It's also occurred to me in future, we could add e.g., elf.parse_section_headers() which takes &mut self; then the Elf::parse code could do:

  let mut elf = Elf::lazy_parse()?;
  elf.parse_section_headers();
  elf.parse_whatever()?;

and ditto for users of lazy parsing, easier for them to fill out which portions they want without doing the freestanding parse functions filling a field up, etc. just a thought ( definitely don't do that now) :)

src/elf/mod.rs Outdated Show resolved Hide resolved
tests/elf.rs Outdated Show resolved Hide resolved
@jessehui
Copy link
Contributor Author

jessehui commented Feb 1, 2021

It seems that I can't fill in all parts without gnu_hash_len and/or hash_len...
However, I think if users would like to go this far, maybe it is better to call parse instead of lazy_parse as it will need a lot more effort to parse almost everything.
But for users who just need some specific sections or segments, the lazy_parse can bring much convenience. I can write an example to extract the interpreter segments lazily (without reading the whole elf file). What do you think?

src/elf/mod.rs Outdated Show resolved Hide resolved
@m4b
Copy link
Owner

m4b commented Feb 1, 2021

Example getting interpreter would be great! Re gnu hash, let's leave it private for now, that way we don't commit to an api; if someone needs this functionality, we can always make a PR/take a PR and review how best to expose it.

                let mut num_syms = if let Some(gnu_hash) = dyn_info.gnu_hash {
                    gnu_hash_len(bytes, gnu_hash as usize, ctx)?
                } else if let Some(hash) = dyn_info.hash {
                    hash_len(bytes, hash as usize, header.e_machine, ctx)?
                } else {
                    0
                };
                let max_reloc_sym = dynrelas.iter()
                    .chain(dynrels.iter())
                    .chain(pltrelocs.iter())
                    .fold(0, |num, reloc| cmp::max(num, reloc.r_sym));
                if max_reloc_sym != 0 {
                    num_syms = cmp::max(num_syms, max_reloc_sym + 1);
                }
                dynsyms = Symtab::parse(bytes, dyn_info.symtab, num_syms, ctx)?;

as far as i can see, this is only code that uses it; we could just a a helper function that does all this and takes whichever minimal arguments requierd, seems like dyn_info, ctx, bytes, and dyn_relas and header 🤷

For now i think it's better to leave private, as someone can always do as you say and parse whole thing :)

Copy link
Owner

@m4b m4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, thanks so much for this!

@m4b m4b merged commit 8ce726a into m4b:master Feb 1, 2021
@m4b
Copy link
Owner

m4b commented Feb 1, 2021

I can add this change onto the 0.3 branch and do (another :D ) release, 0.3.4 if you like?

@jessehui
Copy link
Contributor Author

jessehui commented Feb 1, 2021

@m4b Thank you for your review. Yeah sure, I was just going to ask whether there will be a recent release with this feature that can be used directly from crates.io?

@m4b
Copy link
Owner

m4b commented Feb 1, 2021

You got it, 0.3.4 is out!

@jessehui
Copy link
Contributor Author

jessehui commented Feb 1, 2021

Great! Thank you very much!

lazy_parse will only generate a dummy Elf struct with only Header.
Users can choose to parse whatever they want and fill the Elf based
on their needs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants